Conversion - cihai.conversion

Conversion functions for various CJK encodings and representations.

Notes

Original methods and docs are based on ltchinese (MIT license, Steven Kryskalla).

New in version 0.1: Python 2/3 compatibility.

  • PEP8, PEP257.

  • int() casting for comparisons

  • Python 3 support.

  • Python 3 fix for ucn_to_python().

  • Python 3 __future__ statements.

  • All methods converting to _python will return Unicode.

  • All methods converting Unicode to x will return bytestring.

  • Add ucnstring_to_python()

  • Any other changes relative to conversion.py @9227813.

The following terms are used for the encodings and representations handled by the conversion functions (the sample listed under each term is for the character U+4E00 (yi1; “one”)):

GB2312 (Kuten/Quwei form)

“5027” [used in the “GB2312” field of Unihan.txt]

GB2312 (ISO-2022 form)

“523B” [the “internal representation” of GB code]

EUC-CN

“D2BB” [this is the “external encoding” of GB2312-ISO2022’s “internal representation”; also the form that Ocrat uses]

UTF-8

“E4 B8 80” [used in the “UTF-8” field in Unihan.txt]

Unihan UCN

“U+4E00” [used by Unicode Inc.]

internal Python unicode

u"\u4e00" [this is the most useful form!]

internal Python ‘utf8’

“\xe4\xb8\x80”

internal Python ‘gb2312’

“\xd2\xbb”

internal Python ‘euc-cn’

“\xd2\xbb”

internal Python ‘gb18030’

“\xd2\xbb”
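
The Python-side forms listed above can be reproduced with the standard library alone. The following is a minimal sketch, not part of cihai: it assumes only the built-in codecs, and the variable names are chosen for illustration.

```python
# Stdlib-only illustration; no cihai functions are called here.
char = "\u4e00"  # the character 一 (yi1, "one")

# Unihan UCN form, built from the code point.
ucn = "U+%04X" % ord(char)

# Python byte strings produced by the built-in codecs.
utf8_bytes = char.encode("utf-8")
gb2312_bytes = char.encode("gb2312")
euc_cn_bytes = char.encode("euc-cn")  # "euc-cn" is an alias of the gb2312 codec
gb18030_bytes = char.encode("gb18030")

# Hex spellings as they appear in the list above.
utf8_hex = " ".join("%02X" % b for b in utf8_bytes)
euc_hex = gb2312_bytes.hex().upper()

print(ucn)       # U+4E00
print(utf8_hex)  # E4 B8 80
print(euc_hex)   # D2BB
```

For this character the gb2312, euc-cn and gb18030 codecs all produce the same two bytes, which is why the three samples in the list are identical.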

See these resources for more information:
  • Wikipedia “Extended_Unix_Code” article

    • “EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters … the ISO-2022 form of GB2312 is not normally used”

  • Wikipedia “HZ_(encoding)” article (the example conversion)

  • Wikipedia “Numeric_character_reference” article

  • Unihan (look for “Encoding forms”, “Mappings to Major Standards”)

cihai.conversion.hexd(n)

Return hex digits (strip ‘0x’ at the beginning).

Return type:

str

Examples

>>> hexd(19968)
'4e00'
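
A stand-in with the same observable behaviour can be sketched with format(). The name hex_digits is hypothetical, and this is not necessarily how hexd() is implemented.

```python
def hex_digits(n: int) -> str:
    """Hypothetical stand-in for hexd(): lowercase hex digits without '0x'."""
    # format(n, "x") never emits the "0x" prefix, so nothing has to be stripped.
    return format(n, "x")


print(hex_digits(19968))  # 4e00
```

Using format() rather than slicing the result of hex() also keeps the sign intact for negative input, since hex(-1)[2:] would yield "x1".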

cihai.conversion.kuten_to_gb2312(kuten)

Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex.

Return type:

bytes

Examples

>>> kuten_to_gb2312("5027")
b'523b'
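
The arithmetic behind this conversion is an offset of 0x20 on both the zone and the point. A minimal sketch follows, assuming a four-digit decimal kuten string; the function name is hypothetical, and it returns str where the documented function returns bytes.

```python
def kuten_to_iso2022_hex(kuten: str) -> str:
    """Hypothetical sketch: kuten/quwei "ZZPP" -> ISO-2022 hex digits."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    # GB2312 is laid out as 94 zones of 94 points, both numbered from 1.
    if not (1 <= zone <= 94 and 1 <= point <= 94):
        raise ValueError("zone and point must each be in 1..94")
    # Adding 0x20 moves each value into the printable range 0x21..0x7E.
    return "%02x%02x" % (zone + 0x20, point + 0x20)


print(kuten_to_iso2022_hex("5027"))  # 523b
```

The range check is a choice made for this sketch; the source does not say whether kuten_to_gb2312() validates its input.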

cihai.conversion.gb2312_to_euc(gb2312hex)

Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”).

Return type:

bytes

Examples

>>> gb2312_to_euc("30A1")
b'b0121'
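
Going from the ISO-2022 form to EUC-CN amounts to setting the high bit of each byte (equivalently, adding 0x80). The sketch below assumes a two-byte hex string as input; the function name and the range check are illustrative, not part of cihai.

```python
def iso2022_to_euc_hex(gb2312hex: str) -> str:
    """Hypothetical sketch: ISO-2022 GB2312 hex -> EUC-CN hex digits."""
    raw = bytes.fromhex(gb2312hex)
    # Valid ISO-2022 GB2312 bytes lie in 0x21..0x7E (see the kuten offset).
    if len(raw) != 2 or any(not 0x21 <= b <= 0x7E for b in raw):
        raise ValueError("expected two bytes in the range 0x21..0x7E")
    # Setting the high bit maps 0x21..0x7E onto 0xA1..0xFE.
    return bytes(b | 0x80 for b in raw).hex()


print(iso2022_to_euc_hex("523B"))  # d2bb
```

The documented example passes "30A1", whose second byte 0xA1 lies outside that range; if the function simply adds 0x80 to each byte, that would explain the five-digit result.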

cihai.conversion.euc_to_python(hexstr)

Convert an EUC-CN (GB2312) hex to a Python unicode string.

Return type:

str

Examples

>>> euc_to_python(b"A1A4")
'\\xA1\\xA4'
>>> euc_to_python(b"3041")
'\\x30\\x41'

cihai.conversion.euc_to_utf8(euchex)

Convert EUC hex (e.g. b"d2bb") to UTF8 hex (e.g. “e4 b8 80”).

Return type:

str

Examples

>>> euc_to_utf8(b"d2bb")
'匯'
>>> euc_to_utf8(b"A4A6")
'う'

cihai.conversion.ucn_to_unicode(ucn)

Convert Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Unicode.

Return type:

str

Examples

>>> ucn_to_unicode("U+4E00")
'一'
>>> ucn_to_unicode("4E00")
'一'
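
Both accepted spellings reduce to parsing a hexadecimal code point. A minimal sketch under that assumption follows; ucn_to_char is a hypothetical name, not the library function.

```python
def ucn_to_char(ucn: str) -> str:
    """Hypothetical sketch: "U+4E00" or "4E00" -> the character itself."""
    # Drop the optional "U+" prefix, then read the rest as a hex code point.
    hex_part = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_part, 16))


print(ucn_to_char("U+4E00"))  # 一
print(ucn_to_char("4E00"))    # 一
```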

cihai.conversion.euc_to_unicode(hexstr)

Convert EUC-CN (GB2312) hex to a Python unicode string.

Return type:

str

Parameters:

hexstr (bytes) – EUC-CN (GB2312) hex string.

Returns:

Python unicode, e.g. u'\u4e00' / ‘一’.

Examples

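>>> u'\u4e00'.encode('gb2312').decode('utf-8')
'\u04bb'
>>> (b'\\x' + b'd2' + b'\\x' + b'bb').replace('\\x', '') \
... .decode('hex').decode('utf-8')
u'\u04bb'

Note: the second example above is Python 2 only. In Python 3, bytes.replace() requires bytes arguments, and neither the 'hex' nor the 'string_escape' codec is available through bytes.decode():

>>> gb_enc = gb_enc.replace('\\x', '').decode('hex')
>>> gb_enc.decode('string_escape')  # Won't work with Python 3.x.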
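
For Python 3, decoding EUC-CN hex can be sketched with bytes.fromhex() and the built-in gb2312 codec. This is an illustration under those assumptions, not the library's implementation; euc_hex_to_char is a hypothetical name.

```python
def euc_hex_to_char(hexstr: bytes) -> str:
    """Hypothetical sketch: EUC-CN hex such as b"d2bb" -> the character."""
    # bytes.fromhex() takes a str, so the ASCII hex digits are decoded first.
    raw = bytes.fromhex(hexstr.decode("ascii"))
    return raw.decode("gb2312")


# In Python 3, bytes.replace() works when both arguments are bytes.
escaped = b"\\xd2\\xbb"  # the literal characters \xd2\xbb
stripped = escaped.replace(b"\\x", b"")

print(stripped)                   # b'd2bb'
print(euc_hex_to_char(stripped))  # 一
```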

cihai.conversion.python_to_ucn(uni_char, as_bytes=False)

Return UCN character from Python Unicode character.

Converts a one character Python unicode string (e.g. u’\u4e00’) to the corresponding Unicode UCN (‘U+4E00’).

Return type:

Union[bytes, str]

Examples

>>> python_to_ucn(u'\u4e00')
'U+4E00'
>>> python_to_ucn('一')
'U+4E00'
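
The reverse direction is a matter of formatting the code point as upper-case hex. The sketch below is illustrative only: char_to_ucn is a hypothetical name, and it assumes that as_bytes merely switches the result between str and ASCII-encoded bytes.

```python
from typing import Union


def char_to_ucn(uni_char: str, as_bytes: bool = False) -> Union[bytes, str]:
    """Hypothetical sketch: one character -> "U+XXXX"."""
    # ord() yields the code point; %04X pads it to at least four hex digits.
    ucn = "U+%04X" % ord(uni_char)
    # Assumed meaning of as_bytes: return ASCII bytes instead of str.
    return ucn.encode("ascii") if as_bytes else ucn


print(char_to_ucn("\u4e00"))                 # U+4E00
print(char_to_ucn("\u4e00", as_bytes=True))  # b'U+4E00'
```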

cihai.conversion.python_to_euc(uni_char, as_bytes=False)

Return EUC character from a Python Unicode character.

Converts a one character Python unicode string (e.g. u’\u4e00’) to the corresponding EUC hex (‘d2bb’).

Return type:

Union[bytes, str]
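
This entry has no example. A minimal sketch of the conversion it describes follows, assuming the built-in gb2312 codec and assuming that as_bytes only switches between str and bytes output; char_to_euc_hex is a hypothetical name, not the library function.

```python
from typing import Union


def char_to_euc_hex(uni_char: str, as_bytes: bool = False) -> Union[bytes, str]:
    """Hypothetical sketch: one character -> EUC-CN hex digits."""
    # Encode with the gb2312 codec, then spell the bytes as lowercase hex.
    euc = uni_char.encode("gb2312").hex()
    # Assumed meaning of as_bytes: return ASCII bytes instead of str.
    return euc.encode("ascii") if as_bytes else euc


print(char_to_euc_hex("\u4e00"))                 # d2bb
print(char_to_euc_hex("\u4e00", as_bytes=True))  # b'd2bb'
```

Characters outside GB2312 raise UnicodeEncodeError in this sketch; how python_to_euc() handles them is not stated in the source.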

cihai.conversion.ucnstring_to_unicode(ucn_string)

Return ucnstring as Unicode.

Return type:

str

Examples

>>> ucnstring_to_unicode('U+7A69')
'穩'

cihai.conversion.ucnstring_to_python(ucn_string)

Return a string containing Unicode UCNs (e.g. “U+4E00”) converted to native Python Unicode (u'\u4e00'), encoded as UTF-8 bytes.

Return type:

bytes

Examples

>>> ucnstring_to_python('U+7A69')
b'\xe7\xa9\xa9'
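
The two ucnstring_* entries above differ only in whether the result is returned as str or as UTF-8 bytes, judging by their examples. Replacing every UCN inside a longer string can be sketched with a regular expression. The pattern and the name replace_ucns are assumptions made for illustration, and the sketch assumes each UCN is followed by a non-hex character or the end of the string.

```python
import re

# Four to six hex digits after "U+" (an assumed pattern, not taken from cihai).
_UCN_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def replace_ucns(text: str) -> str:
    """Hypothetical sketch: replace each U+XXXX in text with its character."""
    return _UCN_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), text)


as_text = replace_ucns("U+7A69")
as_utf8 = as_text.encode("utf-8")

print(as_text)  # 穩
print(as_utf8)  # b'\xe7\xa9\xa9'
print(replace_ucns("U+4E00 and U+4E8C"))  # 一 and 二
```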

cihai.conversion.parse_var(var)

Return a tuple consisting of a string and a tag; the tag is None if none is specified.

Return type:

Tuple[str, Optional[str]]

cihai.conversion.parse_vars(_vars)

Return an iterator of (char, tag) tuples.

Return type:

Generator[Tuple[str, Optional[str]], str, None]

cihai.conversion.parse_untagged(_vars)

Return an iterator of chars.

Return type:

Iterator[Any]
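
The three parsing helpers have no examples. The sketch below shows one way such helpers could fit together, assuming the input follows the Unihan variant-field convention of space-separated entries of the form "U+XXXX" or "U+XXXX<tag". That input format, the function names, and the sample data are all assumptions made for illustration; none of them is taken from the entries above.

```python
from typing import Iterator, Optional, Tuple


def split_variant(var: str) -> Tuple[str, Optional[str]]:
    """Hypothetical sketch: "U+4E8C<kMatthews" -> ("二", "kMatthews")."""
    ucn, sep, tag = var.partition("<")
    char = chr(int(ucn[2:], 16))  # strip the "U+" prefix, parse as hex
    # partition() returns an empty separator when no "<" is present.
    return char, (tag if sep else None)


def iter_variants(field: str) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (char, tag) for each whitespace-separated entry."""
    for var in field.split():
        yield split_variant(var)


def iter_untagged(field: str) -> Iterator[str]:
    """Yield only the characters, discarding any tags."""
    return (char for char, _tag in iter_variants(field))


sample = "U+4E8C<kMatthews U+4E00"  # made-up sample data
print(list(iter_variants(sample)))  # [('二', 'kMatthews'), ('一', None)]
print(list(iter_untagged(sample)))  # ['二', '一']
```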