Conversion - cihai.conversion
Conversion functions for various CJK encodings and representations.
Notes
Original methods and docs are based upon ltchinese (MIT license, Steven Kryskalla).
Added in version 0.1: Python 2/3 compatibility.
- PEP8, PEP257.
- int() casting for comparisons.
- Python 3 support.
- Python 3 fix for ucn_to_python().
- Python 3 __future__ statements.
- All methods converting to _python will return Unicode.
- All methods converting Unicode to x will return bytestring.
- Add ucnstring_to_python().
- Any other change upon @ conversion.py @9227813.
The following terms are used to represent the encodings / representation used in the conversion functions (the samples on the right are for the character U+4E00 (yi1; “one”)):
Encoding / representation | Sample for U+4E00
GB2312 (Kuten/Quwei form) | “5027” [used in the “GB2312” field of Unihan.txt]
GB2312 (ISO-2022 form) | “523B” [the “internal representation” of GB code]
EUC-CN |
UTF-8 | “E4 B8 80” [used in the “UTF-8” field in Unihan.txt]
Unihan UCN | “U+4E00” [used by Unicode Inc.]
internal Python unicode | u"\u4e00" [this is the most useful form!]
internal Python ‘utf8’ | “\xe4\xb8\x80”
internal Python ‘gb2312’ | “\xd2\xbb”
internal Python ‘euc-cn’ | “\xd2\xbb”
internal Python ‘gb18030’ | “\xd2\xbb”
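The byte-level samples for U+4E00 can be checked with the Python standard library alone. The snippet below is a minimal sketch that does not use cihai; the variable names are illustrative only.

```python
# Check the byte-level samples for U+4E00 using only the standard library.
char = "\u4e00"

utf8_bytes = char.encode("utf-8")
gb2312_bytes = char.encode("gb2312")
euc_cn_bytes = char.encode("euc-cn")    # 'euc-cn' is an alias of the gb2312 codec
gb18030_bytes = char.encode("gb18030")

# Unihan-style UCN built from the code point.
ucn = "U+{:04X}".format(ord(char))

print(utf8_bytes, gb2312_bytes, euc_cn_bytes, gb18030_bytes, ucn)
```

For this character the three GB-family codecs produce the same two bytes, which is why those rows share one sample.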
See these resources for more information:
- Wikipedia “Extended_Unix_Code” article: “EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters … the ISO-2022 form of GB2312 is not normally used”
- Wikipedia “HZ_(encoding)” article (the example conversion)
- Wikipedia “Numeric_character_reference” article
- Unihan (look for “Encoding forms”, “Mappings to Major Standards”)
- cihai.conversion.hexd(n)
Return hex digits (strip ‘0x’ at the beginning).
Examples
>>> hexd(19968)
'4e00'
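One way to get the same result without slicing off the prefix is `format()`. This is a sketch only; `hexd_sketch` is a hypothetical name and not necessarily how cihai implements `hexd`.

```python
def hexd_sketch(n: int) -> str:
    """Hypothetical stand-in for hexd: lowercase hex digits, no '0x' prefix."""
    return format(n, "x")


print(hexd_sketch(19968))
```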
- cihai.conversion.kuten_to_gb2312(kuten)
Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex.
Examples
>>> kuten_to_gb2312("5027")
b'523b'
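The arithmetic behind this conversion is small: the zone and point are two-digit decimal numbers, and adding 0x20 to each gives the ISO-2022 bytes. The sketch below assumes that standard offset; `kuten_to_gb2312_sketch` is a hypothetical helper that returns a str rather than the bytestring shown in the example above.

```python
def kuten_to_gb2312_sketch(kuten: str) -> str:
    """Hypothetical helper: kuten 'ZZPP' (decimal) -> GB2312 ISO-2022 hex."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    # Zones and points run 1..94; adding 0x20 maps them to 0x21..0x7E.
    return "{:02x}{:02x}".format(zone + 0x20, point + 0x20)


print(kuten_to_gb2312_sketch("5027"))
```

For "5027", zone 50 becomes 0x52 and point 27 becomes 0x3B.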
- cihai.conversion.gb2312_to_euc(gb2312hex)
Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”).
Examples
>>> gb2312_to_euc("30A1")
b'b0121'
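The EUC-CN form sets the high bit of each GB2312 byte, which amounts to adding 0x80 to both. The sketch below illustrates that step for an input whose bytes lie in the GB2312 range 0x21 to 0x7E; `gb2312_to_euc_sketch` is a hypothetical helper that returns a str and does not validate its input.

```python
def gb2312_to_euc_sketch(gb2312hex: str) -> str:
    """Hypothetical helper: GB2312 ISO-2022 hex -> EUC-CN hex."""
    hi, lo = int(gb2312hex[:2], 16), int(gb2312hex[2:], 16)
    # Setting the high bit (adding 0x80) moves 0x21..0x7E to 0xA1..0xFE.
    return "{:02x}{:02x}".format(hi + 0x80, lo + 0x80)


euc_hex = gb2312_to_euc_sketch("523B")
print(euc_hex)
# The result can be decoded with the standard gb2312 codec.
print(bytes.fromhex(euc_hex).decode("gb2312"))
```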
- cihai.conversion.euc_to_python(hexstr)
Convert an EUC-CN (GB2312) hex to a Python unicode string.
Examples
>>> euc_to_python(b"A1A4")
'\\xA1\\xA4'
>>> euc_to_python(b"3041")
'\\x30\\x41'
- cihai.conversion.euc_to_utf8(euchex)
Convert EUC hex (e.g. b"d2bb") to UTF8 hex (e.g. “e4 b8 80”).
Examples
>>> euc_to_utf8(b"d2bb")
'匯'
>>> euc_to_utf8(b"A4A6")
'う'
- cihai.conversion.ucn_to_unicode(ucn)
Convert Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Unicode.
Examples
>>> ucn_to_unicode("U+4E00")
'一'
>>> ucn_to_unicode("4E00")
'一'
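A UCN is a hexadecimal code point with an optional "U+" prefix, so the conversion reduces to `int(..., 16)` followed by `chr()`. The following is a sketch under that assumption; `ucn_to_unicode_sketch` is a hypothetical name, not the library function.

```python
def ucn_to_unicode_sketch(ucn: str) -> str:
    """Hypothetical helper: 'U+4E00' or '4E00' -> the corresponding character."""
    # Drop the optional 'U+' prefix, then parse the rest as a hex code point.
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


print(ucn_to_unicode_sketch("U+4E00"))
print(ucn_to_unicode_sketch("4E00"))
```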
- cihai.conversion.euc_to_unicode(hexstr)
Convert EUC-CN (GB2312) hex to a Python unicode string.
- Parameters:
hexstr (bytes)
- Returns:
Python unicode, e.g. u'\u4e00' / ‘一’.
- Return type:
unicode
Examples
>>> u'\u4e00'.encode('gb2312').decode('utf-8')
'\u04bb'
>>> (b'\\x' + b'd2' + b'\\x' + b'bb').replace('\\x', '') \
... .decode('hex').decode('utf-8')
u'\u04bb'
Note: on Python 3, bytes.replace() only accepts bytes arguments, and the 'hex' and 'string_escape' codecs are not available through .decode(), so the following Python 2 idiom does not work:
>>> gb_enc = gb_enc.replace('\\x', '').decode('hex')
>>> gb_enc.decode('string_escape')  # Won't work with Python 3.x.
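On Python 3, the hex-to-character step can be written with `bytes.fromhex()` and the standard `gb2312` codec. This is a sketch of one way to do it and not necessarily what `euc_to_unicode` does internally; `euc_hex_to_unicode_sketch` is a hypothetical name.

```python
def euc_hex_to_unicode_sketch(hexstr: bytes) -> str:
    """Hypothetical helper: EUC-CN hex such as b'd2bb' -> a Python 3 str."""
    # Turn the ASCII hex digits into raw bytes, then decode them as GB2312.
    raw = bytes.fromhex(hexstr.decode("ascii"))
    return raw.decode("gb2312")


print(euc_hex_to_unicode_sketch(b"d2bb"))
```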
- cihai.conversion.python_to_ucn(uni_char, as_bytes=False)
Return UCN character from Python Unicode character.
Converts a one character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN (‘U+4E00’).
Examples
>>> python_to_ucn(u'\\u4e00')
'U+4E00'
>>> python_to_ucn('一')
'U+4E00'
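The reverse direction is `ord()` plus uppercase hex formatting. The sketch below covers a single character and leaves out the `as_bytes` option; `python_to_ucn_sketch` is a hypothetical name.

```python
def python_to_ucn_sketch(uni_char: str) -> str:
    """Hypothetical helper: one character -> its 'U+XXXX' notation."""
    # At least four uppercase hex digits, as in Unihan's notation.
    return "U+{:04X}".format(ord(uni_char))


print(python_to_ucn_sketch("\u4e00"))
```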
- cihai.conversion.python_to_euc(uni_char, as_bytes=False)
Return EUC character from a Python Unicode character.
Converts a one character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex (‘d2bb’).
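With the standard library, the same mapping can be sketched by encoding the character with the `gb2312` codec and hex-formatting the bytes. `python_to_euc_sketch` is a hypothetical name, and the `as_bytes` option is left out.

```python
def python_to_euc_sketch(uni_char: str) -> str:
    """Hypothetical helper: one character -> its EUC-CN hex digits."""
    # bytes.hex() returns lowercase hex digits for the encoded bytes.
    return uni_char.encode("gb2312").hex()


print(python_to_euc_sketch("\u4e00"))
```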
- cihai.conversion.ucnstring_to_unicode(ucn_string)
Return ucnstring as Unicode.
Examples
>>> ucnstring_to_unicode('U+7A69')
'穩'
- cihai.conversion.ucnstring_to_python(ucn_string)
Convert a string containing Unicode UCNs (e.g. “U+4E00”) to native Python Unicode (u'\u4e00').
Examples
>>> ucnstring_to_python('U+7A69')
b'\xe7\xa9\xa9'
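Replacing every UCN inside a longer string can be sketched with a regular expression. The pattern and the name `ucnstring_to_unicode_sketch` are assumptions made for illustration and may differ from what cihai does. Encoding the result as UTF-8 gives the bytestring shown in the example above.

```python
import re

# Assumed pattern: 'U+' followed by 4 to 6 hex digits.
_UCN_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def ucnstring_to_unicode_sketch(ucn_string: str) -> str:
    """Hypothetical helper: replace each 'U+XXXX' in a string with its character."""
    return _UCN_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), ucn_string)


text = ucnstring_to_unicode_sketch("U+7A69")
print(text)
print(text.encode("utf-8"))
```

The match is greedy, so hex digits that directly follow a UCN are read as part of it. Separate UCNs from adjacent text with a non-hex character.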
- cihai.conversion.parse_var(var)
Return a tuple consisting of a string and a tag; the tag is None if none is specified.
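The sketch below shows the string-and-tag split under a loud assumption: that the tag is separated from the string by "<", as in Unihan variant fields. The text above does not state the separator, and the sample input is made up for illustration. `parse_var_sketch` is a hypothetical name, and it returns the string part unchanged.

```python
from typing import Optional, Tuple


def parse_var_sketch(var: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'value<tag' into (value, tag or None)."""
    # ASSUMPTION: '<' separates the value from an optional tag.
    head, sep, tag = var.partition("<")
    return head, (tag if sep else None)


print(parse_var_sketch("U+5B78<kMatthews"))
print(parse_var_sketch("U+5B78"))
```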