API#

Cihai core functionality.

exception cihai.core.CihaiConfigError[source]#

Bases: CihaiException

cihai.core.is_valid_config(config)[source]#
Return type:

TypeGuard[ConfigDict]

class cihai.core.Cihai(config=None, unihan=True)[source]#

Bases: object

Central application object.

By default, this automatically adds the UNIHAN dataset.

config#
Type:

dict

Notes

Inspired by the early pypa/warehouse application object [1]_.

Configuration templates

The config dict parameter supports a basic template system for replacing XDG Base Directory variables, tildes and environment variables. This is done by passing the option dict through cihai.config.expand_config() during initialization.

Examples

To use cihai programmatically, invoke and install the UNIHAN [2]_ dataset:

#!/usr/bin/env python
import typing as t

from cihai.core import Cihai


def run(unihan_options: t.Optional[t.Dict[str, object]] = None) -> None:
    if unihan_options is None:
        unihan_options = {}
    c = Cihai()

    if not c.unihan.is_bootstrapped:  # download and install Unihan to db
        c.unihan.bootstrap(unihan_options)

    query = c.unihan.lookup_char("好")
    glyph = query.first()

    assert glyph is not None
    print("lookup for 好: %s" % glyph.kDefinition)

    query = c.unihan.reverse_char("good")
    print('matches for "good": %s ' % ", ".join([glph.char for glph in query]))


if __name__ == "__main__":
    run()

In the example above, is_bootstrapped checks whether the UNIHAN database is already installed on the system.

References

Parameters:
  • config (dict, optional) –

  • unihan (boolean, optional) – Bootstrap the core UNIHAN dataset (recommended)

default_config: UntypedDict = {'database': {'url': 'sqlite:///{user_data_dir}/cihai.db'}, 'datasets': {}, 'debug': False, 'dirs': {'cache': PosixPath('/home/runner/.cache/cihai'), 'data': PosixPath('/home/runner/.local/share/cihai'), 'log': PosixPath('/home/runner/.cache/cihai/log')}, 'plugins': {}}#

dict of default config, can be monkey-patched during tests

unihan: Unihan#
config: ConfigDict#
sql: Database#

Database instance

Type:

cihai.db.Database

bootstrap()[source]#
Return type:

None

add_dataset(_cls, namespace)[source]#
Return type:

None

classmethod from_file(config_path, *args, **kwargs)[source]#

Create a Cihai instance from a JSON or YAML config.

Parameters:

config_path (str, optional) – path to custom config file

Returns:

application object

Return type:

Cihai
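As a rough illustration of what a JSON config file for from_file() might contain, the sketch below writes one to a temporary directory. The keys mirror the default config shown above; the file name and the chosen values are assumptions made for the example, and the block does not import cihai.

```python
import json
import tempfile
from pathlib import Path

# Keys mirror the documented default config; the values are illustrative only.
custom_config = {
    "debug": True,
    "database": {"url": "sqlite:///{user_data_dir}/cihai.db"},
}

# Hypothetical file name inside a throwaway directory.
config_path = Path(tempfile.mkdtemp()) / "cihai.json"
config_path.write_text(json.dumps(custom_config, indent=2), encoding="utf-8")

# Read the file back to confirm it is valid JSON.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded["database"]["url"])
```

With cihai installed, the resulting path could then be passed as config_path, e.g. Cihai.from_file(str(config_path)).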

Configuration#

cihai.config.expand_config(d, dirs=<appdirs.AppDirs object>)[source]#

Expand configuration XDG variables, environmental variables, and tildes.

Return type:

None

Parameters:
  • d (dict) – config information

  • dirs (appdirs.AppDirs) – XDG application mapping

Notes

Environment variables are expanded via os.path.expandvars(). So ${PWD} would be replaced by the current PWD in the shell, and ${USER} by the user running the app.

XDG variables are expanded via str.format(). These do not have a dollar sign. They are:

  • {user_cache_dir}

  • {user_config_dir}

  • {user_data_dir}

  • {user_log_dir}

  • {site_config_dir}

  • {site_data_dir}
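The expansion rules above can be sketched with the standard library alone. This is a minimal illustration, not cihai's implementation: expand_value, expand_config_sketch, the XDG_DIRS paths and the CIHAI_DEMO_NAME variable are all hypothetical names introduced for the example. Environment variables are expanded before str.format() runs so that the braces in ${VAR} are not mistaken for format fields.

```python
import os

# Hypothetical stand-ins for the directories an AppDirs object would supply.
XDG_DIRS = {
    "user_cache_dir": "/home/user/.cache/cihai",
    "user_data_dir": "/home/user/.local/share/cihai",
}


def expand_value(value: str) -> str:
    """Expand environment variables, a leading tilde, then XDG placeholders."""
    value = os.path.expandvars(value)   # ${VAR} -> value of the variable
    value = os.path.expanduser(value)   # leading ~ -> home directory
    return value.format(**XDG_DIRS)     # {user_data_dir} -> directory path


def expand_config_sketch(d: dict) -> None:
    """Walk a nested dict and expand every string value in place."""
    for key, value in d.items():
        if isinstance(value, dict):
            expand_config_sketch(value)
        elif isinstance(value, str):
            d[key] = expand_value(value)


os.environ["CIHAI_DEMO_NAME"] = "demo"  # assumption: variable set for the demo
config = {
    "debug": False,
    "database": {"url": "sqlite:///{user_data_dir}/${CIHAI_DEMO_NAME}.db"},
}
expand_config_sketch(config)
print(config["database"]["url"])
```

Like the documented function, the sketch returns None and mutates the dict it is given.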

Database#

Cihai core functionality.

class cihai.db.Database(config)[source]#

Bases: object

Cihai SQLAlchemy instance

base: AutomapBase#

sqlalchemy.ext.automap.AutomapBase instance.

engine: Engine#

sqlalchemy.engine.Engine instance.

metadata: MetaData#

sqlalchemy.schema.MetaData instance.

session: Session#

sqlalchemy.orm.session.Session instance.

reflect_db()[source]#

No-op to reflect db info.

This is available as a method so the database can be reflected outside initialization (such as when bootstrapping UNIHAN during CLI usage).

Return type:

None

Extending#

Cihai Plugin System

Status: Experimental, API can change

As a pilot, the UNIHAN library and a plugin for it were implemented in #131 [1]_.

You can bring any data layout / backend you like to cihai.

For convenience, you can use cihai’s configuration namespace and SQLAlchemy settings.

You can also create plugins that extend an existing dataset. For example, if Unihan doesn’t have a lookup for variant glyphs, a plugin can add one.

class cihai.extend.ConfigMixin[source]#

Bases: object

This piggybacks on cihai’s global config state, as well as your datasets.

Cihai will automatically manage the user’s config, as well as your datasets, neatly in XDG.

Raises:

CihaiDatasetConfigException (CihaiDatasetException) – Functions inside this mixin, and any code you write relating to dataset config, should raise this exception.

Notes

config.cihai links directly back to Cihai’s configuration dictionary (todo: make this a non-mutable property).

  • config (dict) – your local user’s config

  • check() (function, optional) – this is run on start; it can raise DatasetConfigException

  • default_config – your dataset’s default configuration

  • get_default_config – override function in case you’d like custom configs (for instance, if you want a platform to use a different db driver, or do version checks, etc.); internal functions use get_default_config()

class cihai.extend.SQLAlchemyMixin[source]#

Bases: object

Your dataset can use any backend you’d like. We provide a backend for you that automatically piggybacks on cihai’s zero-config XDG / SQLAlchemy configuration, so it’s preconfigured for the user.

In addition, this mixin gives you access to any of the user’s other SQLAlchemy-backed datasets that use this mixin. So if you want a dataset that utilizes UNIHAN, you can access it easily.

This provides the following instance-level properties in methods: sql, engine, metadata, session and base.

When you have access, you are expected to keep your tables / databases namespaced so they don’t clobber each other.

sql: Database#
engine: Engine#

sqlalchemy.engine.Engine instance.

metadata: MetaData#

sqlalchemy.schema.MetaData instance.

session: Session#

sqlalchemy.orm.session.Session instance.

base: AutomapBase#

sqlalchemy.ext.automap.AutomapBase instance.

class cihai.extend.Dataset[source]#

Bases: object

Cihai dataset, e.g. UNIHAN.

See also

cihai.data.unihan.dataset.Unihan

reference implementation

bootstrap()[source]#
Return type:

None

add_plugin(_cls, namespace, bootstrap=True)[source]#
Return type:

None

class cihai.extend.DatasetPlugin[source]#

Bases: object

Extend the functionality of datasets with custom methods, actions, etc.

See also

cihai.data.unihan.dataset.UnihanVariants

reference implementation
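The registration pattern behind add_plugin(_cls, namespace, bootstrap=True) can be pictured with a toy model. The sketch below does not import cihai: ToyDataset and ToyPlugin are hypothetical stand-ins, and the behavior shown (instantiate the plugin, expose it as an attribute named after the namespace, optionally bootstrap it) is an assumption suggested by the way the earlier example reaches the UNIHAN dataset as c.unihan.

```python
class ToyPlugin:
    """Hypothetical stand-in for a DatasetPlugin subclass."""

    def __init__(self) -> None:
        self.bootstrapped = False

    def bootstrap(self) -> None:
        # A real plugin might create tables or load data here.
        self.bootstrapped = True

    def shout(self, text: str) -> str:
        # Custom method contributed by the plugin.
        return text.upper()


class ToyDataset:
    """Hypothetical stand-in for a Dataset subclass."""

    def add_plugin(self, _cls: type, namespace: str, bootstrap: bool = True) -> None:
        # Assumption: the plugin is instantiated and attached under `namespace`.
        plugin = _cls()
        setattr(self, namespace, plugin)
        if bootstrap:
            plugin.bootstrap()


dataset = ToyDataset()
dataset.add_plugin(ToyPlugin, namespace="toy")
print(dataset.toy.shout("hao"))
```

Keeping each plugin behind its own namespace attribute is what lets several plugins extend one dataset without their method names colliding.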

Constants#

cihai.constants.app_dirs = <appdirs.AppDirs object>#

XDG App directory locations

cihai.constants.DEFAULT_CONFIG: UntypedDict = {'database': {'url': 'sqlite:///{user_data_dir}/cihai.db'}, 'datasets': {}, 'debug': False, 'dirs': {'cache': PosixPath('/home/runner/.cache/cihai'), 'data': PosixPath('/home/runner/.local/share/cihai'), 'log': PosixPath('/home/runner/.cache/cihai/log')}, 'plugins': {}}#

Default configuration

cihai.constants.UNIHAN_CONFIG: UntypedDict = {'datasets': {'unihan': 'cihai.data.unihan.dataset.Unihan'}}#

User will be prompted to automatically configure their installation for UNIHAN

UNIHAN Dataset#

Bootstrapping#

cihai.data.unihan.bootstrap.bootstrap_unihan(engine, metadata, options=None)[source]#
Return type:

None

cihai.data.unihan.bootstrap.is_bootstrapped(metadata)[source]#

Return True if cihai is correctly bootstrapped.

Return type:

bool

cihai.data.unihan.bootstrap.create_unihan_table(columns, metadata)[source]#

Create table and return sqlalchemy.Table.

Return type:

Table

Parameters:
Returns:

Newly created table with columns and index.

Return type:

sqlalchemy.schema.Table

class cihai.data.unihan.dataset.Unihan[source]#

Bases: Dataset, SQLAlchemyMixin

char: str#
kDefinition: str#
kTraditionalVariant: str#
kSimplifiedVariant: str#
tagged_vars: Callable[[str], ParsedVars]#
untagged_vars: Callable[[str], UntaggedVars]#
bootstrap(options=None)[source]#
Return type:

None

lookup_char(char)[source]#

Return character information from datasets.

Return type:

Query

Parameters:

char (str) – character / string to lookup

Returns:

list of matches

Return type:

sqlalchemy.orm.query.Query

reverse_char(hints)[source]#

Return a SQLAlchemy Query of objects matching the given hints.

Return type:

Query

Parameters:

hints (list of str) – strings to lookup

Returns:

reverse matches

Return type:

sqlalchemy.orm.query.Query

with_fields(fields)[source]#

Returns list of characters with information for certain fields.

Return type:

Query

Parameters:

*fields (list of str) – fields for which information should be available

Returns:

list of matches

Return type:

sqlalchemy.orm.query.Query
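To give a feel for the three query helpers, here is a standalone sketch that uses sqlite3 from the standard library instead of SQLAlchemy. The table rows are toy data rather than real UNIHAN content, the *_sketch functions are hypothetical, and treating the reverse lookup as a substring match on the definition is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Unihan (char TEXT PRIMARY KEY, kDefinition TEXT, kMandarin TEXT)"
)
# Toy rows for illustration only.
conn.executemany(
    "INSERT INTO Unihan VALUES (?, ?, ?)",
    [("好", "good", "hǎo"), ("一", "one", None)],
)

ALLOWED_FIELDS = {"kDefinition", "kMandarin"}


def lookup_char_sketch(char: str) -> list:
    """Exact match on the character column."""
    sql = "SELECT char, kDefinition FROM Unihan WHERE char = ?"
    return conn.execute(sql, (char,)).fetchall()


def reverse_char_sketch(hint: str) -> list:
    """Assumed behavior: substring match against the definition."""
    sql = "SELECT char FROM Unihan WHERE kDefinition LIKE ?"
    return conn.execute(sql, (f"%{hint}%",)).fetchall()


def with_fields_sketch(field: str) -> list:
    """Characters that have a value for the given field."""
    if field not in ALLOWED_FIELDS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown field: {field}")
    sql = f"SELECT char FROM Unihan WHERE {field} IS NOT NULL"
    return conn.execute(sql).fetchall()


print(lookup_char_sketch("好"))
print(reverse_char_sketch("good"))
print(with_fields_sketch("kMandarin"))
```

The documented methods return a sqlalchemy.orm.query.Query rather than a list, so results there can be refined further or fetched with .first(), as in the earlier example.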

property is_bootstrapped: bool#

Return True if UNIHAN and the database are set up.

Returns:

True if Unihan application fixture data installed.

Return type:

bool

add_plugin(_cls, namespace, bootstrap=True)[source]#
Return type:

None

sql: Database#
engine: Engine#

sqlalchemy.engine.Engine instance.

metadata: MetaData#

sqlalchemy.schema.MetaData instance.

session: Session#

sqlalchemy.orm.session.Session instance.

base: AutomapBase#

sqlalchemy.ext.automap.AutomapBase instance.

cihai.data.unihan.constants.UNIHAN_FILES = ['Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt']#

Mapping of files from unihan-etl (UNIHAN database)

cihai.data.unihan.constants.UNIHAN_FIELDS: List[str] = ['kAccountingNumeric', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCihaiT', 'kCompatibilityVariant', 'kDefinition', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kZVariant']#

Mapping of field names from unihan-etl (UNIHAN database)

cihai.data.unihan.constants.UNIHAN_ETL_DEFAULT_OPTIONS = {'expand': False, 'fields': ['kAccountingNumeric', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCihaiT', 'kCompatibilityVariant', 'kDefinition', 'kFenn', 'kFourCornerCode', 'kFrequency', 'kGradeLevel', 'kHDZRadBreak', 'kHKGlyph', 'kHangul', 'kHanyuPinlu', 'kHanyuPinyin', 'kJapaneseKun', 'kJapaneseOn', 'kKorean', 'kMandarin', 'kOtherNumeric', 'kPhonetic', 'kPrimaryNumeric', 'kRSAdobe_Japan1_6', 'kRSJapanese', 'kRSKanWa', 'kRSKangXi', 'kRSKorean', 'kRSUnicode', 'kSemanticVariant', 'kSimplifiedVariant', 'kSpecializedSemanticVariant', 'kTang', 'kTotalStrokes', 'kTraditionalVariant', 'kVietnamese', 'kXHC1983', 'kZVariant'], 'format': 'python', 'input_files': ['Unihan_DictionaryLikeData.txt', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', 'Unihan_Variants.txt']}#

Default settings passed to unihan-etl

Variants plugin#

class cihai.data.unihan.dataset.UnihanVariants[source]#

Bases: DatasetPlugin, SQLAlchemyMixin

bootstrap()[source]#
Return type:

None

sql: Database#
engine: Engine#

sqlalchemy.engine.Engine instance.

metadata: MetaData#

sqlalchemy.schema.MetaData instance.

session: Session#

sqlalchemy.orm.session.Session instance.

base: AutomapBase#

sqlalchemy.ext.automap.AutomapBase instance.

Conversion#

cihai.conversion.euc_to_unicode(hexstr)[source]#

Convert EUC-CN (GB2312) hex to a Python unicode string.

Return type:

str

Parameters:

hexstr (bytes) –

Returns:

Python unicode e.g. u'\\u4e00' / ‘一’.

Return type:

unicode

Examples

>>> u'\u4e00'.encode('gb2312').decode('utf-8')
'\u04bb'
>>> (b'\\x' + b'd2' + b'\\x' + b'bb').replace('\\x', '') \  
... .decode('hex').decode('utf-8')
u'\u04bb'

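Note: the hex-decoding idioms shown here are Python 2 only. On Python 3, bytes.replace() requires bytes arguments, and the 'hex' and 'string_escape' codecs are not available through decode():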

>>> gb_enc = gb_enc.replace('\\x', '').decode('hex')  
>>> gb_enc.decode('string_escape')  # Won't work with Python 3.x.  

cihai.conversion.euc_to_utf8(euchex)[source]#

Convert EUC hex (e.g. “d2bb”) to UTF8 hex (e.g. “e4 b8 80”).

Return type:

str
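On Python 3 the same conversions can be expressed with bytes.fromhex() and the built-in gb2312 codec. The helpers below are an illustrative sketch with hypothetical names, not the cihai.conversion functions themselves.

```python
def euc_hex_to_str(euc_hex: str) -> str:
    """Decode EUC-CN hex such as 'd2bb' into a Python string."""
    return bytes.fromhex(euc_hex).decode("gb2312")


def euc_hex_to_utf8_hex(euc_hex: str) -> str:
    """Re-encode the decoded character as space-separated UTF-8 hex."""
    utf8_bytes = euc_hex_to_str(euc_hex).encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in utf8_bytes)


print(euc_hex_to_str("d2bb"))
print(euc_hex_to_utf8_hex("d2bb"))
```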

cihai.conversion.gb2312_to_euc(gb2312hex)[source]#

Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”)

Return type:

bytes

cihai.conversion.kuten_to_gb2312(kuten)[source]#

Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex (internal representation)

Return type:

bytes
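The arithmetic linking the three GB2312 forms is small enough to show directly: adding 0x20 to the zone and point gives the GB2312-1980 bytes, and adding a further 0x80 to each byte gives EUC-CN. The sketch below assumes kuten is passed as a four-digit string and returns hex strings; both choices, and the function names, are assumptions for illustration rather than the library's signatures.

```python
def kuten_to_gb2312_hex(kuten: str) -> str:
    """Convert a four-digit kuten string (zone, point) to GB2312 hex."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    return f"{zone + 0x20:02x}{point + 0x20:02x}"


def gb2312_hex_to_euc_hex(gb2312_hex: str) -> str:
    """Set the high bit of both bytes to get EUC-CN hex."""
    high, low = int(gb2312_hex[:2], 16), int(gb2312_hex[2:], 16)
    return f"{high + 0x80:02x}{low + 0x80:02x}"


gb_hex = kuten_to_gb2312_hex("5027")       # zone 50, point 27
euc_hex = gb2312_hex_to_euc_hex(gb_hex)
print(gb_hex, euc_hex, bytes.fromhex(euc_hex).decode("gb2312"))
```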

cihai.conversion.python_to_euc(uni_char, as_bytes=False)[source]#

Return EUC character from a Python Unicode character.

Converts a one character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex ('d2bb').

Return type:

Union[bytes, str]

cihai.conversion.python_to_ucn(uni_char, as_bytes=False)[source]#

Return UCN character from Python Unicode character.

Converts a one character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN ('U+4E00').

Return type:

Union[bytes, str]

cihai.conversion.ucn_to_unicode(ucn)[source]#

Convert a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00').

Return type:

str

cihai.conversion.ucnstring_to_python(ucn_string)[source]#

Convert a string containing Unicode UCNs (e.g. “U+4E00”) to native Python Unicode (u'\u4e00').

Return type:

bytes

cihai.conversion.ucnstring_to_unicode(ucn_string)[source]#

Return ucnstring as Unicode.

Return type:

str
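The UCN conversions reduce to ord(), chr() and a regular expression. The following sketch uses hypothetical helper names and is meant only to illustrate the idea; the pattern greedily accepts four to six hex digits after "U+", which is a simplifying assumption.

```python
import re

UCN_PATTERN = re.compile(r"U\+[0-9A-Fa-f]{4,6}")


def char_to_ucn(char: str) -> str:
    """Format a single character as a UCN such as 'U+4E00'."""
    return f"U+{ord(char):04X}"


def ucn_to_char(ucn: str) -> str:
    """Accept 'U+4E00' or bare '4E00' and return the character."""
    hex_part = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_part, 16))


def ucn_string_to_text(text: str) -> str:
    """Replace every UCN found in a longer string with its character."""
    return UCN_PATTERN.sub(lambda match: ucn_to_char(match.group(0)), text)


print(char_to_ucn("一"))
print(ucn_to_char("4E00"))
print(ucn_string_to_text("U+4E00 and U+597D"))
```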

Exceptions#

When using cihai via Python, you can catch Cihai-specific exceptions via these. All Cihai-specific exceptions are catchable via CihaiException since it’s the base exception.

Exceptions raised from the Cihai library.

exception cihai.exc.CihaiException[source]#

Bases: Exception

Base Cihai Exception class.

add_note()#

Exception.add_note(note) – add a note to the exception

args#
with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception cihai.exc.ImportStringError(import_name, exception)[source]#

Bases: ImportError, CihaiException

Provides information about a failed import_string() attempt.

Notes

This is from werkzeug.utils d36aaf1 on August 20 2022, LICENSE BSD. https://github.com/pallets/werkzeug

Changes:

  • Deferred load import of import_string from cihai.utils

  • Format with black

import_name: str#

String in dotted notation that failed to be imported.

add_note()#

Exception.add_note(note) – add a note to the exception

args#
msg#

exception message

name#

module name

path#

module path

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception: BaseException#

Wrapped exception.
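Because ImportStringError lists both ImportError and CihaiException as bases, a handler for either type will catch it. The sketch below reproduces that hierarchy with stand-in classes so it runs without cihai installed; the stand-ins use the default exception constructor rather than the documented (import_name, exception) signature.

```python
class CihaiException(Exception):
    """Stand-in for cihai.exc.CihaiException."""


class ImportStringError(ImportError, CihaiException):
    """Stand-in mirroring the documented bases."""


def failing_import() -> None:
    # Hypothetical failure used to exercise the handlers.
    raise ImportStringError("could not import 'no.such.module'")


caught_by = []
for exc_type in (CihaiException, ImportError):
    try:
        failing_import()
    except exc_type:
        caught_by.append(exc_type.__name__)

print(caught_by)
```

With the real library, the import would be from cihai.exc import CihaiException, and a single except CihaiException clause would cover every Cihai-specific error.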

Utilities#

Utility and helper methods for cihai.

cihai.utils.supports_wide()[source]#

Return affirmative if python interpreter supports wide characters.

Returns:

True if python supports wide character sets

Return type:

bool
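A check of this kind is commonly written against sys.maxunicode. The one-liner below is a sketch under that assumption, with a hypothetical function name; on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the result is True.

```python
import sys


def supports_wide_sketch() -> bool:
    """True when the interpreter can represent code points above U+FFFF."""
    return sys.maxunicode > 0xFFFF


print(supports_wide_sketch())
```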

cihai.utils.import_string(import_name, silent=False)[source]#

Imports an object based on a string.

This is useful if you want to use import paths as endpoints or something similar. An import path can be specified either in dotted notation (xml.sax.saxutils.escape) or with a colon as object delimiter (xml.sax.saxutils:escape).

If silent is True the return value will be None if the import fails.

Return type:

Any

Parameters:
  • import_name (string) – the dotted name for the object to import.

  • silent (bool) – if set to True import errors are ignored and None is returned instead.

Returns:

imported object

Raises:

cihai.exc.ImportStringError (ImportError, cihai.exc.CihaiException)

Notes

This is from werkzeug.utils d36aaf1 on May 23, 2022, LICENSE BSD. https://github.com/pallets/werkzeug

Changes:

  • Exception raised is cihai.exc.ImportStringError

  • Format with black
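The dotted and colon notations described above can be handled with importlib. This is a simplified sketch with a hypothetical name, not the werkzeug-derived implementation: it re-raises the original error instead of wrapping it in ImportStringError.

```python
import importlib
from typing import Any


def import_string_sketch(import_name: str, silent: bool = False) -> Any:
    """Import an object from 'pkg.module.attr' or 'pkg.module:attr'."""
    dotted = import_name.replace(":", ".")
    module_name, _, attr = dotted.rpartition(".")
    try:
        if not module_name:
            # No dot at all: the name refers to a top-level module.
            return importlib.import_module(attr)
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError):
        if silent:
            return None
        raise


escape = import_string_sketch("xml.sax.saxutils:escape")
print(escape("<tag>"))
print(import_string_sketch("no_such_module_xyz.attr", silent=True))
```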