API

Cihai core functionality.

class cihai.core.Cihai(config=None, unihan=True)[source]

Central application object.

By default, this automatically adds the UNIHAN dataset.

config

dict

Notes

Inspired by the early pypa/warehouse application object [2].

Configuration templates

The config dict parameter supports a basic template system for replacing XDG Base Directory variables, tildes and environment variables. This is done by passing the config dict through cihai.config.expand_config() during initialization.

Examples

To use cihai programmatically, invoke and install the UNIHAN [1] dataset:

#!/usr/bin/env python
# -*- coding: utf8 -*-
from __future__ import print_function, unicode_literals

from cihai.core import Cihai


def run(unihan_options=None):  # None avoids a shared mutable default
    c = Cihai()

    if not c.unihan.is_bootstrapped:  # download and install Unihan to db
        c.unihan.bootstrap(unihan_options or {})

    query = c.unihan.lookup_char('好')
    glyph = query.first()
    print("lookup for 好: %s" % glyph.kDefinition)

    query = c.unihan.reverse_char('good')
    print('matches for "good": %s ' % ', '.join([glph.char for glph in query]))


if __name__ == '__main__':
    run()

In the example above, is_bootstrapped checks whether the system already has the UNIHAN database installed.

References

[1] UNICODE HAN DATABASE (UNIHAN) documentation. https://www.unicode.org/reports/tr38/. Accessed March 31st, 2018.
[2] PyPA Warehouse on GitHub. https://github.com/pypa/warehouse. Accessed sometime in 2013.
classmethod from_file(config_path=None, *args, **kwargs)[source]

Create a Cihai instance from a JSON or YAML config.

Parameters:config_path (str, optional) – path to custom config file
Returns:application object
Return type:Cihai
config = None

Configuration dictionary

default_config = {u'database': {u'url': u'sqlite:///{user_data_dir}/cihai.db'}, u'debug': False, u'dirs': {u'cache': u'{user_cache_dir}', u'data': u'{user_data_dir}', u'log': u'{user_log_dir}'}}

dict of default config, can be monkey-patched during tests

sql = None

cihai.db.Database – Database instance

Configuration

cihai.config.expand_config(d, dirs)[source]

Expand configuration XDG variables, environment variables, and tildes.

Parameters:
  • d (dict) – config information
  • dirs (appdirs.AppDirs) – XDG application mapping

Notes

Environment variables are expanded via os.path.expandvars(), so ${PWD} is replaced by the current working directory of the shell and ${USER} by the user running the app.

XDG variables are expanded via str.format(). These do not have a dollar sign. They are:

  • {user_cache_dir}
  • {user_config_dir}
  • {user_data_dir}
  • {user_log_dir}
  • {site_config_dir}
  • {site_data_dir}
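The two expansion passes described above can be sketched with the standard library alone. This is a minimal illustration, not cihai's implementation: the function names, the plain dict standing in for the appdirs.AppDirs object, the directory paths, and the CIHAI_DEMO_SUFFIX variable are all assumptions made for the example. Environment variables and tildes are expanded first, because str.format() would otherwise try to treat ${VAR} as a replacement field.

```python
import os


def expand_value_sketch(value, xdg_dirs):
    """Expand env vars and tildes first, then XDG-style {placeholders}."""
    expanded = os.path.expanduser(os.path.expandvars(value))
    return expanded.format(**xdg_dirs)


def expand_config_sketch(config, xdg_dirs):
    """Return a copy of config with every string value expanded."""
    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            result[key] = expand_config_sketch(value, xdg_dirs)
        elif isinstance(value, str):
            result[key] = expand_value_sketch(value, xdg_dirs)
        else:
            result[key] = value  # booleans, numbers, etc. pass through
    return result


# Hypothetical directories standing in for the appdirs.AppDirs attributes.
xdg_dirs = {
    "user_data_dir": "/home/demo/.local/share/cihai",
    "user_cache_dir": "/home/demo/.cache/cihai",
}

# Hypothetical environment variable, set here so the example is repeatable.
os.environ["CIHAI_DEMO_SUFFIX"] = "test"

config = {
    "debug": False,
    "database": {"url": "sqlite:///{user_data_dir}/cihai-${CIHAI_DEMO_SUFFIX}.db"},
    "dirs": {"cache": "{user_cache_dir}"},
}

expanded = expand_config_sketch(config, xdg_dirs)
print(expanded["database"]["url"])
```

An environment variable that is not set is left untouched by os.path.expandvars(), so in this sketch its braces would then be seen by str.format() and raise KeyError.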

Database

Cihai core functionality.

class cihai.db.Database(config)[source]

Cihai SQLAlchemy instance

reflect_db()[source]

No-op to reflect db info.

This is available as a method so the database can be reflected outside initialization (such as when bootstrapping UNIHAN during CLI usage).

base = None

sqlalchemy.ext.automap.AutomapBase instance.

engine = None

sqlalchemy.engine.Engine instance.

metadata = None

sqlalchemy.schema.MetaData instance.

session = None

sqlalchemy.orm.session.Session instance.

Extending

Cihai Plugin System

Status: Experimental, API can change

The pilot for this system is the UNIHAN library, and a plugin for it, in #131 [1].

You can bring any data layout / backend you like to cihai.

For convenience, you can use cihai’s configuration namespace and SQLAlchemy settings.

You can also create plugins that extend another dataset. For example, if Unihan doesn’t have a lookup for variant glyphs, a plugin can add one.

class cihai.extend.ConfigMixin[source]

This piggybacks on cihai’s global config state, as well as your datasets.

Cihai will automatically manage the user’s config, as well as your datasets, neatly in XDG.

Raises:
  • CihaiDatasetConfigException (CihaiDatasetException) – functions inside this mixin, and anything you write relating to dataset config, should raise this.

The mixin works with the following names:

  • config.cihai – links directly back to Cihai’s configuration dictionary (todo note: make this a non-mutable property)

  • config : dict – your local user’s config

  • check() : function, optional – this is run on start. It can raise DatasetConfigException

  • default_config – your dataset’s default configuration

  • get_default_config – override this function in case you’d like custom configs (for instance, if you want a platform to use a different db driver, or do version checks, etc.). Internal functions use get_default_config().
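How default_config, get_default_config() and check() could fit together is sketched below. This is an illustration under stated assumptions, not cihai.extend.ConfigMixin itself: ConfigSketchMixin, DatasetConfigError, DemoDataset, and the load_config() helper are hypothetical names invented for the example.

```python
import copy


class DatasetConfigError(Exception):
    """Hypothetical stand-in for cihai's dataset config exception."""


class ConfigSketchMixin:
    """Illustrative only; not the real cihai.extend.ConfigMixin."""

    default_config = {}

    def get_default_config(self):
        # Overridable hook: return a fresh copy so defaults are never mutated.
        return copy.deepcopy(self.default_config)

    def load_config(self, user_config=None):
        # Hypothetical helper: layer the user's settings over the defaults,
        # then run the optional validation hook.
        config = self.get_default_config()
        config.update(user_config or {})
        self.config = config
        self.check()

    def check(self):
        """Optional hook run on start; override to validate self.config."""


class DemoDataset(ConfigSketchMixin):
    default_config = {"driver": "sqlite", "min_version": 1}

    def check(self):
        if self.config["driver"] not in ("sqlite", "postgresql"):
            raise DatasetConfigError(
                "unsupported driver: %s" % self.config["driver"]
            )


dataset = DemoDataset()
dataset.load_config({"min_version": 2})
print(dataset.config)
```

Routing every read of the defaults through get_default_config() is what lets a subclass swap in platform-specific values without touching the rest of the class.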

class cihai.extend.Dataset[source]

Cihai dataset, e.g. UNIHAN.

See also

cihai.data.unihan.dataset.Unihan
reference implementation
class cihai.extend.DatasetPlugin[source]

Extend the functionality of datasets with custom methods, actions, etc.

See also

cihai.data.unihan.dataset.UnihanVariants
reference implementation
class cihai.extend.SQLAlchemyMixin[source]

Your dataset can use any backend you’d like. For convenience, this mixin provides one that automatically piggybacks on cihai’s zero-config XDG / SQLAlchemy configuration, so it is preconfigured for the user.

In addition, this mixin gives you access to the SQLAlchemy data of any of the user’s other datasets that use this mixin. So if you want a dataset that utilizes UNIHAN, you can access that easily.

Because that access is shared, you are expected to keep your tables / databases namespaced so they don’t clobber one another.

This mixin provides the following instance-level properties in methods:

base = None

sqlalchemy.ext.automap.AutomapBase instance.

engine = None

sqlalchemy.engine.Engine instance.

metadata = None

sqlalchemy.schema.MetaData instance.

session = None

sqlalchemy.orm.session.Session instance.

Constants

cihai.constants.DEFAULT_CONFIG = {u'database': {u'url': u'sqlite:///{user_data_dir}/cihai.db'}, u'debug': False, u'dirs': {u'cache': u'{user_cache_dir}', u'data': u'{user_data_dir}', u'log': u'{user_log_dir}'}}

Default configuration

cihai.constants.UNIHAN_CONFIG = {u'datasets': {u'unihan': u'cihai.data.unihan.dataset.Unihan'}}

User will be prompted to automatically configure their installation for UNIHAN

UNIHAN Dataset

Bootstrapping

cihai.data.unihan.bootstrap.bootstrap_unihan(metadata, options={})[source]

Download, extract and import unihan to database.

cihai.data.unihan.bootstrap.create_unihan_table(columns, metadata)[source]

Create table and return sqlalchemy.Table.

Parameters:
  • columns
  • metadata
Returns:Newly created table with columns and index.
Return type:sqlalchemy.schema.Table

cihai.data.unihan.bootstrap.is_bootstrapped(metadata)[source]

Return True if cihai is correctly bootstrapped.

class cihai.data.unihan.dataset.Unihan[source]

Bases: cihai.extend.Dataset, cihai.extend.SQLAlchemyMixin

lookup_char(char)[source]

Return character information from datasets.

Parameters:char (str) – character / string to lookup
Returns:list of matches
Return type:sqlalchemy.orm.query.Query
reverse_char(hints)[source]

Return a SQLAlchemy query of reverse-lookup results.

Parameters:hints (list of str) – strings to lookup
Returns:reverse matches
Return type:sqlalchemy.orm.query.Query
with_fields(*fields)[source]

Returns list of characters with information for certain fields.

Parameters:*fields (list of str) – fields for which information should be available
Returns:list of matches
Return type:sqlalchemy.orm.query.Query
is_bootstrapped

Return True if UNIHAN and the database are set up.

Returns:True if Unihan application fixture data is installed.
Return type:bool
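To make the intent of the three query methods above concrete, here is a rough sketch using the standard library's sqlite3 module instead of SQLAlchemy. Everything in it is an assumption made for illustration: the unihan_demo table, its placeholder rows, the *_sketch function names, and in particular the guess that a reverse lookup is a substring match on a definition column. The real methods return sqlalchemy.orm.query.Query objects, not lists.

```python
import sqlite3

# In-memory demo table; rows are placeholders, not real UNIHAN data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unihan_demo "
    "(char TEXT PRIMARY KEY, kDefinition TEXT, kCantonese TEXT)"
)
conn.executemany(
    "INSERT INTO unihan_demo VALUES (?, ?, ?)",
    [
        ("好", "good; well", "demo-a"),
        ("一", "one", None),
        ("佳", "good; fine", "demo-b"),
    ],
)

ALLOWED_FIELDS = ("kDefinition", "kCantonese")


def lookup_char_sketch(char):
    """Rows for one character (stand-in for a lookup by character)."""
    sql = "SELECT char, kDefinition FROM unihan_demo WHERE char = ?"
    return conn.execute(sql, (char,)).fetchall()


def reverse_char_sketch(hint):
    """Characters whose definition contains hint (assumed semantics)."""
    sql = (
        "SELECT char FROM unihan_demo "
        "WHERE kDefinition LIKE ? ORDER BY char"
    )
    return [row[0] for row in conn.execute(sql, ("%" + hint + "%",))]


def with_fields_sketch(*fields):
    """Characters that have a value for every requested field."""
    unknown = [field for field in fields if field not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError("unknown fields: %s" % ", ".join(unknown))
    # Column names come from the allow-list above, never from user input.
    where = " AND ".join("%s IS NOT NULL" % field for field in fields) or "1"
    sql = "SELECT char FROM unihan_demo WHERE %s ORDER BY char" % where
    return [row[0] for row in conn.execute(sql)]


print(lookup_char_sketch("好"))
print(reverse_char_sketch("good"))
print(with_fields_sketch("kCantonese"))
```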
cihai.data.unihan.constants.UNIHAN_ETL_DEFAULT_OPTIONS = {u'expand': False, u'fields': [u'kAccountingNumeric', u'kCangjie', u'kCantonese', u'kCheungBauer', u'kCihaiT', u'kCompatibilityVariant', u'kDefinition', u'kFenn', u'kFourCornerCode', u'kFrequency', u'kGradeLevel', u'kHDZRadBreak', u'kHKGlyph', u'kHangul', u'kHanyuPinlu', u'kHanyuPinyin', u'kJapaneseKun', u'kJapaneseOn', u'kKorean', u'kMandarin', u'kOtherNumeric', u'kPhonetic', u'kPrimaryNumeric', u'kRSAdobe_Japan1_6', u'kRSJapanese', u'kRSKanWa', u'kRSKangXi', u'kRSKorean', u'kRSUnicode', u'kSemanticVariant', u'kSimplifiedVariant', u'kSpecializedSemanticVariant', u'kTang', u'kTotalStrokes', u'kTraditionalVariant', u'kVietnamese', u'kXHC1983', u'kZVariant'], u'format': u'python', u'input_files': [u'Unihan_DictionaryLikeData.txt', u'Unihan_IRGSources.txt', u'Unihan_NumericValues.txt', u'Unihan_RadicalStrokeCounts.txt', u'Unihan_Readings.txt', u'Unihan_Variants.txt']}

Default settings passed to unihan-etl

cihai.data.unihan.constants.UNIHAN_FIELDS = [u'kAccountingNumeric', u'kCangjie', u'kCantonese', u'kCheungBauer', u'kCihaiT', u'kCompatibilityVariant', u'kDefinition', u'kFenn', u'kFourCornerCode', u'kFrequency', u'kGradeLevel', u'kHDZRadBreak', u'kHKGlyph', u'kHangul', u'kHanyuPinlu', u'kHanyuPinyin', u'kJapaneseKun', u'kJapaneseOn', u'kKorean', u'kMandarin', u'kOtherNumeric', u'kPhonetic', u'kPrimaryNumeric', u'kRSAdobe_Japan1_6', u'kRSJapanese', u'kRSKanWa', u'kRSKangXi', u'kRSKorean', u'kRSUnicode', u'kSemanticVariant', u'kSimplifiedVariant', u'kSpecializedSemanticVariant', u'kTang', u'kTotalStrokes', u'kTraditionalVariant', u'kVietnamese', u'kXHC1983', u'kZVariant']

Mapping of field names from unihan-etl (UNIHAN database)

cihai.data.unihan.constants.UNIHAN_FILES = [u'Unihan_DictionaryLikeData.txt', u'Unihan_IRGSources.txt', u'Unihan_NumericValues.txt', u'Unihan_RadicalStrokeCounts.txt', u'Unihan_Readings.txt', u'Unihan_Variants.txt']

Mapping of files from unihan-etl (UNIHAN database)

Variants plugin

class cihai.data.unihan.dataset.UnihanVariants[source]

Bases: cihai.extend.DatasetPlugin

Conversion

cihai.conversion.euc_to_unicode(hexstr)[source]

Convert EUC-CN (GB2312) hex to a Python unicode string.

Parameters:hexstr (bytes) –
Returns:Python unicode e.g. u'\u4e00' / ‘一’.
Return type:unicode

Examples

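>>> u'一'.encode('gb2312').decode('utf-8')
u'\u04bb'
>>> (b'\\x' + b'd2' + b'\\x' + b'bb').replace('\\x', '') \
... .decode('hex').decode('utf-8')
u'\u04bb'

Note: these examples are Python 2 only. In Python 3, bytes.replace() requires bytes arguments, and the 'hex' and 'string_escape' codecs are not available through .decode():

>>> gb_enc = gb_enc.replace('\\x', '').decode('hex')
>>> gb_enc.decode('string_escape')  # Won't work with Python 3.x.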
cihai.conversion.euc_to_utf8(euchex)[source]

Convert EUC hex (e.g. “d2bb”) to UTF8 hex (e.g. “e4 b8 80”).
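For readers on Python 3, the same conversions can be reproduced with the built-in gb2312 codec. The sketch below is not cihai's code; the function names are made up for the example, and it assumes the input is a string of hex digits such as "d2bb".

```python
def euc_hex_to_text_sketch(euc_hex):
    """Decode EUC-CN hex digits (e.g. 'd2bb') to a Python 3 str."""
    return bytes.fromhex(euc_hex).decode("gb2312")


def euc_hex_to_utf8_hex_sketch(euc_hex):
    """Re-encode the decoded character as space-separated UTF-8 hex."""
    utf8_bytes = euc_hex_to_text_sketch(euc_hex).encode("utf-8")
    return " ".join("%02x" % byte for byte in utf8_bytes)


print(euc_hex_to_text_sketch("d2bb"))      # 一
print(euc_hex_to_utf8_hex_sketch("d2bb"))  # e4 b8 80
```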

cihai.conversion.gb2312_to_euc(gb2312hex)[source]

Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”)

cihai.conversion.kuten_to_gb2312(kuten)[source]

Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex (internal representation)
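The arithmetic behind these two conversions is small enough to show directly: each kuten byte is offset by 0x20 to reach the GB2312 internal form, and each GB2312 byte is offset by 0x80 to reach EUC-CN. The sketch below is an illustration only. The function names are hypothetical, and it assumes kuten is passed as a four-digit string (two digits for the zone, two for the point), which may differ from what cihai accepts.

```python
def kuten_to_gb2312_sketch(kuten):
    """'5027' (zone 50, point 27) -> '523b'."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    if not (1 <= zone <= 94 and 1 <= point <= 94):
        raise ValueError("zone and point must each be in 1..94")
    return "%02x%02x" % (zone + 0x20, point + 0x20)


def gb2312_to_euc_sketch(gb2312_hex):
    """'523b' -> 'd2bb' by setting the high bit of both bytes."""
    high, low = int(gb2312_hex[:2], 16), int(gb2312_hex[2:], 16)
    return "%02x%02x" % (high + 0x80, low + 0x80)


gb = kuten_to_gb2312_sketch("5027")
euc = gb2312_to_euc_sketch(gb)
print(gb, euc)  # 523b d2bb

# Cross-check against Python's own codec: d2bb is the EUC-CN form of 一.
print(bytes.fromhex(euc).decode("gb2312"))
```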

cihai.conversion.python_to_euc(uni_char, as_bytes=False)[source]

Return EUC character from a Python Unicode character.

Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex ('d2bb').

cihai.conversion.python_to_ucn(uni_char, as_bytes=False)[source]

Return UCN character from Python Unicode character.

Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN ('U+4E00').

cihai.conversion.ucn_to_unicode(ucn)[source]

Convert a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00').

cihai.conversion.ucnstring_to_python(ucn_string)[source]

Return string with Unicode UCN (e.g. “U+4E00”) converted to native Python Unicode (u'\u4e00').

cihai.conversion.ucnstring_to_unicode(ucn_string)[source]

Return ucnstring as Unicode.
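The UCN helpers above amount to moving between a code point and its "U+XXXX" spelling. A Python 3 sketch of that idea follows. It is not cihai's implementation: the function names are invented, and the regular expression (four to six hex digits after "U+") is an assumption that could over-match if hex-like letters directly follow a UCN.

```python
import re


def python_to_ucn_sketch(char):
    """'一' -> 'U+4E00'."""
    return "U+%04X" % ord(char)


def ucn_to_unicode_sketch(ucn):
    """'U+4E00' or '4E00' -> '一'."""
    hex_digits = ucn[2:] if ucn.upper().startswith("U+") else ucn
    return chr(int(hex_digits, 16))


def ucnstring_to_python_sketch(text):
    """Replace every U+XXXX occurrence in text with its character."""
    return re.sub(
        r"U\+([0-9A-Fa-f]{4,6})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


print(python_to_ucn_sketch("一"))
print(ucn_to_unicode_sketch("4E00"))
print(ucnstring_to_python_sketch("U+4E00 and U+597D"))
```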

Exceptions

When using cihai from Python, you can catch Cihai-specific errors with the exceptions below. All of them can be caught via CihaiException, since it is the base exception.

Exceptions raised from the Cihai library.

exception cihai.exc.CihaiException[source]

Bases: exceptions.Exception

Base Cihai Exception class.

exception cihai.exc.ImportStringError(import_name, exception)[source]

Bases: exceptions.ImportError, cihai.exc.CihaiException

Provides information about a failed import_string() attempt.

Notes

This is from werkzeug.utils c769200 on May 23, LICENSE BSD. https://github.com/pallets/werkzeug

Changes:

  • Deferred-load import of import_string from cihai.utils
  • Format with black

exception = None

Wrapped exception.

import_name = None

String in dotted notation that failed to be imported.
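The practical effect of the base classes listed above is that one except clause can cover every Cihai-specific error, while ImportStringError also remains catchable as a plain ImportError. The sketch below mirrors the documented hierarchy with local stand-in classes so it runs without cihai installed; the constructor body, the load_sketch() helper, and the error message are assumptions.

```python
class CihaiException(Exception):
    """Stand-in mirroring the documented base exception."""


class ImportStringError(ImportError, CihaiException):
    """Stand-in mirroring the documented bases and attributes."""

    def __init__(self, import_name, exception):
        self.import_name = import_name
        self.exception = exception
        super().__init__("could not import %r: %s" % (import_name, exception))


def load_sketch(import_name):
    """Hypothetical caller that handles any Cihai-specific error."""
    try:
        # Simulate a failed import so the handler below is exercised.
        raise ImportStringError(import_name, ValueError("demo failure"))
    except CihaiException as error:
        return "caught %s for %s" % (type(error).__name__, error.import_name)


print(load_sketch("pkg.missing"))
```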

Utilities

Utility and helper methods for cihai.

cihai.utils.import_string(import_name, silent=False)[source]

Imports an object based on a string.

This is useful if you want to use import paths as endpoints or something similar. An import path can be specified either in dotted notation (xml.sax.saxutils.escape) or with a colon as object delimiter (xml.sax.saxutils:escape).

If silent is True the return value will be None if the import fails.

Parameters:
  • import_name (string) – the dotted name for the object to import.
  • silent (bool) – if set to True import errors are ignored and None is returned instead.
Returns:imported object
Raises:cihai.exc.ImportStringError (ImportError, cihai.exc.CihaiException)

Notes

This is from werkzeug.utils c769200 on May 23, LICENSE BSD. https://github.com/pallets/werkzeug

Changes:

  • Exception raised is cihai.exc.ImportStringError
  • Add NOQA C901 to avoid complexity lint
  • Format with black
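The dotted and colon notations described for import_string can be illustrated with importlib. This is a simplified sketch, not the werkzeug-derived code: import_string_sketch is a hypothetical name, and on failure it re-raises the original ImportError or AttributeError rather than wrapping it in ImportStringError.

```python
import importlib


def import_string_sketch(import_name, silent=False):
    """Import 'pkg.mod.attr' or 'pkg.mod:attr'; None on failure if silent."""
    dotted = import_name.replace(":", ".")
    try:
        try:
            # The whole path may itself be a module.
            return importlib.import_module(dotted)
        except ImportError:
            # Otherwise treat the last component as an attribute.
            module_name, _, attr = dotted.rpartition(".")
            if not module_name:
                raise
            return getattr(importlib.import_module(module_name), attr)
    except (ImportError, AttributeError):
        if silent:
            return None
        raise


escape = import_string_sketch("xml.sax.saxutils:escape")
print(escape("<tag>"))
print(import_string_sketch("no_such_module_xyz.attr", silent=True))
```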

cihai.utils.merge_dict(base, additional)[source]

Combine two dictionary-like objects.

Notes

Code from https://github.com/pypa/warehouse Copyright 2013 Donald Stufft

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
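As an illustration of what combining two dictionary-like objects can mean for nested configuration, here is a minimal recursive merge. It is a sketch under assumptions, not the licensed code referenced above: merge_dict_sketch is a hypothetical name, values from additional are assumed to win on conflict, and unlike an in-place merge it copies each level so the inputs are left unchanged.

```python
from collections.abc import Mapping


def merge_dict_sketch(base, additional):
    """Recursively overlay additional on top of base."""
    if base is None:
        return additional
    if additional is None:
        return base
    if not (isinstance(base, Mapping) and isinstance(additional, Mapping)):
        return additional  # non-mapping values are simply replaced

    merged = dict(base)  # shallow copy of this level
    for key, value in additional.items():
        if isinstance(value, Mapping):
            merged[key] = merge_dict_sketch(merged.get(key), value)
        else:
            merged[key] = value
    return merged


defaults = {
    "debug": False,
    "database": {"url": "sqlite:///{user_data_dir}/cihai.db"},
}
overrides = {"debug": True, "database": {"echo": True}}

merged = merge_dict_sketch(defaults, overrides)
print(merged)
```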

cihai.utils.supports_wide()[source]

Return True if the Python interpreter supports wide characters.

Returns:True if python supports wide character sets
Return type:bool
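One common way to perform such a check is to compare sys.maxunicode against the 16-bit limit. The sketch below assumes that approach; it is not necessarily how cihai implements supports_wide(), and the function name is hypothetical.

```python
import sys


def supports_wide_sketch():
    """True when the interpreter can represent code points above U+FFFF."""
    return sys.maxunicode > 0xFFFF


print(supports_wide_sketch())
```

On Python 3.3 and later, sys.maxunicode is always 0x10FFFF, so a check like this only distinguishes narrow from wide builds on older interpreters.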