cihai

cihai - United front to provide open, accessible, and standardized access to CJK data


Current Datasets

Planned datasets

For all data sets, the goal is to achieve:

  • Clear and permissive licensing for public and private use
  • Compatibility with Data Packages, so that data is language agnostic and consistent
  • Open source scripting used to process data into a common format
Set        License       Data Package   Project
Unihan     OK [Unhn-L]   OK [Unhn-D]    OK [Unhn-P]
edict      OK            TODO           TODO
cedict     OK [CDCT-L]   TODO           TODO
cedictgr   OK            TODO           TODO
handedict  OK            TODO           TODO
cfdict     OK            TODO           TODO
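
As a rough illustration of why Data Package compatibility keeps data language agnostic, the sketch below reads resource paths out of a descriptor using only the standard library. The descriptor shown is hypothetical and is not the real cihaidata-unihan datapackage.json; the only assumption about the format is that a descriptor carries a "resources" list whose entries have a "path".

```python
import json

# Hypothetical descriptor for illustration only; real packages will differ.
DESCRIPTOR = """
{
  "name": "example-cjk-dataset",
  "resources": [
    {"name": "entries", "path": "data/entries.csv", "format": "csv"}
  ]
}
"""


def resource_paths(descriptor_text):
    """Return the path of every resource listed in a descriptor."""
    descriptor = json.loads(descriptor_text)
    return [resource["path"] for resource in descriptor.get("resources", [])]


paths = resource_paths(DESCRIPTOR)
print(paths)
```

Because the descriptor is plain JSON, the same lookup can be written in any language without depending on cihai itself.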

Tool

In development

  • Single tool for interfacing with CJK data, comparable to cjklib.
  • API, in Python, for programmatically interfacing with data.
  • Compatible with Python 2.7, 3.3+, and pypy/pypy3.
  • Designed against a robust test suite. See Travis Builds and Revision History.
  • Supports Unihan; support for character decomposition and dictionaries (CEDict) is upcoming.
  • Extensible. For new data sets, read more about how you can extend cihai to support new Data Package-compatible datasets.
  • For more, see internals for design philosophy.

Workgroup and Standardization

  • Find undigitized data sets relating to CJK
  • Clarify and negotiate license details of data sets, see permissively licensing your dataset.
  • Create standardized, consistent packages for all data sets
  • Maintain aforementioned datasets
  • Continue to improve current infrastructure and packages while seeking out rare and undigitized CJK data for preservation and access

Usage

CLI usage

Set up config to point to a database you want to import datasets into (and read from).

debug: True
database:
  url: 'sqlite:///${data_dir}/cihai.db'  # sqlalchemy db url
datasets:
  - 'cihai.datasets.unihan'

Then point to the config with the -c argument: $ cihai -c path/to/config.yaml.
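
The ${data_dir} placeholder in the database url suggests template-style substitution. The sketch below shows one way such a value could be expanded, using Python's string.Template. This is an assumption for illustration, not a description of how cihai resolves its config; the expand_db_url helper and the /tmp/cihai-data directory are hypothetical.

```python
from string import Template


def expand_db_url(url_template, data_dir):
    """Fill the ${data_dir} placeholder in a database url template."""
    # substitute() raises KeyError if the template contains a
    # placeholder that is not supplied here.
    return Template(url_template).substitute(data_dir=data_dir)


# Hypothetical data directory, chosen only for the example.
url = expand_db_url("sqlite:///${data_dir}/cihai.db", "/tmp/cihai-data")
print(url)  # sqlite:////tmp/cihai-data/cihai.db
```

With an absolute data directory the result has four slashes after "sqlite:", which is the form SQLAlchemy uses for an absolute SQLite file path.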

Troubleshooting

Python 2.7 and UCS

Note: to get this working on Python 2.7, you must have a Python built with UCS4 via --enable-unicode=ucs4. You can test for UCS4 with:

>>> import sys
>>> sys.maxunicode > 0xffff
True

Most packaged and included Python distributions are already built with UCS4 (such as Ubuntu’s system Python). On Python 3.3 and greater this distinction no longer exists, so no action is needed.
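
To see why the build matters for CJK data, the following sketch checks the length of a character outside the Basic Multilingual Plane. On a UCS4 (wide) build, and on any Python 3.3+, it is a single code point; on a narrow Python 2.7 build it would be stored as a surrogate pair of length 2. The is_wide_build helper name is illustrative only.

```python
import sys


def is_wide_build():
    """Return True when code points above U+FFFF are supported directly."""
    return sys.maxunicode > 0xFFFF


# U+20000 is the first character of CJK Unified Ideographs Extension B,
# which lies outside the Basic Multilingual Plane.
char = u"\U00020000"

print(is_wide_build())
print(len(char))
```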

Python support   Python 2.7, >= 3.3, pypy
Source           https://github.com/cihai/cihai
Docs             https://cihai.git-pull.com
Changelog        https://cihai.git-pull.com/en/latest/history.html
API              https://cihai.git-pull.com/en/latest/api.html
Issues           https://github.com/cihai/cihai/issues
Travis           https://travis-ci.org/cihai/cihai
Test coverage    https://codecov.io/gh/cihai/cihai
pypi             https://pypi.python.org/pypi/cihai
OpenHub          https://www.openhub.net/p/cihai
License          BSD.
git repo         $ git clone https://github.com/cihai/cihai.git
install stable   $ pip install cihai
install dev      $ git clone https://github.com/cihai/cihai.git cihai
                 $ cd ./cihai
                 $ virtualenv .env
                 $ source .env/bin/activate
                 $ pip install -e .
tests            $ python setup.py test
[Unhn-L] http://unicode.org/charts/unihan.html#Disclaimers
[Unhn-D] https://raw.githubusercontent.com/cihai/cihaidata-unihan/master/datapackage.json
[Unhn-P] https://cihaidata-unihan.git-pull.com/
[CDCT-L] https://www.mdbg.net/chinese/dictionary?page=cedict