# Config - cihai.config

Source: https://cihai.git-pull.com/api/config/

# Config - `cihai.config`

```{eval-rst}
.. automodule:: cihai.config
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Constants - cihai.constants

Source: https://cihai.git-pull.com/api/constants/

# Constants - `cihai.constants`

```{eval-rst}
.. automodule:: cihai.constants
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Conversion - cihai.conversion

Source: https://cihai.git-pull.com/api/conversion/

(cihai.conversion)=

# Conversion - `cihai.conversion`

```{eval-rst}
.. automodule:: cihai.conversion
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Core - cihai.core

Source: https://cihai.git-pull.com/api/core/

# Core - `cihai.core`

```{eval-rst}
.. automodule:: cihai.core
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Database - cihai.db

Source: https://cihai.git-pull.com/api/db/

# Database - `cihai.db`

```{eval-rst}
.. automodule:: cihai.db
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Exceptions - cihai.exc

Source: https://cihai.git-pull.com/api/exc/

# Exceptions - `cihai.exc`

When using cihai via Python, you can catch Cihai-specific exceptions via these. All Cihai-specific exceptions are catchable via {exc}`~cihai.exc.CihaiException` since it's the base exception.

```{eval-rst}
.. automodule:: cihai.exc
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Extending - cihai.extend

Source: https://cihai.git-pull.com/api/extend/

# Extending - `cihai.extend`

```{eval-rst}
.. automodule:: cihai.extend
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# API Reference

Source: https://cihai.git-pull.com/api/

(api)=

(reference)=

# API Reference

cihai's public API for CJK character lookups and dataset management.

:::{warning}
cihai is pre-1.0. APIs may change between minor versions. Pin to a range: `cihai>=0.36,<0.37`.

If you need an API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

## Core

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Cihai (core)
:link: core
:link-type: doc
Application object. Bootstrap datasets, run lookups.
:::

:::{grid-item-card} Config
:link: config
:link-type: doc
Configuration loading and expansion.
:::

:::{grid-item-card} Database
:link: db
:link-type: doc
SQLAlchemy engine/session setup and helpers.
:::

:::{grid-item-card} Extend
:link: extend
:link-type: doc
Base classes for datasets and plugins.
:::

::::

## Supporting Modules

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Constants
:link: constants
:link-type: doc
Default paths and configuration values.
:::

:::{grid-item-card} Conversion
:link: conversion
:link-type: doc
CJK encoding conversion utilities.
:::

:::{grid-item-card} Exceptions
:link: exc
:link-type: doc
Exception hierarchy.
:::

:::{grid-item-card} Log
:link: log
:link-type: doc
Logging helpers.
:::

:::{grid-item-card} Types
:link: types
:link-type: doc
Public type aliases.
:::

:::{grid-item-card} Utils
:link: utils
:link-type: doc
Import and general utility helpers.
:::

::::

```{toctree}
:hidden:

core
config
constants
conversion
db
exc
extend
log
types
utils
```

---

# Logging - cihai.log

Source: https://cihai.git-pull.com/api/log/

# Logging - `cihai.log`

```{eval-rst}
.. automodule:: cihai.log
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Typings - cihai.types

Source: https://cihai.git-pull.com/api/types/

# Typings - `cihai.types`

```{eval-rst}
.. automodule:: cihai.types
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Utilities - cihai.utils

Source: https://cihai.git-pull.com/api/utils/

# Utilities - `cihai.utils`

```{eval-rst}
.. automodule:: cihai.utils
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Datasets

Source: https://cihai.git-pull.com/datasets/

(datasets)=

(data)=

# Datasets

Data sources available through cihai.
::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} UNIHAN
:link: unihan
:link-type: doc
Unicode Han Database -- readings, meanings, variants.
:::

::::

## Planned datasets

For all data sets, the goal is to achieve:

- Clear and permissive licensing for public and private use
- Compatibility with [data packages], for data to be language agnostic and consistent
- Open source scripting used to process data into a common format

| Set       | License           | Data Package           | Project           |
| --------- | ----------------- | ---------------------- | ----------------- |
| UNIHAN    | OK [^cite_unhn-l] | OK [^cite_unhn-d]      | OK [^cite_unhn-p] |
| edict     | OK                | TODO                   | TODO              |
| cedict    | OK [^cite_cdct-l] | TODO                   | TODO              |
| cedictgr  | OK                | TODO                   | TODO              |
| handedict | OK                | TODO                   | TODO              |
| cfdict    | OK                | MISSING [^cite_cfdict] | UNKNOWN           |

[data packages]: https://specs.frictionlessdata.io/data-package/

[^cite_unhn-l]:

[^cite_unhn-d]:

[^cite_unhn-p]:

[^cite_cdct-l]:

[^cite_cfdict]: The database at is missing.

```{toctree}
:hidden:

unihan
```

---

# UNIHAN - cihai.data.unihan

Source: https://cihai.git-pull.com/datasets/unihan/

# UNIHAN - `cihai.data.unihan`

### Bootstrapping

```{eval-rst}
.. automodule:: cihai.data.unihan.bootstrap
   :members:
```

```{eval-rst}
.. autoclass:: cihai.data.unihan.dataset.Unihan
   :members:
   :inherited-members:
   :show-inheritance:
```

```{eval-rst}
.. automodule:: cihai.data.unihan.constants
   :members:
   :inherited-members:
   :show-inheritance:
```

### Variants plugin

```{eval-rst}
.. autoclass:: cihai.data.unihan.dataset.UnihanVariants
   :members:
   :inherited-members:
   :show-inheritance:
```

---

# Extending

Source: https://cihai.git-pull.com/design-and-planning/2013/extending/

---
orphan: true
---

(design-and-planning-2013-extending)=

# Extending

:::{note}
This document is part of brainstorming the cihai project. It's for historic purposes only.
:::

_Written Late 2013_

## Minimum usage

1. Create a python module
2.
   The module has a class with a `.get()` to look up characters by signature `(request, response, *args, **kwargs)`.

With `.get()`, your class may be instantiated and passed into `Cihai`. When a user runs `.get()` inside of `Cihai`, it will check your module's `.get()` also:

```
+----------+
|  Cihai   |  The Cihai Class
+----------+
```

It is instantiated with a database to connect to ({class}`sqlalchemy.schema.MetaData`):

```{code-block} python
c = Cihai(metadata=metadata)
```

`MetaData` is part of the sqlalchemy library. It holds connection and table information. In this instance, cihai shares this information across all plugins that attach to it.

To attach a plugin:

```{code-block} python
from MyCihaiModule import MyDataSet

mydata = MyDataSet()
c.use(mydata)
```

`c`, the instance of {class}`Cihai`, may now access `MyDataSet`'s information.

### Code

```{code-block} python
c = Cihai()
c.use(DatasetExample)

print(c.reverse('hao'))
# { 'definition': 'hao' }

print(c.get('你'))
# { 'definition': 'hao' }
```

## Growing big

The above was an example of the minimum requirement to have your dataset compatible.

### Importing data into database

One of the goals of Cihai is to provide a common way to access Chinese data.

To import the data, you must create an SQL schema / table for your data. The pristine format of your data may be in CSV, Excel or another format. As long as your data is normalized into a {obj}`dict` that is compatible with the SQL table, it is ok.

To accommodate this, {class}`Cihai` provides all plugins an instance of {class}`sqlalchemy.schema.MetaData` on creation. [sqlalchemy][sqlalchemy] is the swiss army knife of databases in the python programming language.

With an instance of `MetaData`, you will be able to create SQL tables, import and retrieve data.

### Deeper

In previous examples, the plugin class with `.get` and `.reverse` character lookups was merged with 1 SQL table.

As said previously, it doesn't matter how or where the data comes from.
What matters is that {class}`Cihai` can retrieve data via `.get` with the correct arguments and response.

The prior example had the data class combined with a single table. In databases that use multiple tables, you may create a central dataset class with `.get()` and access the tables from there.

[sqlalchemy]: http://www.sqlalchemy.org

---

# Information Liberation

Source: https://cihai.git-pull.com/design-and-planning/2013/information_liberation/

---
orphan: true
---

(design-and-planning-2013-information-liberation)=

# Information Liberation

:::{note}
This document is part of brainstorming the cihai project. It's for historic purposes only.
:::

_Written Late 2013_

Datasets should be available under the most permissive license possible, so long as they provide attribution to the person or official institution creating them.

Cihai believes in permissive licensing (do as you like: for fun, academic, profit-making) but with attribution.

Common concerns people have over their datasets are:

- What if someone uses my dataset for profit purposes?
- What if someone doesn't give attribution to my / my colleagues / my institution / my effort?
- What if someone doesn't contribute modifications to my / my colleagues / my institution / my effort?

If you have not participated in an open source software effort, you would be surprised how happy people are to contribute to a common effort. Efforts like GNU/Linux are world-wide collaborations bringing together a rock-solid OS powering supercomputers and the internet.

## Permissive-licensing your dataset

Even if you have taken into consideration the fruits of an open source software effort, you should know some of them are built upon restrictive, viral licenses which are commercially-unfriendly and complicated. Borrowing _their_ code means any non-GPL compatible effort, whether private _or_ freely available under a permissive license, can't use it without turning the software into GPL too!

But GPL licensed software can use permissive software in their efforts!
It opened my mind after years of thinking GPL was all principle and virtue! Defend the weak! Protect the innocent! _Freedom._

To add to the confusion, this is referred to by some people as "Free" software. In reality, it provides open source, but inhibits the real-world realities of a value-added, passionate and expressive society, where people want it to be their choice whether they distribute their changes. If you ever built something special and felt your hard work deserved to be placed into the market to see how people receive its value, you would understand. But others may find this selfish!

If you have not seen a permissively-licensed software project, you would be surprised to see that people are contributing to these projects even though _there is no fine print requiring them to_.

Look into your roots: if you at your core want to open a useful work to the world, why hinder it with a minefield of caveats?

The success of open source efforts isn't a product of rules, but a result of fast computers, fast internet, convenient developer tools, brain power, and a community of self-interested / passionate individuals who put in the hours, had the discipline to learn programming, and had the courage and desire to help.

### Case study: IPython

This is an observation and doesn't imply endorsement by IPython or its contributors.

There is no requirement for providing an open source derivative (or an upstream patch) for a modification, so does IPython development go stale? [93 pages of patches] committed to the project.

There is no requirement restricting large corporations from using them and giving nothing back, [Microsoft donated $100K to IPython].

Perhaps academic institutions will snub them for using a permissive license? The core developers are academics.

Perhaps non-profits will snub them for not using a permissive license? [IPython gets a $1.15M sloan foundation grant].
[93 pages of patches]: https://github.com/ipython/ipython/pulls?direction=desc&page=1&sort=created&state=closed
[ipython gets a $1.15m sloan foundation grant]: http://ipython.org/sloan-grant.html
[microsoft donated $100k to ipython]: http://ipython.org/microsoft-donation-2013.html

## Conclusion

A collaborative open source effort is passionate and self-interested parties coming together to be constructive.

Laozi, 老子, pioneered the concept of Wu wei (無爲), we do without doing. In a totally uncontrived way, without requirement, people from around the world crossed political, language and geographical barriers to bring together a common creation. It furthers one to see that.

Whether you are copyleft, academic, private or none of the above, providing your data under a permissive license will open your work to the world.

- [Open Data Commons Attribution License (ODC-By) v1.0]
- [MIT License]

[mit license]: http://opensource.org/licenses/MIT
[open data commons attribution license (odc-by) v1.0]: http://opendatacommons.org/licenses/by/1.0

---

# Internal Design decisions

Source: https://cihai.git-pull.com/design-and-planning/2013/internals/

---
orphan: true
---

(design-and-planning-2013-internals)=

# Internal Design decisions

## Convenient relational, cohesive output of datasets

Whether you are using Cihai from Python, creating a CJK tool in another programming language, or simply looking to use a dataset, cihai will provide tools to turn raw datasets into a familiar table / relational friendly format.

Certain datasets may also offer to download the latest datasets from a server.

Cihai is a tool to convert CJK datasets to a common, relational format and provide a convenient Python API for studying deeply.

## Bootstrapping

Cihai's core functionality relies on a `ForeignKeyConstraint` against an entry on a master table of Chinese characters / strings, which are assigned an integer ID for use as a `ForeignKey`.
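A minimal sketch of this master-table idea, using the standard library's `sqlite3` rather than SQLAlchemy so that it stays self-contained. The table names `cjk_string` and `example_definition` and the helper functions are hypothetical, chosen only for illustration; they are not cihai's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(
    """
    -- Hypothetical master table: every character / string gets an integer ID.
    CREATE TABLE cjk_string (
        id INTEGER PRIMARY KEY,
        string TEXT NOT NULL UNIQUE
    );
    -- Hypothetical dataset table: refers to the master table by integer ID.
    CREATE TABLE example_definition (
        id INTEGER PRIMARY KEY,
        string_id INTEGER NOT NULL REFERENCES cjk_string (id),
        definition TEXT NOT NULL
    );
    """
)


def get_or_create_string_id(conn, string):
    """Look the string up in the master table, creating its ID if missing."""
    row = conn.execute(
        "SELECT id FROM cjk_string WHERE string = ?", (string,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute("INSERT INTO cjk_string (string) VALUES (?)", (string,))
    return cursor.lastrowid


def add_definition(conn, string, definition):
    """Insert a dataset row; the master-table lookup is the extra install cost."""
    string_id = get_or_create_string_id(conn, string)
    conn.execute(
        "INSERT INTO example_definition (string_id, definition) VALUES (?, ?)",
        (string_id, definition),
    )


def lookup(conn, string):
    """Fetch definitions by joining on the integer ID."""
    rows = conn.execute(
        "SELECT d.definition FROM example_definition AS d "
        "JOIN cjk_string AS s ON s.id = d.string_id "
        "WHERE s.string = ? ORDER BY d.id",
        (string,),
    ).fetchall()
    return [row[0] for row in rows]


add_definition(conn, "好", "good")
add_definition(conn, "你好", "hello")
add_definition(conn, "好", "well")
print(lookup(conn, "好"))
```

Each dataset row carries only the integer `string_id`, so additional dataset tables could join on the same ID.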
This way, datasets added to Cihai can achieve fast lookups and JOINs via integers instead of unicode characters.

If the word has multiple characters, Cihai will create the character ID for you and cite it for you, and all future entries added to your dataset will reference it.

This makes installing datasets in Cihai expensive (requiring a lookup of a unicode string in a central table before inserting into the database), with the benefit of potentially returning a huge cross-section of available CJK data in one swift stroke.

---

# Planning

Source: https://cihai.git-pull.com/design-and-planning/2013/spec/

---
orphan: true
---

(design-and-planning-2013-spec)=

# Planning

:::{note}
This document is part of brainstorming the cihai project. It's for historic purposes only.
:::

_Written Late 2013_

Scribblings on cihai dev.

## Configuration

It can accept a custom configuration file via command line with `-c`:

```console
$ python -m cihai -c myconfig.yml
```

Where your configuration file overrides the default settings. You can see the default settings in the `cihai` package as `config.yml`.

Developers may use `dev/config.yml`. The TestCase will use the `test_config.yml`.

```console
$ python -m cihai
```

Will start up cihai with normal configuration settings. A configuration file may also be used.

```console
$ python -m cihai -c dev/config.yml
```

## History of CJK libraries

### Unihan

Unihan, which is short for "Han Unification", is a standard published by the Unicode Consortium for CJK ideographs (also interchangeably referred to as "glyphs", "characters", "chars"). [Unihan's History] goes into greater detail on this.

The first electronic release was in July 1995 as [CJKXREF.TXT] (961 kB). The second release, which resembles the formatting used in modern versions, was released in July 1996 with Unicode 2.0 as [Unihan-1.txt]. In an accident, the `Unihan-1.txt` (7.9MB) file was missing the final pieces after `U+8BC1`; no corrected version was made available.
In May 1998, [Unihan-2.txt] was released with Unicode 2.1.2.

Unihan Inc. is the center of the universe for all glyphs. For those who study Egyptian hieroglyphics, which are still mysterious, they are covered in Unicode block [U+13000..U+1342F].

[u+13000..u+1342f]: http://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block)
[unihan's history]: http://www.unicode.org/reports/tr38/#History
[cjkxref.txt]: http://www.unicode.org/Public/1.1-Update/CJKXREF.TXT
[unihan-1.txt]: http://www.unicode.org/Public/2.0-Update/Unihan-1.txt
[unihan-2.txt]: http://www.unicode.org/Public/2.1-Update/Unihan-2.txt

### cjklib

[cjklib](cjklib) is a major python library created by Christoph Burgmer for han character research.

"Cjklib provides language routines related to Han characters (characters based on Chinese characters named Hanzi, Kanji, Hanja and chu Han respectively) used in writing of the Chinese, the Japanese, infrequently the Korean and formerly the Vietnamese language(s). Functionality is included for character pronunciations, radicals, glyph components, stroke decomposition and variant information. Cjklib is implemented in Python."

### Cihai

Early iterations of Cihai focused on the external API first. Every data set was to be a plugin. The idea was that [Hanzi], a similar project in nodejs, could share a similar API and datasets could be universal.

The potential would be to provide two high-quality libraries for python and node, which are extendable to new data sets and reduce duplication.

It is better to take the time to discover the variable nature of datasets and how they interconnect.

### Current

The next iteration of cihai is to grasp an understanding of:

- what different data sets look like, how they return data?
- is there commonality between all?
- how their results can elicit deeper research and exploring of chinese characters

This is an exploration phase.

## External API

### Cihai Spec

Both Cihai and Hanzi libraries can use a similar API.
- Reduce duplicated effort
- Provide a main, tested CJK library to Python and node
- Collaborate to assure both projects have access to open data sets and chinese character techniques.

Larger charter:

- Workgroup to develop a specification for core, pluggable CJK library across various programming languages.
  - follow best practices.
    - documentation
    - unit tests / ci
    - consistent with coding idioms / pragmas (pythonic / pocoo / reits, connect / underscore / node)
    - be available on package archives (npm, pypi).
  - Across languages, core tools should have similar API method names, creating instance of data retrieval object
  - Extendable to new datasets as middleware.
  - Documentation for creating a new middleware.
- Find more data sets and encourage data providers / data owners to use an open data license.
- Find more libraries across various programming languages with a CJK tool.
  - If project is a duplicate effort, notify that there is another effort underway and they can participate.
  - If project is a new tool:
    - see if they have a dataset. If they do, see license of ODC/OBDC.
    - see if their library is BSD or MIT. If not, see if they're willing to license as such. \*
    - see if they are willing to use the Workgroup's API specification.
      - If willing, but no time, offer to patch.
      - If not interested at all, create an adapter for the project as a separate effort.

\* If the library is GPL, it can cause conflict down the road: if the project author does not have the time / interest in adopting the specification, even creating an adapter to their project could trigger GPL.

## Licensing

### Core software

BSD or MIT. The Core apps should be BSD 3-clause to protect the name of the app (Cihai or Hanzi).

### Extensions / Contrib licensing

Middleware can be included in the project as officially supported.

Contrib and third party plugins can be available under BSD or MIT.

### Data sets

Data for chinese should be available under the most permissive license possible.
## How should data be looked up?

I would like to try to encourage use of a single, simple hook, `.get`.

After `.get` is used, the arguments may then be passed through middleware classes / methods. The same principle applies for `.reverse` matches.

### Chinese character

Currently, Hanzi uses:

```{code-block} javascript
hanzi.decompose('爱')
// transition to:
hanzi.get('爱')
hanzi.reverse('爱') // to look up any indices / decompositions / words where 爱 may match.
```

Currently cjklib uses:

```{code-block} python
cjk.getStrokeOrder(u'说')
# transition to:
cjk.get('说')
```

```{code-block} python
Cihai.get('好')
```

### String of Chinese Characters

Use `.get` too. This may seem problematic, but checking the `.length` or `len()` of the argument can suffice.

```{code-block} javascript
var decomposition = hanzi.decomposeMany('爱橄黃');
// transition to
var decomposition = hanzi.get('爱橄黃');
```

```{code-block} python
Cihai.get('爱橄黃')
```

## How should data returned look?

Schema.

Questions:

- Is there already an open standard that can be adopted?
- Should `.get` return a raw object / dict or an object:

  ```
  c = c.get('你') # return a ResultObject / Backbone.Model / mongoose
                  # document type of object.
  c.toJSON() # backbone / sqlalchemy style
  ```

The data should follow the same schema.

What would an API response for these possibilities look like? If something generic like `.get()` is entered,

- character decomposition
- a unihan field ('kDefinition', 'kStrokes', 'kFrequency', ...)
-
- If `.get` is the only way to retrieve hits, more possibilities exist.
For hanzi/node:

```{code-block} javascript
results = hanzi.get('你好。怎么样?')
```

or for cihai/python:

```{code-block} python
results = cihai.get('你好。怎么样?')
```

May return hits from jieba middleware (jieba doesn't exist in node yet):

```
results.words = [ '你好', '怎么样' ]
```

The user may then drill down further:

```{code-block} python
for word in results.words:
    print(cihai.get(word))
```

or

```{code-block} javascript
_.each(results.words, function(word) {
  console.log(hanzi.get(word));
});
```

:::{warning}
If dictionaries / datasets are extensible, there may be collisions if they can reserve keys in the official result namespace.
:::

Two plugins could try to reserve `.words` as a name. Many dictionaries would want to reserve `.definition` as a name.

To counteract this, a namespace can be adopted for middleware, or we can have the Core resolve the conflict:

1. Append underscore + number on conflict, etc. (`c.definition_1`, `c.definition_2`):

   The first middleware using `words` can get `results.words`. The middleware called after will get `results.words_1`. This is seen in [SQLAlchemy's labels] to [avoid label collisions].

2. Middleware / datasets use namespace with `_` (`c.unihan_kDefinition`):

   Pros:

   - iterable access to python `c.keys()` and `for (var key in dict)` in js.
   - all data returned can be accessed without nesting into dotted namespaces.

   Cons:

   - `result.unihan_kDefinition_these_things_getlong`
   - extension name and word separation can be confused.

3. Middleware may use dot namespace (`c.unihan.kDefinition`)

   Pros:

   - Internal Core API is far simpler and lighter
   - Easier to look at
   - More common practice, [aws_cli].
   - Middleware is a package module, symbolically `.`'s are used to separate modules and packages (java, python, informally in JS).
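The three options above can be compared side by side with a small sketch. Everything here is hypothetical: the function names, the `(plugin_name, data)` input shape and the sample plugin names are assumptions made for illustration, not part of any cihai or Hanzi API.

```python
def merge_with_suffix(plugin_results):
    """Option 1: the first plugin keeps the key, later ones get key_1, key_2, ..."""
    merged = {}
    for _plugin, data in plugin_results:
        for key, value in data.items():
            name, n = key, 0
            while name in merged:  # keep counting until the name is free
                n += 1
                name = f"{key}_{n}"
            merged[name] = value
    return merged


def merge_with_prefix(plugin_results):
    """Option 2: flat keys prefixed with the plugin name, e.g. unihan_kDefinition."""
    return {
        f"{plugin}_{key}": value
        for plugin, data in plugin_results
        for key, value in data.items()
    }


def merge_nested(plugin_results):
    """Option 3: one nested mapping per plugin, e.g. result['unihan']['kDefinition']."""
    return {plugin: dict(data) for plugin, data in plugin_results}


# Made-up results from three hypothetical middleware, two of which want `words`.
plugin_results = [
    ("jieba", {"words": ["你好", "怎么样"]}),
    ("tokenizer", {"words": ["你", "好"]}),
    ("unihan", {"kDefinition": "good"}),
]

print(sorted(merge_with_suffix(plugin_results)))
print(sorted(merge_with_prefix(plugin_results)))
print(sorted(merge_nested(plugin_results)))
```

With the suffix strategy, which key a plugin ends up under depends on the order the middleware runs in. The prefixed and nested strategies give each plugin a stable location.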
[sqlalchemy's labels]: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/lib/sqlalchemy/sql/elements.py#L2393
[avoid label collisions]: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/test/sql/test_compiler.py#L2549
[aws_cli]: https://github.com/aws/aws-cli

## Extension philosophy

The middleware approach provides the best practice to get the job done.

[Connect] in node represents the best practice in plugin architecture in JS. Middleware is added as a way to provide a lite, dead-simple framework. Cihai / Hanzi can take a similar approach.

Hanzi can take example directly from connect's approach. It is clean and proven.

Cihai can note that middleware is already used in Django; packages can be maintained using the pattern for Flask extensions and sphinx. Flask already has experience / lessons learned from packaging and namespacing extensions.

It can use the same data sets, similar API and extension strategy.

[connect]: https://github.com/senchalabs/connect

## Accessing extensions directly?

Perhaps extensions can also be searched directly:

```
c.unihan.get('好')
```

Third party APIs can specify optional extra arguments, for instance, unihan may allow searching by one field:

```
c.unihan.get('好', 'kDefinition')
```

This allows a simple way to "drill down" cjk data across extensions.

## API examples

Example:

```{code-block} python
obj = unihan.get('好')
# retrieves all rows. it will create a keyed object:
obj.kDefinition
obj['kDefinition']
obj.keys()  # ['kDefinition',]

obj = unihan.get('好', 'kDefinition', ...)
obj.kDefinition  # good
obj.kStrokes  # None
```

## Creating a cihai plugin

```{code-block} python
class Unihan(Cihai.Contrib):
    """
    Utilizing a parent class can allow raising ``NotImplementedError`` errors.

    Further, this can provide access to a ``db``.
    However, ultimately, the only thing that's really required is::

        class Example(object):
            def get(self, char):
                return { 'char': char }
    """

    def get(self):
        pass

    def install(self):
        pass


cihai = Cihai()
cihai.use(Unihan)  # register the middleware with

c = cihai.get('好')
>>> c.keys()
['unihan']
>>> c.get('好')
>>> print(c.get('好'))
>>> print(c.get('好').parent)
# Below this point, libunihan splits into subplugins for its libraries.
>>> print(dict(c.get('好')))
```

Cihai allows extensibility to new dictionaries, vocabularies and data. Middleware allows an arbitrary plugin to make data available.

By default, `Cihai()` creates an instance of Cihai with access to {meth}`Cihai.get`. However, since no middleware is included with Cihai, no results are returned.

With `Cihai(middleware=[Cihai.Unihan])`, or `c = Cihai()` followed by `c.use(Cihai.Unihan)`, the Cihai_Unihan is available.

What is Cihai_Unihan? Simply an object with:

```{code-block} python
class Unihan(Cihai.Contrib):
    pass
```

[hanzi]: https://github.com/nieldlr/Hanzi

---

# Internals Planning

Source: https://cihai.git-pull.com/design-and-planning/2017/spec/

---
orphan: true
---

(design-and-planning-2017-spec)=

# Internals Planning

Created 2017-04-29

1. {ref}`zero_config` - cihai should be able to work without configuration with a default data backend.
2. {ref}`incremental_config` cihai should be incrementally configurable, such as by specifying where data should be outputted.
3. {ref}`relational_backend` cihai will use [SQLAlchemy] as a database backend to store information for retrieval.
4. {ref}`automatic_extensions` cihai will make data accessible to third party libraries if they exist in the script's site-packages. e.g. If [pandas] is found, it will be able to return a {class}`pandas.DataFrame` for a queried set of information.
5.
   {ref}`unihan_core` cihai will use [UNIHAN] as a core and source of truth for information, as it contains all the glyphs and is reliable, free and well-maintained, and provides a good source of starter information.
6. {ref}`data_normalization` cihai will adopt a standard data format to store additional CJK data sets within.
7. {ref}`data_liberation` cihai libraries will be available under permissive licenses.

(zero-config)=

## Zero config

cihai will be able to be used immediately without a user configuring their system.

cihai will conform with the [XDG specification] for determining where to check out data to. This includes:

- Where to store downloaded source files, e.g. _XDG_CACHE_HOME/cihai/downloads_
- Where to store default backend data, e.g. _XDG_DATA_DIRS/cihai/data_, as well as the default file name used within, e.g. _data.sqlite_
- Where to check for configuration files, e.g. _XDG_CONFIG_HOME/cihai_, as well as the default file name used within, e.g. _cihai.yaml_

These default directories will be where cihai will, by default, store information and search for configuration used in {ref}`incremental_config`.

(incremental-config)=

## Incremental configuration

The [SQLAlchemy] data backend used, which for SQLite also includes the file path used to store the SQLite file, is customizable.

[xdg specification]: https://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html

(relational-backend)=

## Relational backend

cihai will be powered by a relational database backend. Most python distributions include support for [SQLite], which in conjunction with {ref}`zero configuration `, makes for a data store that will work across a wide array of systems.

The data that cihai organizes will be primarily indexable by the glyph, and joined upon the glyph to pull in an ever expanding assortment of information on that character.

(automatic-extensions)=

## Automatic extension detection

Don't reinvent the wheel, interoperate.
cihai will check for libraries such as pandas and other tabular libraries to easily produce native objects for the user based on their cihai data lookup.

This comes at no performance penalty, since the ability to export to a third party object, such as a {class}`pandas.DataFrame`, is only used when that library is available.

(unihan-core)=

## UNIHAN core

cihai's library of CJK information will be backed by the reliable [UNIHAN] database, which is approved by the Unicode Consortium.

### Operation

It is to be determined if UNIHAN will be vendorized in the packaging or retrieved remotely.

(data-normalization)=

## Data normalization

CJK datasets made available by cihai and contributors should follow a yet-to-be-determined standard for keeping data conserved, readily available and sustainable.

### Standards

The initial consideration, since 2013, was that datasets would follow [Data Packages].

In place of Data Packages, a simpler and more lax guideline, along with python interfaces, may be considered. This determination is pending further review of datasets.

In place of frictionlessdata's data package libraries, cihai may opt for a simpler, yet more powerful system for making tabular data.

(data-liberation)=

## Data liberation

CC-0, MIT, ISC, BSD. Data sets should be available under licenses free from unintended side effects of derivative creation.

[sqlite]: https://sqlite.org/
[pandas]: http://pandas.pydata.org/
[sqlalchemy]: https://www.sqlalchemy.org
[unihan]: http://www.unicode.org/reports/tr38/
[data packages]: http://frictionlessdata.io/data-packages/

---

# Extensions

Source: https://cihai.git-pull.com/design-and-planning/2018/plugin-system/

---
orphan: true
---

(design-and-planning-2018-plugin-system)=

# Extensions

Initially discussed in #131 [^id6]

The provisional recommended naming convention is this:

```
cihai_{dataset}(?_{extension})
cihai_unihan
cihai_unihan_variants
```

# Benefits

Automatic configuration

Datasets are automatically namespaced as configs.
Reciprocally, they have access to the instance of cihai's configuration.

With the database mixin, your dataset can be automatically configured and work with your user's configured DB backend. By default, it's SQLite! Reciprocally, you can also access databases and tables connected to cihai.

```{code-block} python
c = Cihai()
c.add_dataset('unihan')  # install package
c.unihan.lookup('好')  # bootstraps

c.add_dataset('unihan', namespace='unihan2')  # install package
# checks
c.unihan2.lookup('好')
```

The optional `namespace=` lets cihai provide root-level access to datasets, while being able to deprecate / move a dataset (however unlikely) if it were to conflict with a new method name / property on the main cihai object.

It also allows namespacing a forked dataset, and adding it:

```{code-block} python
c.add_dataset('my_forked_unihan', namespace='unihan')  # install package
```

In the future, with libvcs:

```{code-block} python
c.add_dataset(Unihan, namespace='unihan')  # raw class
```

Future possibilities: This makes it possible to develop locally, make a small adjustment, or maintain a VCS branch, in the event a dataset falls out of sync or you want to hack on it / fork it.

```{code-block} python
c.add_dataset('package.to.unihan.Unihan', namespace='unihan')  # import string
```

# Versioning

Please provide a **version** on your package if you distribute it. This is to make sure cihai <-> dataset <-> extension have the fresh / new features and data, but can also lock APIs so production cases don't break.

# Extending datasets

You can also create packages, or even pure functions, that extend datasets.

Extensions for datasets have access to the dataset's configuration, the sqlalchemy database (if it used it), and any other data-access it made available. For instance, if it had a custom data backend, it could make that available to the extension for it to use.
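One way an extension could be handed its parent dataset is sketched below. The `Dataset` and `Variants` classes, the dict-based backend and the placeholder data are all hypothetical stand-ins written for this illustration; they are not cihai's real classes or UNIHAN data.

```python
class Dataset:
    """Hypothetical dataset holding its configuration and a data backend."""

    def __init__(self, config, backend):
        self.config = config
        self.backend = backend

    def add_extension(self, extension_cls, namespace=None):
        # The extension receives the dataset, so it can reach the dataset's
        # config and whatever backend the dataset chose to expose.
        extension = extension_cls(dataset=self)
        setattr(self, namespace or extension_cls.__name__.lower(), extension)
        return extension


class Variants:
    """Hypothetical extension that reads from its parent dataset's backend."""

    def __init__(self, dataset):
        self.dataset = dataset

    def lookup(self, char):
        return self.dataset.backend.get(char, [])


# Placeholder data only: a plain dict stands in for a custom data backend.
unihan = Dataset(
    config={"namespace": "unihan"},
    backend={"好": ["placeholder-variant"]},
)
unihan.add_extension(Variants)
unihan.add_extension(Variants, namespace="variants2")

print(unihan.variants.lookup("好"))
print(unihan.variants2.dataset.config["namespace"])
```

Passing the dataset itself keeps the extension unaware of how the backend was configured.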
```{code-block} python

c.add_dataset('unihan')
c.unihan.add_extension(Variants)
c.unihan.variants.lookup('好')

# The same optional namespace= is possible:
c.unihan.add_extension(Variants, namespace='variants2')
c.unihan.variants2.lookup('好')
```

For the first draft, pointing straight to the package -> module -> object via import string is the surest thing (since this is compatible with the user's local python package environment and works well regardless of developing or general usage):

    c.unihan.add_extension('package.to.import.unihan.Unihan')

This is similar to the way FLASK_CONFIG points to an object inside of a python module.

# Todo

- Allow cihai to install packages via pip.

# History

Early cihai ideas made SQLAlchemy a requirement. The initial plan was to keep everything under a single namespace and database, and to reduce queries by building big queries. This is being phased out in favor of making cihai easy to hack on.

# Idea: pip-based add_dataset/add_extension

For development / hacking purposes, the same file and VCS options still exist:

```{code-block} python

# import string
c.add_dataset('package.to.unihan', classname='Unihan', namespace='unihan')
c.add_dataset(
    'git+https://github.com/moo/cihai-unihan#test-branch',
    classname='Unihan',
    namespace='unihan'
)
c.add_dataset('./path/to/dataset', classname='Unihan', namespace='unihan')

c.unihan.add_extension('cihai_unihan_variants')
c.unihan.add_extension(
    'git+https://github.com/moo/cihai-unihan#test-branch',
    namespace='unihan'
)
c.unihan.add_extension('./path/to/dataset', classname='Unihan', namespace='unihan')
```

# Idea: Namespacing

As of now, the idea is to avoid overengineering / bureaucracy caused by adopting setuptools namespacing. Be like Django, which doesn't enforce package naming.

cihai.extensions.datasetname.extensionname, but that has difficulties [^id7]

This namespace / organization makes it possible for cihai to detect.
[^id8]

cihai-contrib makes packages available under cihaicontrib, similar to sphinx-contrib's structure [^id9]

uses Python's namespaces [^id10]

# See Also

[^id6]: Add variant methods. GitHub issues for cihai. Accessed September 1st, 2018.

[^id7]: Flask's deprecation of flask.ext and flask_ext. Accessed September 1st, 2018.

[^id8]: Sphinx extensions. Accessed September 1st, 2018.

[^id9]: sphinx-contrib. Accessed September 1st, 2018.

[^id10]: Python namespaces. Accessed September 7th, 2018.

---

# Design and Planning

Source: https://cihai.git-pull.com/design-and-planning/

(design-and-planning)=

# Design and Planning

## 2018

- {ref}`design-and-planning/2018/plugin-system`

## 2017

- {ref}`design-and-planning/2017/spec`

## Late 2013

These were part of the initial brainstorming for the project and are preserved for historic purposes only.

- {ref}`design-and-planning/2013/extending`
- {ref}`design-and-planning/2013/internals`
- {ref}`design-and-planning/2013/information_liberation`
- {ref}`design-and-planning/2013/spec`

## Late 2012

- [State of datasets] on cjklib tracker

[state of datasets]: https://github.com/cburgmer/cjklib/issues/3

---

# Glossary

Source: https://cihai.git-pull.com/glossary/

(glossary)=

# Glossary

```{eval-rst}
.. glossary::

    CJK
        1. In computer software, `internationalization `_ of *Chinese, Japanese, and Korean* language.
        2. In cihai, specifically, character information from Chinese, Japanese and Korean languages, such as definitions, dictionary index references, phonetics, character decompositions and stroke information (order and amount).

    UNIHAN
        A character database of CJK information provided by the Unicode Consortium. See the documentation at http://www.unicode.org/reports/tr38/.

    cjklib
        A popular CJK library in python created by Christoph Burgmer

    cihai
        1. A CJK library in python built from the ground up under a permissive license and modern python development practices
        2.
           A workgroup for finding, digitizing, and preserving CJK datasets

    SQLAlchemy
        A relational database library used to store and retrieve character information in cihai

    Data Packages
        A standard for storing data, see http://frictionlessdata.io/data-packages/

    XDG Base Directory
        A specification for directory locations designed to work across platforms. See https://specifications.freedesktop.org/basedir-spec/basedir-spec-0.6.html.
```

[internationalization]: https://en.wikipedia.org/wiki/Internationalization_and_localization

---

# Changelog

Source: https://cihai.git-pull.com/history/

(history)=

```{include} ../CHANGES

```

---

# cihai

Source: https://cihai.git-pull.com/

(index)=

# cihai

Python library for {term}`CJK` (Chinese, Japanese, Korean) character data. Look up readings, definitions, and variants from the [UNIHAN](datasets/unihan.md) database and beyond.

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Quickstart
:link: quickstart
:link-type: doc
Install and make your first lookup in 5 minutes.
:::

:::{grid-item-card} Topics
:link: topics/index
:link-type: doc
Features, examples, extending, troubleshooting.
:::

:::{grid-item-card} API Reference
:link: api/index
:link-type: doc
Every public class, function, and exception.
:::

::::

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Datasets
:link: datasets/index
:link-type: doc
UNIHAN and planned data sources.
:::

:::{grid-item-card} Internals
:link: internals/index
:link-type: doc
Private APIs -- no stability guarantee.
:::

:::{grid-item-card} Contributing
:link: project/index
:link-type: doc
Development setup, code style, release process.
:::

::::

## Install

```console
$ pip install cihai
```

```console
$ uv add cihai
```

## At a glance

```python
from cihai.core import Cihai

c = Cihai()

if not c.unihan.is_bootstrapped:  # download and install UNIHAN to db
    c.unihan.bootstrap()

query = c.unihan.lookup_char('好')
glyph = query.first()
print("lookup for 好: %s" % glyph.kDefinition)
# lookup for 好: good, excellent, fine; well

query = c.unihan.reverse_char('good')
print('matches for "good": %s ' % ', '.join([glph.char for glph in query]))
# matches for "good": 㑘, 㑤, 㓛, 㘬, 㙉, 㚃, ...
```

See [Quickstart](quickstart.md) for detailed installation and first steps.

```{toctree}
:hidden:

quickstart
topics/index
api/index
datasets/index
internals/index
project/index
design-and-planning/index
history
glossary
GitHub
```

---

# Config reader - cihai._internal.config_reader

Source: https://cihai.git-pull.com/internals/api/config_reader/

# Config reader - `cihai._internal.config_reader`

```{eval-rst}
.. automodule:: cihai._internal.config_reader
   :members:
   :undoc-members:
   :show-inheritance:
   :no-value:
```

---

# Internal API

Source: https://cihai.git-pull.com/internals/api/

(internal_api)=

# Internal API

```{module} cihai

```

:::{warning}
Be careful with these! Internal APIs are **not** covered by version policies. They can break or be removed between minor versions!

If you need an internal API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

```{toctree}
:caption: Internal API
:maxdepth: 1

config_reader
types
```

---

# Typings - cihai._internal.types

Source: https://cihai.git-pull.com/internals/api/types/

# Typings - `cihai._internal.types`

```{eval-rst}
.. automodule:: cihai._internal.types
   :members:
   :undoc-members:
   :show-inheritance:
   :no-value:
```

---

# Internals

Source: https://cihai.git-pull.com/internals/

(internals)=

# Internals

:::{danger}
**No stability guarantee.** Internal APIs are **not** covered by version policies.
They can break or be removed between any minor versions without notice.

If you need an internal API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Config Reader
:link: api/config_reader
:link-type: doc
Internal configuration file loading and expansion.
:::

:::{grid-item-card} Types
:link: api/types
:link-type: doc
Internal type aliases and protocols.
:::

::::

```{toctree}
:hidden:

api/index
```

---

# Code Style

Source: https://cihai.git-pull.com/project/code-style/

# Code Style

cihai follows consistent coding standards across all repositories in the cihai organization.

## Formatting and linting

[ruff](https://ruff.rs) handles formatting, import sorting, and linting in a single tool.

```console
$ uv run ruff check .
```

```console
$ uv run ruff format .
```

Auto-fix safe lint violations:

```console
$ uv run ruff check . --fix --show-fixes
```

## Type checking

[mypy](http://mypy-lang.org/) with `strict = true` is used for static type checking.

```console
$ uv run mypy .
```

## Docstrings

Use [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html) docstrings with reStructuredText markup.

## Imports

- Use `from __future__ import annotations` at the top of every module.
- Use namespace imports for the standard library: `import pathlib` rather than `from pathlib import Path`.
- Use `import typing as t` and access members via `t.NamedTuple`, etc.

---

# Contributing

Source: https://cihai.git-pull.com/project/contributing/

(contributing)=

(developing)=

(workflow)=

# Contributing

As an open source project, all cihai projects accept contributions through GitHub, GitLab and Codeberg. Below you will find resources on the internals of the project.

:::{note}
This guide applies to all cihai projects, not just the cihai repo.
:::

Cihai projects use standard conventions and patterns based on best practices in python. To be efficient at debugging, developing, testing, documenting, etc.
it helps to familiarize yourself with the tools within, independently if needed.

`` can be assumed to be an existing or future cihai project, including [cihai](https://github.com/cihai/cihai), [cihai-cli](https://github.com/cihai/cihai-cli), [unihan-etl](https://github.com/cihai/unihan-etl), [unihan-db](https://github.com/cihai/unihan-db). See [GitHub](https://github.com/cihai), [GitLab](https://gitlab.com/cihai) and [Codeberg](https://codeberg.org/cihai).

## Development environment

[uv] is a required package to develop.

```console
$ git clone https://github.com/cihai/.git
```

```console
$ cd
```

So if `` is [cihai]:

```console
$ git clone https://github.com/cihai/cihai.git
```

```console
$ cd cihai
```

## Install dependencies

```console
$ uv sync --all-extras --dev
```

Justfile commands prefixed with `watch-` will watch files and rerun.

## Tests

[pytest] is used for tests.

```console
$ uv run py.test
```

### Rerun on file change

via [pytest-watcher] (works out of the box):

```console
$ just start
```

via [`entr(1)`] (requires installation):

```console
$ just watch-test
```

[pytest-watcher]: https://github.com/olzhasar/pytest-watcher

### Manual (just the command, please)

```console
$ uv run py.test
```

or:

```console
$ just test
```

### pytest options

_File and test names in the examples below are for [cihai]; if using a different cihai project, adjust the file and test names accordingly._

`PYTEST_ADDOPTS` can be set in the commands below. See [docs.pytest.com] for the latest documentation.
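pytest reads `PYTEST_ADDOPTS` from the environment and splits it into individual command-line arguments. The sketch below only mimics that splitting step with the standard library, to show why the whole value is quoted as one string in the commands that follow. `preview_addopts` is a hypothetical helper for illustration, not a pytest function.

```python
from __future__ import annotations

import os
import shlex
import typing as t


def preview_addopts(environ: t.Mapping[str, str] | None = None) -> list[str]:
    """Split PYTEST_ADDOPTS the way a POSIX shell would (illustrative only)."""
    env = os.environ if environ is None else environ
    return shlex.split(env.get("PYTEST_ADDOPTS", ""))


args = preview_addopts({"PYTEST_ADDOPTS": "-s -x -vv tests/test_cihai.py"})
print(args)  # ['-s', '-x', '-vv', 'tests/test_cihai.py']
```

Each whitespace-separated token becomes its own argument, so several flags and a test path can share a single variable.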
[docs.pytest.com]: https://docs.pytest.org/

Verbose:

```console
$ env PYTEST_ADDOPTS="--verbose" just start
```

Pick a file:

```console
$ env PYTEST_ADDOPTS="tests/test_cihai.py" just start
```

Run `tests/test_cihai.py` with output capturing disabled (`-s`) and stop on first error (`-x`):

```console
$ env PYTEST_ADDOPTS="-s -x -vv tests/test_cihai.py" just start
```

Drop into `test_cihai_version()` in `tests/test_cihai.py` and stop on first error:

```console
$ env PYTEST_ADDOPTS="-s -x -vv tests/test_cihai.py::test_cihai_version" just start
```

Drop into `pdb` on first error:

```console
$ env PYTEST_ADDOPTS="-x -s --pdb" just start
```

If you have [ipython] installed:

```console
$ env PYTEST_ADDOPTS="--pdbcls=IPython.terminal.debugger:TerminalPdb" just start
```

[ipython]: https://ipython.org/

```console
$ just test
```

You probably didn't see anything but tests scroll by.

If you found a problem or are trying to write a test, you can file an issue on the tracker for the relevant cihai project.

(test-specific-tests)=

### Manual invocation

Test only a file:

```console
$ py.test tests/test_cihai.py
```

will test the `tests/test_cihai.py` tests.

```console
$ py.test tests/test_cihai.py::test_cihai_version
```

tests `test_cihai_version()` inside of `tests/test_cihai.py`.

Multiple targets can be separated by spaces:

```console
$ py.test tests/test_{conversion,exc}.py tests/test_config.py::test_configurator
```

## Documentation

[sphinx-autobuild] will automatically build the docs, watch for file changes and launch a server.

From home directory: `just start-docs`

From inside `docs/`: `just start`

[sphinx-autobuild]: https://github.com/executablebooks/sphinx-autobuild

### Manual documentation (the hard way)

`cd docs/` and `just html` to build. `just serve` to start http server.
Helpers: `just build-docs`, `just serve-docs`

Rebuild docs on file change: `just watch-docs` (requires [entr(1)])

Rebuild docs and run server via one terminal: `just dev-docs`

### View documentation locally

To find the URL of the preview server, read the terminal output; the URL may vary depending on the project!

An example of what to look for:

```console
[I 220816 14:43:41 server:335] Serving on http://127.0.0.1:8035
```

## Formatting / Linting

### ruff

The project uses [ruff] to handle formatting, sorting imports, and linting.

````{tab} Command

uv:

```console
$ uv run ruff check .
```

If you set up manually:

```console
$ ruff check .
```

````

````{tab} just

```console
$ just ruff
```

````

````{tab} Watch

```console
$ just watch-ruff
```

requires [`entr(1)`].

````

````{tab} Fix files

uv:

```console
$ uv run ruff check . --fix
```

If you set up manually:

```console
$ ruff check . --fix
```

````

#### ruff format

[ruff format] is used for formatting.

````{tab} Command

uv:

```console
$ uv run ruff format .
```

If you set up manually:

```console
$ ruff format .
```

````

````{tab} just

```console
$ just ruff-format
```

````

### mypy

[mypy] is used for static type checking.

````{tab} Command

uv:

```console
$ uv run mypy .
```

If you set up manually:

```console
$ mypy .
```

````

````{tab} just

```console
$ just mypy
```

````

````{tab} Watch

```console
$ just watch-mypy
```

requires [`entr(1)`].

````

## Releasing

Since this software is used in production projects, we don't release breaking changes until there's a major feature release.

Choose what the next version is. Assuming the current version is 0.9.0, it could be:

- 0.9.0post0: postrelease, if there was a packaging issue
- 0.9.1: bugfix / security / tweak
- 0.10.0: breaking changes, new features

Let's assume we pick 0.9.1

`CHANGES`: Ensure any PRs merged since last release are mentioned. Give a thank you to the contributor. Set the header with the new version and the date.
Leave the "current" header and _Insert changes/features/fixes for next release here_ at the top:

```markdown
## package-name 0.10.x (unreleased)

- _Insert changes/features/fixes for next release here_

## package-name 0.9.1 (2020-10-12)

- :issue:`1`: Fix bug
```

`package_name/__init__.py` and `__about__.py` - Set version

```console
$ git commit -m 'Tag v0.9.1'
```

```console
$ git push
```

Important: Create and push the tag. Make sure the version is correct and the `pyproject.toml` and `__about__.py` match the version being deployed.

```console
$ git tag v0.9.1
```

```console
$ git push --tags
```

### Automated deployment

CI will automatically push to the PyPI index when a tag is pushed.

### Manual deployment

[uv] handles virtualenv creation, package requirements, versioning, building, and publishing. Therefore there is no setup.py or requirements files.

Update `__version__` in `__about__.py` and `pyproject.toml`:

    git commit -m 'build(cihai): Tag v0.1.1'
    git tag v0.1.1
    git push
    git push --tags

GitHub Actions will detect the new git tag, and in its own workflow run `uv build` and push to PyPI.

[uv]: https://github.com/astral-sh/uv
[entr(1)]: http://eradman.com/entrproject/
[`entr(1)`]: http://eradman.com/entrproject/
[ruff]: https://ruff.rs
[mypy]: http://mypy-lang.org/

---

# Project

Source: https://cihai.git-pull.com/project/

(project)=

# Project

Information for contributors and maintainers.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Contributing
:link: contributing
:link-type: doc
Development setup, running tests, submitting PRs.
:::

:::{grid-item-card} Code Style
:link: code-style
:link-type: doc
Ruff, mypy, NumPy docstrings, import conventions.
:::

:::{grid-item-card} Releasing
:link: releasing
:link-type: doc
Release checklist and version policy.
:::

::::

```{toctree}
:hidden:

contributing
code-style
releasing
```

---

# Releasing

Source: https://cihai.git-pull.com/project/releasing/

# Releasing

Since cihai is used in production projects, breaking changes are deferred until a major feature release.

## Version numbering

Given a current version of `0.36.0`:

- **0.36.0post0** -- post-release, packaging fix only
- **0.36.1** -- bugfix / security / tweak
- **0.37.0** -- new features or breaking changes

## Release checklist

1. Update `CHANGES` -- ensure every merged PR since the last tag is listed. Set the header to the new version and today's date. Keep the *unreleased* placeholder at the top.

2. Bump the version in `pyproject.toml` and `src/cihai/__about__.py`.

3. Commit and tag:

   ```console
   $ git commit -m 'Tag v0.36.1'
   ```

   ```console
   $ git tag v0.36.1
   ```

   ```console
   $ git push && git push --tags
   ```

## Automated deployment

GitHub Actions detects the new tag and runs `uv build` followed by a push to PyPI.

## Manual deployment

If CI is unavailable:

```console
$ uv build
```

```console
$ uv publish
```

---

# Quickstart

Source: https://cihai.git-pull.com/quickstart/

(quickstart)=

# Quickstart

cihai is designed to work out-of-the-box without configuration.

## Installation

```console
$ pip install --user cihai
```

(developmental-releases)=

### Developmental releases

New versions of cihai are published to PyPI as alpha, beta, or release candidates. Identifiers like `a1`, `b1`, and `rc1` mark alpha, beta, and release candidates, respectively.

- [pip]\:

  ```console
  $ pip install --user --upgrade --pre cihai
  ```

- [pipx]\:

  ```console
  $ pipx run --pip-args '\--pre' --spec 'cihai' python -c "import cihai; print(cihai.__version__)"
  ```

- [uv]\:

  ```console
  $ uv add cihai --prerelease allow
  ```

- [uvx]\:

  ```console
  $ uvx --from 'cihai' --prerelease allow python -c "import cihai; print(cihai.__version__)"
  ```

(configuration)=

## Configuration

By default, cihai requires no configuration.
The default file locations follow the {term}`XDG Base Directory` specification for the user's system, and SQLite is used to store, seek, and retrieve data.

You can override cihai's default storage and file directories via a config file.

The default configuration is at {attr}`cihai.constants.DEFAULT_CONFIG`.

Database configuration accepts any SQLAlchemy {sqlalchemy:ref}`database_urls`. If you're using a DB other than SQLite, such as Postgres, be sure to install the requisite driver, such as [psycopg][psycopg].

[xdg directories]: https://specifications.freedesktop.org/basedir-spec/basedir-spec-0.6.html

### Advanced Config

cihai is designed to allow you to incrementally override settings to your liking.

Internally, the config is parsed through {func}`cihai.config.expand_config`. This will replace environment variables, XDG variables and tildes.

You can also enter absolute paths.

Environment variables must be prefixed with a dollar sign, e.g. `${ENVVAR}`.

XDG variables such as _user_cache_dir_, _user_config_dir_, _user_data_dir_, _user_log_dir_, _site_config_dir_, _site_data_dir_ are written with curly brackets only, e.g. `{site_config_dir}`.

Tildes are expanded to the user's home directory.

```{code-block} yaml

database:
  url: '${DATABASE_URL}'
dirs:
  data: '{user_data_dir}/mydata'
  cache: '~/cache/cihai'
  logs: '$ENVVAR/logs'
```

In the example above, Heroku's [DATABASE_URL](https://devcenter.heroku.com/articles/heroku-postgresql#establish-primary-db) is substituted from the environment. The XDG variable for _user_data_dir_ is combined with _mydata/_, which stores the data one level deeper. The environment variable _$ENVVAR_ is also replaced.

You may point to a custom config with the `-c` argument, `$ cihai -c path/to/config.yaml`.

You can also override bootstrapping settings.
The "unihan_options" dictionary in Cihai's configuration will be passed right to {ref}`unihan-etl:index`'s {class}`unihan_etl.core.Packager` `option` param, which is then merged on top of unihan-etl's default settings:

```{code-block} yaml

unihan_options:
  source: 'https://custom-mirror.com/Unihan.zip'  # local paths work too
  work_dir: '/path/to/unzip/files'
  zip_path: '/path/to/store/Unihan.zip'
  fields: ['kDefinition']
  # and / or:
  input_files: ['Unihan_Readings.txt']
```

[psycopg]: http://initd.org/psycopg/
[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pypa.github.io/pipx/docs/
[uv]: https://docs.astral.sh/uv/
[uvx]: https://docs.astral.sh/uv/guides/tools/

---

# Examples

Source: https://cihai.git-pull.com/topics/examples/

(examples)=

# Examples

## Basic usage

_examples/basic_usage.py_:

```{literalinclude} ../../examples/basic_usage.py
:language: python

```

## Character variants

_examples/variants.py_:

```{literalinclude} ../../examples/variants.py
:language: python

```

_examples/variant_ts_difficulties.py_:

```{literalinclude} ../../examples/variant_ts_difficulties.py
:language: python

```

---

# Extending cihai

Source: https://cihai.git-pull.com/topics/extending/

(extend)=

# Extending cihai

Use cihai's abstraction and your dataset's users can receive easy configuration, SQL access, and be available in a growing list of CJKV information.

## Creating new dataset

Expand cihai's knowledge! Create a {class}`cihai.extend.Dataset`.

You can also make your dataset available in open source so other cihai users can use it! If you do, bring it up on the [issue tracker]!

_examples/dataset.py_:

```{literalinclude} ../../examples/dataset.py
:language: python

```

In addition, view our reference implementation of UNIHAN, which is incorporated as a dataset. See {class}`cihai.data.unihan.dataset.Unihan`

[issue tracker]: https://github.com/cihai/cihai/issues

## Plugins: Adding features to a dataset

Extend a dataset with custom behavior to avoid repetition.
Create a {class}`cihai.extend.DatasetPlugin`.

See our reference implementation of {class}`cihai.data.unihan.dataset.UnihanVariants`

Datasets can be augmented with computed methods. These utilize a dataset to pull information out, but are frequently used / generic enough to write a plugin for.

An example of this would be the [suggestion to add variant lookups for UNIHAN](https://github.com/cihai/cihai/pull/131).

## Combining datasets

Combining datasets is usually considered general library usage. But if your usage is common and saves repetition, it's worth considering making it into a reusable extension and open sourcing it.

Using the library to mix and match data from various sources is what cihai is meant to do! If you have a way you're using cihai that you think would be helpful, definitely create an issue, a gist, a GitHub repo, etc! License it permissively please (MIT, BSD, ISC, etc!)

---

# Features

Source: https://cihai.git-pull.com/topics/features/

(features)=

# Features

- Handling CJK Variants

  cihai builds upon [UNIHAN]: "thousands of years worth of writing have produced thousands of pairs which can be used more-or-less interchangeably."

  For more information, see "Unification Rules" on page 679 of _The Unicode Standard_ ([.pdf](http://www.unicode.org/versions/Unicode9.0.0/ch18.pdf)).

- Extensible

  cihai will be able to pull remote CJK datasets. In addition, the handling of variants will create new ways to discover and interpret CJK characters while using these datasets.

- Python API and CLI application

  Cihai can be used as a Python {ref}`API` as well as a command line application via `$ cihai`.

- Asian encoding Swiss Army knife

  Functions under the hood such as {ref}`cihai.conversion ` are tested across python implementations to handle a growing assortment of Asian encodings.
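As a rough illustration of the kind of encoding round-trips such utilities deal with, the sketch below uses only Python's standard codecs. It does not call `cihai.conversion`, and the helper name `describe_char` is hypothetical.

```python
from __future__ import annotations


def describe_char(char: str) -> dict[str, str]:
    """Return a few representations of a single CJK character.

    Hypothetical helper for illustration; not part of cihai.
    """
    return {
        "ucn": f"U+{ord(char):04X}",  # Unicode code point notation
        "utf-8": char.encode("utf-8").hex(),
        "gb2312": char.encode("gb2312").hex(),
        "big5": char.encode("big5").hex(),
    }


info = describe_char("好")
print(info["ucn"])  # U+597D

# Round-trip: bytes in a legacy encoding decode back to the same character.
assert bytes.fromhex(info["gb2312"]).decode("gb2312") == "好"
assert bytes.fromhex(info["big5"]).decode("big5") == "好"
```

The same character maps to different byte sequences in each encoding, which is why conversion helpers need to know the source encoding explicitly.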
[unihan]: http://unicode.org/charts/unihan.html
[variants]: http://www.unicode.org/reports/tr38/tr38-21.html#N10211

---

# Topics

Source: https://cihai.git-pull.com/topics/

# Topics

Explore cihai's capabilities and underlying concepts at a high level, with detailed explanations to help you understand its design and usage.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Features
:link: features
:link-type: doc
CJK variants, extensibility, encoding utilities.
:::

:::{grid-item-card} Examples
:link: examples
:link-type: doc
Annotated code samples for common tasks.
:::

:::{grid-item-card} Extending
:link: extending
:link-type: doc
Create datasets, plugins, and combine data sources.
:::

:::{grid-item-card} Troubleshooting
:link: troubleshooting
:link-type: doc
Common issues and their solutions.
:::

::::

```{toctree}
:hidden:

features
examples
extending
troubleshooting
```

---

# Troubleshooting

Source: https://cihai.git-pull.com/topics/troubleshooting/

(troubleshooting)=

# Troubleshooting

## Python 2.7 and UCS

Note, to get this working on python 2.7, you must have python built with _UCS4_ via `--enable-unicode=ucs4`. You can test for UCS4 with:

```{code-block} python

>>> import sys
>>> sys.maxunicode > 0xffff
True
```

Most packaged and included python distributions will already be built with UCS4 (such as Ubuntu's system python).

On python 3.3 and greater, this distinction no longer exists; no action is needed.

---