# Config - cihai.config
Source: https://cihai.git-pull.com/api/config/

# Config - `cihai.config`

```{eval-rst}
.. automodule:: cihai.config
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Constants - cihai.constants
Source: https://cihai.git-pull.com/api/constants/

# Constants - `cihai.constants`

```{eval-rst}
.. automodule:: cihai.constants
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Conversion - cihai.conversion
Source: https://cihai.git-pull.com/api/conversion/

(cihai.conversion)=

# Conversion - `cihai.conversion`

```{eval-rst}
.. automodule:: cihai.conversion
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Core - cihai.core
Source: https://cihai.git-pull.com/api/core/

# Core - `cihai.core`

```{eval-rst}
.. automodule:: cihai.core
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Database - cihai.db
Source: https://cihai.git-pull.com/api/db/

# Database - `cihai.db`

```{eval-rst}
.. automodule:: cihai.db
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Exceptions - cihai.exc
Source: https://cihai.git-pull.com/api/exc/

# Exceptions - `cihai.exc`

When using cihai via Python, you can catch Cihai-specific exceptions via these. All Cihai-specific
exceptions are catchable via {exc}`~cihai.exc.CihaiException`, since it's the base exception.
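
As a minimal sketch of this pattern, the block below uses local stand-in classes so that it runs
without cihai installed. `DatasetLoadError`, `load_dataset` and `safe_load` are hypothetical names
for illustration, not part of cihai; in real code you would import `CihaiException` from
`cihai.exc` instead of defining it.

```python
from typing import Optional


class CihaiException(Exception):
    """Stand-in for cihai.exc.CihaiException (assumption: defined locally here)."""


class DatasetLoadError(CihaiException):
    """Hypothetical subclass, for illustration only."""


def load_dataset(name: str) -> str:
    """Hypothetical helper that fails for unknown dataset names."""
    if name != "unihan":
        raise DatasetLoadError(f"unknown dataset: {name}")
    return name


def safe_load(name: str) -> Optional[str]:
    """Catch any cihai-specific error through the base exception."""
    try:
        return load_dataset(name)
    except CihaiException:
        return None
```

Catching the base class keeps the handler working if more specific subclasses are added later.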

```{eval-rst}
.. automodule:: cihai.exc
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Extending - cihai.extend
Source: https://cihai.git-pull.com/api/extend/

# Extending - `cihai.extend`

```{eval-rst}
.. automodule:: cihai.extend
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# API Reference
Source: https://cihai.git-pull.com/api/

(api)=

(reference)=

# API Reference

cihai's public API for CJK character lookups and dataset management.

:::{warning}
cihai is pre-1.0. APIs may change between minor versions. Pin to a range:
`cihai>=0.36,<0.37`.

If you need an API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

## Core

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Cihai (core)
:link: core
:link-type: doc
Application object. Bootstrap datasets, run lookups.
:::

:::{grid-item-card} Config
:link: config
:link-type: doc
Configuration loading and expansion.
:::

:::{grid-item-card} Database
:link: db
:link-type: doc
SQLAlchemy engine/session setup and helpers.
:::

:::{grid-item-card} Extend
:link: extend
:link-type: doc
Base classes for datasets and plugins.
:::

::::

## Supporting Modules

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Constants
:link: constants
:link-type: doc
Default paths and configuration values.
:::

:::{grid-item-card} Conversion
:link: conversion
:link-type: doc
CJK encoding conversion utilities.
:::

:::{grid-item-card} Exceptions
:link: exc
:link-type: doc
Exception hierarchy.
:::

:::{grid-item-card} Log
:link: log
:link-type: doc
Logging helpers.
:::

:::{grid-item-card} Types
:link: types
:link-type: doc
Public type aliases.
:::

:::{grid-item-card} Utils
:link: utils
:link-type: doc
Import and general utility helpers.
:::

::::

```{toctree}
:hidden:

core
config
constants
conversion
db
exc
extend
log
types
utils
```

---

# Logging - cihai.log
Source: https://cihai.git-pull.com/api/log/

# Logging - `cihai.log`

```{eval-rst}
.. automodule:: cihai.log
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Typings - cihai.types
Source: https://cihai.git-pull.com/api/types/

# Typings - `cihai.types`

```{eval-rst}
.. automodule:: cihai.types
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Utilities - cihai.utils
Source: https://cihai.git-pull.com/api/utils/

# Utilities - `cihai.utils`

```{eval-rst}
.. automodule:: cihai.utils
   :members:
   :undoc-members:
   :show-inheritance:
```

---

# Datasets
Source: https://cihai.git-pull.com/datasets/

(datasets)=
(data)=

# Datasets

Data sources available through cihai.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} UNIHAN
:link: unihan
:link-type: doc
Unicode Han Database -- readings, meanings, variants.
:::

::::

## Planned datasets

For all data sets, the goal is to achieve:

- Clear and permissive licensing for public and private use
- Compatibility with [Data Packages], for data to be language agnostic and consistent
- Open source scripting used to process data into a common format

| Set       | License           | Data Package           | Project           |
| --------- | ----------------- | ---------------------- | ----------------- |
| UNIHAN    | OK [^cite_unhn-l] | OK [^cite_unhn-d]      | OK [^cite_unhn-p] |
| edict     | OK                | TODO                   | TODO              |
| cedict    | OK [^cite_cdct-l] | TODO                   | TODO              |
| cedictgr  | OK                | TODO                   | TODO              |
| handedict | OK                | TODO                   | TODO              |
| cfdict    | OK                | MISSING [^cite_cfdict] | UNKNOWN           |

[data packages]: https://specs.frictionlessdata.io/data-package/

[^cite_unhn-l]: <http://unicode.org/charts/unihan.html#Disclaimers>
[^cite_unhn-d]: <https://raw.githubusercontent.com/cihai/unihan-etl/master/datapackage.json>
[^cite_unhn-p]: <https://unihan-etl.git-pull.com/>
[^cite_cdct-l]: <https://www.mdbg.net/chinese/dictionary?page=cedict>
[^cite_cfdict]: The database at <http://www.chine-informations.com/chinois/open/CFDICT/download.php> is missing.

```{toctree}
:hidden:

unihan
```

---

# UNIHAN - cihai.data.unihan
Source: https://cihai.git-pull.com/datasets/unihan/

# UNIHAN - `cihai.data.unihan`

### Bootstrapping

```{eval-rst}
.. automodule:: cihai.data.unihan.bootstrap
    :members:
```

```{eval-rst}
.. autoclass:: cihai.data.unihan.dataset.Unihan
   :members:
   :inherited-members:
   :show-inheritance:
```

```{eval-rst}
.. automodule:: cihai.data.unihan.constants
   :members:
   :inherited-members:
   :show-inheritance:
```

### Variants plugin

```{eval-rst}
.. autoclass:: cihai.data.unihan.dataset.UnihanVariants
   :members:
   :inherited-members:
   :show-inheritance:
```

---

# Extending
Source: https://cihai.git-pull.com/design-and-planning/2013/extending/

---
orphan: true
---

(design-and-planning-2013-extending)=

# Extending

:::{note}

This document is part of brainstorming for the cihai project. It is kept for historical purposes only.

:::

_Written Late 2013_

## Minimum usage

1. Create a python module
2. The module has a class with a `.get()` method to look up characters, with the signature
   `(request, response, *args, **kwargs)`.

With `.get()`, your class may be instantiated and passed into `Cihai`. When a user runs `.get()`
inside of `Cihai`, it will also check your module's `.get()`:

```
+----------+
| Cihai    |  The Cihai Class
+----------+
```

It is instantiated with a database to connect to ({class}`sqlalchemy.schema.MetaData`):

```{code-block} python

c = Cihai(metadata=metadata)

```

`MetaData` is part of the sqlalchemy library. It holds connection and table information. In this
instance, cihai shares this information across all plugins that attach to it.

To attach a plugin:

```{code-block} python

from MyCihaiModule import MyDataSet
mydata = MyDataSet()
c.use(mydata)

```

`c`, the instance of {class}`Cihai`, may now access `MyDataSet`'s information.

### Code

```{code-block} python

c = Cihai()

c.use(DatasetExample)
print(c.reverse('hao'))
# {
#     'definition': 'hao'
# }

print(c.get('你'))
# {
#     'definition': 'hao'
# }

```
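
The attach-and-dispatch flow described above can be sketched in a self-contained form. Everything
here is an assumption made for illustration: the `Cihai` class is a toy stand-in rather than the
real application object, and `DatasetExample` with its in-memory dictionary is hypothetical. Only
the `(request, response, *args, **kwargs)` signature is taken from the text.

```python
class Cihai:
    """Toy stand-in for the 2013 design: fans `.get()` out to attached plugins."""

    def __init__(self):
        self._plugins = []

    def use(self, plugin):
        # Attach a plugin instance; anything with a `.get()` method qualifies.
        self._plugins.append(plugin)

    def get(self, request, *args, **kwargs):
        # Each plugin receives the request plus a shared response dict to fill in.
        response = {}
        for plugin in self._plugins:
            plugin.get(request, response, *args, **kwargs)
        return response


class DatasetExample:
    """Hypothetical dataset backed by an in-memory dict."""

    _definitions = {"好": "good"}

    def get(self, request, response, *args, **kwargs):
        if request in self._definitions:
            response["definition"] = self._definitions[request]
        return response


c = Cihai()
c.use(DatasetExample())
print(c.get("好"))
```

A character that no attached plugin knows about simply yields an empty response in this sketch.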

## Growing big

The above was an example of the minimum requirement to have your dataset compatible.

### Importing data into database

One of the goals of Cihai is to provide a common way to access Chinese data. To import the data,
you must create an SQL schema / table for your data.

The pristine format of your data may be CSV, Excel or another format. As long as your data is
normalized into a {obj}`dict` that is compatible with the SQL table, it is OK.

To accommodate this, {class}`Cihai` provides all plugins with an instance of
{class}`sqlalchemy.schema.MetaData` on creation. [SQLAlchemy][sqlalchemy] is the Swiss Army knife of
databases in the Python programming language.

With an instance of `MetaData`, you will be able to create SQL tables, import and retrieve data.

### Deeper

In previous examples, the plugin class with `.get` and `.reverse` character lookups was merged with
a single SQL table.

As said previously, it doesn't matter how or where the data comes from, as long as {class}`Cihai`
can retrieve data via `.get` with the correct arguments and response. The prior example had the data
class combined with a single table.

In databases that use multiple tables, you may create a central dataset class with `.get()` and
access the tables from there.

[sqlalchemy]: http://www.sqlalchemy.org

---

# Information Liberation
Source: https://cihai.git-pull.com/design-and-planning/2013/information_liberation/

---
orphan: true
---

(design-and-planning-2013-information-liberation)=

# Information Liberation

:::{note}

This document is part of brainstorming for the cihai project. It is kept for historical purposes only.

:::

_Written Late 2013_

Datasets should be available under the most permissive license possible, so long as they provide
attribution to the person / official institution that created them.

Cihai believes in permissive licensing (do as you like, for fun, academic work, or profit) with
attribution.

Common concerns people have about their datasets are:

- What if someone uses my dataset for profit purposes?
- What if someone doesn't give attribution to me / my colleagues / my institution / my effort?
- What if someone doesn't contribute modifications back to me / my colleagues / my institution / my
  effort?

If you have not participated in an open source software effort, you would be surprised how happy
people are to contribute to a common effort. Efforts like GNU/Linux are world-wide collaborations
bringing together a rock-solid OS that powers supercomputers and the internet.

## Permissive-licensing your dataset

Even if you have considered the fruits of open source software efforts, you should know that some
of them are built upon restrictive, viral licenses that are commercially unfriendly and
complicated. Borrowing _their_ code means that any non-GPL-compatible effort, whether private _or_
freely available under a permissive license, can't use it without turning the software into GPL
too! But GPL-licensed software can use permissive software in its efforts! It opened my mind after
years of thinking GPL was all principle and virtue! Defend the weak! Protect the innocent!

_Freedom._

To add to the confusion, this is referred to by some people as "Free" software. In reality, it
provides open source, but it inhibits the realities of a value-added, passionate and expressive
society, where people want it to be their choice whether they distribute their changes.

If you ever built something special and felt your hard work deserved to be placed into the market
to see how people receive its value, you would understand. But others may find this selfish!

If you have not seen a permissively licensed software project, you would be surprised to see that
people contribute to these projects even though _there is no fine print requiring them to_.

Look into your roots. If you, at your core, want to open a useful work to the world, why hinder it
with a minefield of caveats?

The success of open source efforts isn't a product of rules, but a result of fast computers, fast
internet, convenient developer tools, brain power, and a community of self-interested / passionate
individuals who put in the hours, had the discipline to learn programming, and had the courage and
desire to help.

### Case study: IPython

This is an observation and doesn't imply endorsement by IPython or its contributors.

There is no requirement to provide an open source derivative (or an upstream patch) for a
modification, so does IPython development go stale? No: there are [93 pages of patches]
committed to the project.

There is no requirement restricting large corporations from using it and giving nothing back, yet
[Microsoft donated $100K to IPython].

Perhaps academic institutions will snub them for using a permissive license? The core developers are
academics.

Perhaps non-profits will snub them for using a permissive license?
[IPython gets a $1.15M Sloan Foundation grant].

[93 pages of patches]: https://github.com/ipython/ipython/pulls?direction=desc&page=1&sort=created&state=closed
[ipython gets a $1.15m sloan foundation grant]: http://ipython.org/sloan-grant.html
[microsoft donated $100k to ipython]: http://ipython.org/microsoft-donation-2013.html

## Conclusion

A collaborative open source effort is passionate and self-interested parties coming together to be
constructive.

Laozi, 老子, pioneered the concept of Wu wei (無爲), we do without doing. In a totally uncontrived
way, without requirement, people from around the world crossed political, language and geographical
barriers to bring together a common creation. It furthers one to see that.

Whether you are a copyleft advocate, an academic, a private party, or none of the above, providing
your data under a permissive license will open your work to the world.

- [Open Data Commons Attribution License (ODC-By) v1.0]
- [MIT License]

[mit license]: http://opensource.org/licenses/MIT
[open data commons attribution license (odc-by) v1.0]: http://opendatacommons.org/licenses/by/1.0

---

# Internal Design decisions
Source: https://cihai.git-pull.com/design-and-planning/2013/internals/

---
orphan: true
---

(design-and-planning-2013-internals)=

# Internal Design decisions

## Convenient relational, cohesive output of datasets

Whether you are using cihai from Python, creating a CJK tool in another programming language, or
simply looking to use a dataset, cihai will provide tools to turn raw datasets into a familiar
table / relational friendly format.

Certain datasets may also offer to download the latest data from a server.

Cihai is a tool to convert CJK datasets to a common, relational format and provide a convenient
Python API for studying them deeply.

## Bootstrapping

Cihai's core functionality relies on a `ForeignKeyConstraint` against an entry in a master table of
Chinese characters / strings, each of which is assigned an integer ID for use as a `ForeignKey`.
This way, datasets added to Cihai can achieve fast lookups and JOINs via integers instead of Unicode
characters.

If the word has multiple characters, Cihai will create the ID for you and reference it for you, and
all future entries added to your dataset will reference it.

This makes installing datasets in Cihai expensive (it requires looking up a Unicode string in a
central table before inserting into the database), with the benefit of potentially returning a huge
cross-section of available CJK data in one swift stroke.
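
The master-table idea can be sketched with the standard library's `sqlite3` module. This is an
illustration under stated assumptions, not cihai's schema: the table names `cjk_string` and
`example_definition`, and the helpers `string_id` and `add_definition`, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE cjk_string (
        id INTEGER PRIMARY KEY,
        string TEXT NOT NULL UNIQUE
    );
    CREATE TABLE example_definition (
        string_id INTEGER NOT NULL REFERENCES cjk_string (id),
        definition TEXT NOT NULL
    );
    """
)


def string_id(conn, string):
    """Return the integer ID for a string, creating the master row if needed."""
    row = conn.execute(
        "SELECT id FROM cjk_string WHERE string = ?", (string,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute("INSERT INTO cjk_string (string) VALUES (?)", (string,))
    return cursor.lastrowid


def add_definition(conn, string, definition):
    # The lookup against the master table is the extra cost paid at insert time.
    conn.execute(
        "INSERT INTO example_definition (string_id, definition) VALUES (?, ?)",
        (string_id(conn, string), definition),
    )


add_definition(conn, "好", "good")
add_definition(conn, "你好", "hello")

# Datasets join on the integer ID rather than on the Unicode string itself.
rows = conn.execute(
    """
    SELECT s.string, d.definition
    FROM cjk_string AS s
    JOIN example_definition AS d ON d.string_id = s.id
    ORDER BY s.id
    """
).fetchall()
print(rows)
```

Every additional dataset table would reference the same `cjk_string.id`, which is what makes a
single JOIN across datasets possible.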

---

# Planning
Source: https://cihai.git-pull.com/design-and-planning/2013/spec/

---
orphan: true
---

(design-and-planning-2013-spec)=

# Planning

:::{note}

This document is part of brainstorming for the cihai project. It is kept for historical purposes only.

:::

_Written Late 2013_

Scribblings on cihai dev.

## Configuration

It can accept a custom configuration file via command line with `-c`:

```console

$ python -m cihai -c myconfig.yml

```

Where your configuration file overrides the default settings. You can see the default settings in
the `cihai` package as `config.yml`.
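
One way to picture the override behaviour is a recursive dictionary merge in which the custom file
wins. This is only a sketch: `merge_config` and the setting names are hypothetical, and whether
nested sections are merged or replaced wholesale is an assumption, not something this document
specifies.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where values in `overrides` win over `defaults`."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical settings, for illustration only.
defaults = {"debug": False, "database": {"url": "sqlite:///default.db", "echo": False}}
custom = {"database": {"url": "sqlite:///custom.db"}}

config = merge_config(defaults, custom)
```

Keys that the custom file leaves out keep their default values, and the defaults themselves are
not modified.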

Developers may use `dev/config.yml`. The TestCase will use the `test_config.yml`.

```console

$ python -m cihai

```

Will start up cihai with normal configuration settings. A configuration file may also be used.

```console

$ python -m cihai -c dev/config.yml

```

## History of CJK libraries

### Unihan

Unihan, which is short for "Han Unification", is a standard published by the Unicode Consortium for
CJK ideographs (also interchangeably referred to as "glyphs", "characters", "chars").

[Unihan's History] goes into greater detail on this. The first electronic release
was in July 1995 as [CJKXREF.TXT] (961 kB). The second release, which resembles the
formatting used in modern versions, was released in July 1996 with Unicode 2.0 as
[Unihan-1.txt]. By accident, the `Unihan-1.txt` (7.9MB) file was missing the final
pieces after `U+8BC1`, and no corrected version was made available. In May 1998,
[Unihan-2.txt] was released with Unicode 2.1.2.

Unicode, Inc. is the center of the universe for all glyphs. Those who study Egyptian
hieroglyphics, which are still mysterious, will find them covered in Unicode block
[U+13000..U+1342F].

[u+13000..u+1342f]: http://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block)
[unihan's history]: http://www.unicode.org/reports/tr38/#History
[cjkxref.txt]: http://www.unicode.org/Public/1.1-Update/CJKXREF.TXT
[unihan-1.txt]: http://www.unicode.org/Public/2.0-Update/Unihan-1.txt
[unihan-2.txt]: http://www.unicode.org/Public/2.1-Update/Unihan-2.txt

### cjklib

[cjklib](https://code.google.com/p/cjklib/) is a major Python library created by Christoph Burgmer for Han character research.

"Cjklib provides language routines related to Han characters (characters based on Chinese characters
named Hanzi, Kanji, Hanja and chu Han respectively) used in writing of the Chinese, the Japanese,
infrequently the Korean and formerly the Vietnamese language(s). Functionality is included for
character pronunciations, radicals, glyph components, stroke decomposition and variant information.
Cjklib is implemented in Python."

<!---
cjklib: https://code.google.com/p/cjklib/
-->

### Cihai

Early iterations of Cihai focused on the external API first. Every data set was to be a plugin.

The idea was that [Hanzi], a similar project in Node.js, could share a similar API and that datasets
could be universal. The potential would be to provide two high-quality libraries for Python and
Node, which are extendable to new data sets and reduce duplication.

It is better to take the time to discover the variable nature of datasets and how they interconnect.

### Current

The next iteration of cihai is to gain an understanding of:

- what different data sets look like and how they return data
- whether there is commonality between them all
- how their results can elicit deeper research into and exploration of Chinese characters

This is an exploration phase.

## External API

### Cihai Spec

Both Cihai and Hanzi libraries can use a similar API.

- Reduce duplicated effort
- Provide a main, tested CJK library to Python and node
- Collaborate to assure both projects have access to open data sets and Chinese character
  techniques.

Larger charter:

- Workgroup to develop a specification for core, pluggable CJK library across various programming
  languages.

  - follow best practices.

    - documentation
    - unit tests / ci
    - consistent with coding idioms / pragmas (pythonic / pocoo / reits, connect / underscore /
      node)

  - be available on package archives (npm, pypi).
  - Across languages, core tools should have similar API method names, creating instance of data
    retrieval object
  - Extendable to new datasets as middleware.
  - Documentation for creating a new middleware.

- Find more data sets and encourage data providers / data owners to use an open data license.
- Find more libraries across various programming language with a CJK tool.

  - If project is a duplicate effort, notify that there is another effort underway and they can
    participate.
  - If project is a new tool:

    - see if they have a dataset. If it does, see license of ODC/OBDC.
    - see if their library is BSD or MIT. If not see if they're willing to license as such. \*
    - see if they are willing to use the Workgroup's API specification.
    - If willing, but no time, offer to patch.
    - If not interested at all, create an adapter for the project as a separate effort.

\* If the library is GPL, it can cause conflict down the road. If the project author does not have
the time / interest in adopting the specification, even creating an adapter to their project could
trigger GPL.

## Licensing

### Core software

BSD or MIT. The Core apps should be BSD 3-clause to protect the name of the app (Cihai or Hanzi).

### Extensions / Contrib licensing

Middleware can be included in the project as officially supported. Contrib and third party plugins
can be available under BSD or MIT.

### Data sets

Data for Chinese should be available under the most permissive license possible.

## How should data be looked up?

I would like to try to encourage use of a single, simple hook, `.get`.

After `.get` is used, the arguments may then be passed through middleware classes / methods.

The same principle applies for `.reverse` matches.

### Chinese character

Currently, Hanzi uses:

```{code-block} javascript

hanzi.decompose('爱')

// transition to:
hanzi.get('爱')

// to look up any indices / decompositions / words where 爱 may match:
hanzi.reverse('爱')

```

Currently cjklib uses:

```{code-block} python

cjk.getStrokeOrder(u'说')
#  transition to:
cjk.get('说')

```

```{code-block} python

Cihai.get('好')

```

### String of Chinese Characters

Use `.get` too. This may seem problematic, but checking the `.length` or `len()` of the argument can
suffice.

```{code-block} javascript

var decomposition = hanzi.decomposeMany('爱橄黃');
// transition to
var decomposition = hanzi.get('爱橄黃');

```

```{code-block} python

Cihai.get('爱橄黃')

```
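
A single entry point that branches on the length of its argument could look like the sketch below.
The names `get` and `lookup_char` are hypothetical, and `lookup_char` returns placeholder data (the
character and its code point) in place of a real dataset lookup.

```python
def lookup_char(char: str) -> dict:
    """Hypothetical per-character lookup; returns placeholder data only."""
    return {"char": char, "codepoint": f"U+{ord(char):04X}"}


def get(text: str):
    """One hook for both cases: a single character or a string of characters."""
    if len(text) == 1:
        return lookup_char(text)
    # For longer input, look up each character in order.
    return [lookup_char(char) for char in text]


single = get("好")
many = get("爱橄黃")
```

The caller never has to choose between a "one" and a "many" method; the return type (a dict or a
list of dicts) reflects what was passed in.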

## How should data returned look? Schema.

Questions:

- Is there already an open standard that can be adopted?
- Should `.get` return a raw object / dict or a result object:

  ```
  c = c.get('你')  # return a ResultObject / Backbone.Model / mongoose
                   # document type of object.
  c.toJSON()  # backbone / sqlalchemy style
  ```

The data should follow the same schema. What would an API response for these possibilities look
like?

If something generic like `.get()` is entered:

- character decomposition
- a unihan field ('kDefinition', 'kStrokes', 'kFrequency', ...)
- <https://github.com/tsroten/zhon>
- <https://github.com/fxsjy/jieba>

If `.get` is the only way to retrieve hits, more possibilities exist.

For hanzi/node:

```{code-block} javascript

results = hanzi.get('你好。怎么样？')

```

or for cihai/python:

```{code-block} python

results = cihai.get('你好。怎么样？')

```

This may return hits from jieba middleware (jieba doesn't exist in node yet):

```
results.words = [
'你好',
'怎么样'
]
```

The user may then process the results further:

```{code-block} python

for word in results.words:
    print(cihai.get(word))

```

or

```{code-block} javascript

_.each(results.words, function(word) {
    console.log(hanzi.get(word))
});

```

:::{warning}

If dictionaries / datasets are extensible, there may be collision if they can reserve keys in the
official result namespace.

:::

Two plugins could try to reserve `.words` as a name. Many dictionaries would want to reserve
`.definition` as a name.

To counteract this, a namespace can be adopted for middleware, or the Core can resolve the
conflict:

1. Append underscore + number on conflict, etc. (`c.definition_1`, `c.definition_2`):

   The first middleware using `words` can get `results.words`. The middleware called after will get
   `results.words_1`.

   This is seen in [SQLAlchemy's labels] to [avoid label
   collisions].

2. Middleware / datasets use namespace with `_` (`c.unihan_kDefinition`):

   Pros:

   - iterable access via `c.keys()` in Python and `for (var key in dict)` in JS.
   - all data returned can be accessed without nesting into dotted namespaces.

   Cons:

   - `result.unihan_kDefinition_these_things_getlong`
   - extension name and word separation can be confused.

3. Middleware may use dot namespace (`c.unihan.kDefinition`)

   Pros:

   - Internal Core API is far simpler and lighter
   - Easier to look at
   - More common practice, e.g. [aws_cli].
   - Middleware is a package module, symbolically `.`'s are used to separate modules and packages
     (java, python, informally in JS).
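
The three options above can be compared side by side in a small sketch. The function names and the
sample middleware output are hypothetical; the point is only to show what the merged result looks
like under each naming scheme.

```python
def merge_with_suffix(results):
    """Option 1: append `_1`, `_2`, ... when a key is already taken."""
    merged = {}
    for _name, data in results:
        for key, value in data.items():
            candidate, n = key, 0
            while candidate in merged:
                n += 1
                candidate = f"{key}_{n}"
            merged[candidate] = value
    return merged


def merge_with_prefix(results):
    """Option 2: flat keys prefixed with the middleware name."""
    return {
        f"{name}_{key}": value
        for name, data in results
        for key, value in data.items()
    }


def merge_nested(results):
    """Option 3: one nested namespace per middleware."""
    return {name: dict(data) for name, data in results}


# Hypothetical middleware output, for illustration only.
results = [
    ("unihan", {"definition": "good"}),
    ("cedict", {"definition": "good; well"}),
]

suffixed = merge_with_suffix(results)
prefixed = merge_with_prefix(results)
nested = merge_nested(results)
```

With the suffix scheme the meaning of `definition_1` depends on middleware order, while the
prefixed and nested schemes keep the source of each value visible in the key.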

[sqlalchemy's labels]: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/lib/sqlalchemy/sql/elements.py#L2393
[avoid label collisions]: https://github.com/zzzeek/sqlalchemy/blob/347e89044ce53ef0ec8d07937cd8279e9c4e5226/test/sql/test_compiler.py#L2549
[aws_cli]: https://github.com/aws/aws-cli

## Extension philosophy

The middleware approach provides the best practice to get the job done.

[Connect] in Node represents the best practice in plugin architecture in JS. Middleware is
added as a way to provide a lite, dead-simple framework.

Cihai / Hanzi can take a similar approach.

Hanzi can take its example directly from Connect's approach. It is clean and proven. Cihai can note
that middleware is already used in Django, and packages can be maintained using the pattern for
Flask and Sphinx extensions. Flask already has experience / lessons learned from packaging and
namespacing extensions.

It can use the same data sets, similar API and extension strategy.

[connect]: https://github.com/senchalabs/connect

## Accessing extensions directly?

Perhaps extensions can also be searched directly:

```
c.unihan.get('好')
```

Third-party APIs can specify optional extra arguments, for instance, unihan may allow searching by
one field:

```
c.unihan.get('好', 'kDefinition')
```

This allows a simple way to "drill down" cjk data across extensions.

## API examples

Example:

```{code-block} python

# Retrieves all rows and creates a keyed object:
obj = unihan.get('好')
obj.kDefinition
obj['kDefinition']
obj.keys()  # ['kDefinition',]

# Restrict the lookup to one field:
obj = unihan.get('好', 'kDefinition', ...)
obj.kDefinition  # good
obj.kStrokes     # None

```
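
A keyed object with both attribute and item access can be sketched as a small `dict` subclass. The
`Result` class, the `get` function and the sample row are hypothetical and exist only to
illustrate the access pattern shown above, including a missing field reading as `None`.

```python
class Result(dict):
    """Dict whose keys are also readable as attributes; missing fields read as None."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        if name.startswith("_"):
            raise AttributeError(name)
        return self.get(name)


# Hypothetical rows, for illustration only.
_ROWS = {"好": {"kDefinition": "good", "kStrokes": "6"}}


def get(char, *fields):
    """Return all fields for a character, or only the requested ones."""
    row = _ROWS.get(char, {})
    if fields:
        row = {field: row[field] for field in fields if field in row}
    return Result(row)


everything = get("好")
only_definition = get("好", "kDefinition")
```

Because `Result` is still a `dict`, `keys()`, iteration and `obj['kDefinition']` behave as usual.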

## Creating a cihai plugin

```{code-block} python

class Unihan(Cihai.Contrib):

    """
    Utilizing a parent class can allow raising ``NotImplementedError``
    errors. Further, this can provide access to a ``db``.

    However, ultimately, the only thing that's really required is::

        class Example(object):

            def get(self, char):
                return {
                    'char': char
                }

    """

    def get(self, char):
        pass

    def install(self):
        pass

cihai = Cihai()
cihai.use(Unihan)  # register the middleware with cihai
c = cihai.get('好')
>>> c.keys()
['unihan']
>>> c.get('好')
<Cihai.Contrib.Unihan>
>>> print(c.get('好'))
>>> print(c.get('好').parent)

# Below this point, libunihan splits into subplugins for its libraries.
>>> print(dict(c.get('好')))

```

Cihai will allow extensibility to new dictionaries, vocabularies and data.

Middleware allows an arbitrary plugin to make data available.

By default, `Cihai()` creates an instance of Cihai with access to {meth}`Cihai.get`.

However, since no middleware is included with Cihai, no results are returned.

With `Cihai(middleware=[Cihai.Unihan])`

or `c = Cihai()`

`c.use(Cihai.Unihan)`

the Cihai_Unihan is available. What is Cihai_Unihan? Simply an object with:

```{code-block} python

class Unihan(Cihai.Contrib):

    pass

```

[hanzi]: https://github.com/nieldlr/Hanzi

---

# Internals Planning
Source: https://cihai.git-pull.com/design-and-planning/2017/spec/

---
orphan: true
---

(design-and-planning-2017-spec)=

# Internals Planning

Created 2017-04-29

1. {ref}`zero-config` - cihai should be able to work without configuration with a default data
   backend.
2. {ref}`incremental-config` - cihai should be incrementally configurable, such as by specifying
   where data should be outputted.
3. {ref}`relational-backend` - cihai will use [SQLAlchemy] as a database backend to store
   information for retrieval.
4. {ref}`automatic-extensions` - cihai will make data accessible to third party libraries if they
   exist in the script's site-packages.

   e.g. If [pandas] is found, it will be able to return a {class}`pandas.DataFrame` for a
   queried set of information.

5. {ref}`unihan-core` - cihai will use [UNIHAN] as a core and source of truth for information,
   as it contains all the glyphs, is reliable, free and well-maintained, and provides a good
   source of starter information.
6. {ref}`data-normalization` - cihai will adopt a standard data format to store additional CJK data
   sets within.
7. {ref}`data-liberation` - cihai libraries will be available under permissive licenses.

(zero-config)=

## Zero config

cihai will be able to be used immediately without a user configuring their system.

cihai will conform with the [XDG specification] for determining where to check
out data to. This includes:

- Where to store downloaded source files, e.g. _XDG_CACHE_HOME/cihai/downloads_
- Where to store default backend data, e.g. _XDG_DATA_DIRS/cihai/data_, as well as the default file
  name used within, e.g. _data.sqlite_
- Where to check for configuration files, e.g. _XDG_CONFIG_HOME/cihai_, as well as the default file
  name used within, e.g. _cihai.yaml_

These directories are where cihai will, by default, store information and search for the
configuration used in {ref}`incremental-config`.
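
Resolving these locations can be sketched with the standard library alone. The helper names
`xdg_dir` and `default_paths` are hypothetical. As an assumption, the data path below uses the
per-user `XDG_DATA_HOME` variable rather than the `XDG_DATA_DIRS` search list named above, and the
fallbacks (`~/.cache`, `~/.local/share`, `~/.config`) are the defaults given in the XDG Base
Directory specification.

```python
import os
from pathlib import Path


def xdg_dir(variable: str, default: str, environ=None) -> Path:
    """Return the directory named by an XDG variable, or its spec default."""
    environ = os.environ if environ is None else environ
    value = environ.get(variable, "")
    # An unset or empty variable falls back to the default under the home directory.
    return Path(value) if value else Path.home() / default


def default_paths(environ=None) -> dict:
    """Build the default download, database and config locations."""
    cache = xdg_dir("XDG_CACHE_HOME", ".cache", environ)
    data = xdg_dir("XDG_DATA_HOME", ".local/share", environ)
    config = xdg_dir("XDG_CONFIG_HOME", ".config", environ)
    return {
        "downloads": cache / "cihai" / "downloads",
        "database": data / "cihai" / "data" / "data.sqlite",
        "config": config / "cihai" / "cihai.yaml",
    }


paths = default_paths()
```

Passing a plain dictionary as `environ` makes the function easy to exercise without touching the
real environment.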

(incremental-config)=

## Incremental configuration

The [SQLAlchemy] data backend is customizable. For SQLite, this includes the file path used to
store the SQLite file.

[xdg specification]: https://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html

(relational-backend)=

## Relational backend

cihai will be powered by a relational database backend.

Most Python distributions include support for [SQLite], which, in conjunction with
{ref}`zero configuration <zero-config>`, makes for a data store that will work across a wide array
of systems.

The data that cihai organizes will be primarily indexable by the glyph, and joined upon the glyph to
pull in an ever expanding assortment of information on that character.

(automatic-extensions)=

## Automatic extension detection

Don't reinvent the wheel, interoperate.

cihai will check for libraries such as pandas and other tabular libraries to easily produce native
objects for the user based on their cihai data lookup.

This comes at no performance penalty, since the ability to export to a third party object, such as
a {class}`pandas.DataFrame`, is only used when that library is present.
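
Detecting an optional library without importing it can be done with `importlib.util.find_spec`.
The sketch below is an assumption about how such detection might look, not cihai's implementation;
`has_library` and `to_native` are hypothetical names.

```python
from importlib.util import find_spec


def has_library(name: str) -> bool:
    """Report whether a top-level module is importable, without importing it."""
    return find_spec(name) is not None


def to_native(rows: list):
    """Return a pandas.DataFrame when pandas is installed, else the rows unchanged."""
    if has_library("pandas"):
        import pandas  # imported lazily, only when present

        return pandas.DataFrame(rows)
    return rows


result = to_native([{"char": "好", "definition": "good"}])
```

The import happens inside the function, so users without pandas pay no import cost and get plain
Python objects back.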

(unihan-core)=

## UNIHAN core

cihai's library of CJK information will be backed by the reliable [UNIHAN] database, which
is approved by the Unicode Consortium.

### Operation

It is to be determined if UNIHAN will be vendorized in the packaging or retrieved remotely.

(data-normalization)=

## Data normalization

CJK datasets made available by cihai and contributors should follow a yet-to-be-determined
standard for keeping data conserved, readily available and sustainable.

### Standards

The initial consideration, since 2013, was datasets would follow [Data Packages].

In place of Data Packages, a simpler and more lax guideline, along with Python interfaces, may be
considered. This determination is pending further review of datasets.

In place of frictionlessdata's data package libraries, cihai may opt for a simpler, yet more
powerful system for making tabular data.

(data-liberation)=

## Data liberation

CC-0, MIT, ISC, BSD. Data sets should be available under licenses free from unintended side effects
of derivative creation.

[sqlite]: https://sqlite.org/
[pandas]: http://pandas.pydata.org/
[sqlalchemy]: https://www.sqlalchemy.org
[unihan]: http://www.unicode.org/reports/tr38/
[data packages]: http://frictionlessdata.io/data-packages/

---

# Extensions
Source: https://cihai.git-pull.com/design-and-planning/2018/plugin-system/

---
orphan: true
---

(design-and-planning-2018-plugin-system)=

# Extensions

Initially discussed in #131 [^id6]

The provisional recommended naming convention is this:

```
cihai_{dataset}(?_{extension})
cihai_unihan
cihai_unihan_variants
```

# Benefits

Automatic configuration: datasets are automatically namespaced as configs. Reciprocally, they have
access to the instance of cihai's configuration.

With the database mixin, your dataset can be automatically configured and work with your user's
configured DB backend. By default, it's SQLite! Reciprocally, you can also access databases and
tables connected to cihai.

```{code-block} python

from cihai.core import Cihai

c = Cihai()
c.add_dataset('unihan')  # install package
c.unihan.lookup('好')
# bootstraps
c.add_dataset('unihan', namespace='unihan2')  # install package
# checks
c.unihan2.lookup('好')

```

The optional `namespace=` lets cihai offer root-level access to datasets, while still being able to
deprecate / move a dataset (however unlikely) if it were to conflict with a new method name /
property on the main cihai object.
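One way to picture the namespacing described above is a small stand-in object that attaches each dataset under its namespace and refuses names that would shadow an existing attribute. `ToyCihai` and `ToyUnihan` are hypothetical classes written for this sketch; they are not cihai's implementation.

```python
class ToyUnihan:
    """Hypothetical dataset used only for this illustration."""

    def lookup(self, char):
        # A real dataset would query its backend here.
        return {"char": char}


class ToyCihai:
    """Stand-in for the main cihai object; not the real implementation."""

    def __init__(self):
        self.datasets = {}

    def add_dataset(self, dataset_cls, namespace):
        # Refuse namespaces that would shadow an existing method, property,
        # or previously registered dataset on the main object.
        if hasattr(self, namespace):
            raise ValueError(f"namespace {namespace!r} is already in use")
        instance = dataset_cls()
        self.datasets[namespace] = instance
        setattr(self, namespace, instance)  # enables c.<namespace>.lookup(...)
        return instance


c = ToyCihai()
c.add_dataset(ToyUnihan, namespace="unihan")
c.add_dataset(ToyUnihan, namespace="unihan2")
print(c.unihan2.lookup("好"))
```

The conflict check is the part that lets a dataset be moved to another namespace if the main object later grows a method of the same name.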

It also allows namespacing a forked dataset, and adding it:

```{code-block} python

c.add_dataset('my_forked_unihan', namespace='unihan')  # install package

```

In the future, with libvcs:

```{code-block} python

c.add_dataset(Unihan, namespace='unihan')  # raw class

```

Future possibilities:

This makes it possible to develop locally, make a small adjustment, or maintain a VCS branch in the
event a dataset falls out of sync or you want to hack on it / fork it.

```{code-block} python

c.add_dataset('package.to.unihan.Unihan', namespace='unihan')  # import string

```

# Versioning

Please provide a **version** on your package if you distribute it. This ensures that cihai,
datasets, and extensions get fresh features and data, while also letting APIs be locked so
production use cases don't break.

# Extending datasets

You can also create packages, or even pure functions, that extend datasets.

Extensions for datasets have access to the dataset's configuration, the SQLAlchemy database (if the
dataset uses one), and any other data access the dataset makes available. For instance, if the
dataset has a custom data backend, it can make that available to the extension.

```{code-block} python

c.add_dataset('unihan')
c.unihan.add_extension(Variants)
c.unihan.variants.lookup('好')

# The same optional namespace= is possible:
c.unihan.add_extension(Variants, namespace='variants2')
c.unihan.variants2.lookup('好')

```

For the first draft, pointing straight to the package -> module -> object via import string is the
surest thing (since this is compatible with the user's local Python package environment and works
well regardless of developing or general usage):

`c.unihan.add_extension('package.to.import.unihan.Unihan')`

This is similar to the way FLASK_CONFIG points to an object inside of a python module.
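Resolving such an import string can be sketched with the standard library's `importlib`. The helper name `import_string` is an assumption made for this example, not a cihai function; the demonstration uses a standard-library class so it runs without any dataset package installed.

```python
import importlib


def import_string(dotted_path: str):
    """Resolve 'package.module.Object' to the object it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ImportError(f"{dotted_path!r} is not a dotted import path")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(
            f"module {module_path!r} has no attribute {attr!r}"
        ) from exc


# Demonstrated with a standard-library class instead of a real dataset.
ordered_dict_cls = import_string("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

Because the lookup goes through the normal import machinery, it works with whatever is installed in the user's environment, whether in development mode or as a regular package.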

# Todo

- Allow cihai to install packages via pip.

# History

Early cihai ideas made SQLAlchemy a requirement.

The initial plan was to keep everything under a single namespace and database, and to reduce the
number of queries by building big ones. This was phased out in favor of making cihai easy to hack on.

# Idea: pip-based add_dataset/add_extension

For development / hacking purposes, the same import string, file path, and VCS options still exist:

```{code-block} python

# import string
c.add_dataset('package.to.unihan', classname='Unihan', namespace='unihan')
c.add_dataset(
    'git+https://github.com/moo/cihai-unihan#test-branch',
    classname='Unihan',
    namespace='unihan'
)
c.add_dataset('./path/to/dataset', classname='Unihan', namespace='unihan')

c.unihan.add_extension('cihai_unihan_variants')
c.unihan.add_extension(
    'git+https://github.com/moo/cihai-unihan#test-branch', namespace='unihan'
)
c.unihan.add_extension('./path/to/dataset', classname='Unihan', namespace='unihan')

```
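The three kinds of source shown above (VCS URL, local path, import string) could be told apart with a small dispatcher before deciding how to install or import them. This is only a sketch of the idea: `classify_source`, its return values, and the prefix rules are assumptions, not part of cihai or pip.

```python
def classify_source(spec: str) -> str:
    """Guess how a dataset or extension source string should be handled."""
    # pip-style VCS URLs start with the VCS name followed by "+".
    if spec.startswith(("git+", "hg+", "svn+", "bzr+")):
        return "vcs"
    # Relative, absolute, and home-relative filesystem paths.
    if spec.startswith(("./", "../", "/", "~")):
        return "path"
    # Anything else is treated as an importable name,
    # which also covers bare package names.
    return "import_string"


for spec in (
    "package.to.unihan",
    "git+https://github.com/moo/cihai-unihan#test-branch",
    "./path/to/dataset",
):
    print(spec, "->", classify_source(spec))
```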

# Idea: Namespacing

As of now, the idea is to avoid overengineering / bureaucracy caused by adopting setuptools
namespacing. Be like Django, which doesn't enforce package naming.

One option is `cihai.extensions.datasetname.extensionname`, but that has difficulties. [^id7]

This namespace / organization makes it possible for cihai to detect extensions. [^id8]

cihai-contrib makes packages available under cihaicontrib, similar to sphinx-contrib's structure,
[^id9] which uses Python's namespace packages. [^id10]

# See Also

[^id6]:
    Add variant methods. GitHub issues for cihai. <https://github.com/cihai/cihai/pull/131>.
    Accessed September 1st, 2018.

[^id7]:
    Flask's deprecation of flask.ext and flask_ext: <http://flask.pocoo.org/docs/1.0/extensiondev/>
    Accessed September 1st, 2018.

[^id8]:
    Sphinx extensions <http://www.sphinx-doc.org/en/master/extdev/index.html#dev-extensions>.
    Accessed September 1st, 2018.

[^id9]:
    sphinx-contrib <https://github.com/sphinx-contrib/documentedlist/tree/master/sphinxcontrib>
    Accessed September 1st, 2018.

[^id10]:
    Python namespaces. <https://packaging.python.org/guides/packaging-namespace-packages/> Accessed
    September 7th, 2018.

---

# Design and Planning
Source: https://cihai.git-pull.com/design-and-planning/

(design-and-planning)=

# Design and Planning

## 2018

- {ref}`design-and-planning/2018/plugin-system`

## 2017

- {ref}`design-and-planning/2017/spec`

## Late 2013

These were part of the initial brainstorming for the project and are preserved for historical purposes only.

- {ref}`design-and-planning/2013/extending`
- {ref}`design-and-planning/2013/internals`
- {ref}`design-and-planning/2013/information_liberation`
- {ref}`design-and-planning/2013/spec`

## Late 2012

- [State of datasets] on cjklib tracker

[state of datasets]: https://github.com/cburgmer/cjklib/issues/3

---

# Glossary
Source: https://cihai.git-pull.com/glossary/

(glossary)=

# Glossary

```{eval-rst}
.. glossary::

    CJK
        1. In computer software, `internationalization
           <https://en.wikipedia.org/wiki/Internationalization_and_localization>`_
           of *Chinese, Japanese, and Korean* language.
        2. In cihai, specifically, character information from Chinese,
           Japanese and Korean languages, such as definitions, dictionary
           index references, phonetics, character decompositions and
           stroke information (order and amount).

    UNIHAN
        A character database of CJK information provided by the Unicode
        Consortium. See the documentation at http://www.unicode.org/reports/tr38/.

    cjklib
        A popular CJK library in Python created by Christoph Burgmer

    cihai
        1. A CJK library in Python built from the ground up under a
           permissive license and modern Python development practices
        2. A workgroup for finding, digitizing, and preserving CJK
           datasets

    SQLAlchemy
        A relational database library used to store and retrieve character
        information in cihai

    Data Packages
        A standard for storing data, see
        http://frictionlessdata.io/data-packages/

    XDG Base Directory
        A specification for directory locations designed to work across
        platforms. See https://specifications.freedesktop.org/basedir-spec/basedir-spec-0.6.html.
```

[internationalization]: https://en.wikipedia.org/wiki/Internationalization_and_localization

---

# Changelog
Source: https://cihai.git-pull.com/history/

(history)=

```{include} ../CHANGES

```

---

# cihai
Source: https://cihai.git-pull.com/

(index)=

# cihai

Python library for {term}`CJK` (Chinese, Japanese, Korean)
character data. Look up readings, definitions, and variants from the
[UNIHAN](datasets/unihan.md) database and beyond.

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Quickstart
:link: quickstart
:link-type: doc
Install and make your first lookup in 5 minutes.
:::

:::{grid-item-card} Topics
:link: topics/index
:link-type: doc
Features, examples, extending, troubleshooting.
:::

:::{grid-item-card} API Reference
:link: api/index
:link-type: doc
Every public class, function, and exception.
:::

::::

::::{grid} 1 2 3 3
:gutter: 2 2 3 3

:::{grid-item-card} Datasets
:link: datasets/index
:link-type: doc
UNIHAN and planned data sources.
:::

:::{grid-item-card} Internals
:link: internals/index
:link-type: doc
Private APIs -- no stability guarantee.
:::

:::{grid-item-card} Contributing
:link: project/index
:link-type: doc
Development setup, code style, release process.
:::

::::

## Install

```console
$ pip install cihai
```

```console
$ uv add cihai
```

## At a glance

```python
from cihai.core import Cihai

c = Cihai()

if not c.unihan.is_bootstrapped:  # download and install UNIHAN to db
    c.unihan.bootstrap()

query = c.unihan.lookup_char('好')
glyph = query.first()
print("lookup for 好: %s" % glyph.kDefinition)
# lookup for 好: good, excellent, fine; well

query = c.unihan.reverse_char('good')
print('matches for "good": %s ' % ', '.join([glph.char for glph in query]))
# matches for "good": 㑘, 㑤, 㓛, 㘬, 㙉, 㚃, ...
```

See [Quickstart](quickstart.md) for detailed installation and first steps.

```{toctree}
:hidden:

quickstart
topics/index
api/index
datasets/index
internals/index
project/index
design-and-planning/index
history
glossary
GitHub <https://github.com/cihai/cihai>
```

---

# Config reader - cihai._internal.config_reader
Source: https://cihai.git-pull.com/internals/api/config_reader/

# Config reader - `cihai._internal.config_reader`

```{eval-rst}
.. automodule:: cihai._internal.config_reader
   :members:
   :undoc-members:
   :show-inheritance:
   :no-value:
```

---

# Internal API
Source: https://cihai.git-pull.com/internals/api/

(internal_api)=

# Internal API

```{module} cihai

```

:::{warning}
Be careful with these! Internal APIs are **not** covered by version policies. They can break or be removed between minor versions!

If you need an internal API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

```{toctree}
:caption: Internal API
:maxdepth: 1

config_reader
types
```

---

# Typings - cihai._internal.types
Source: https://cihai.git-pull.com/internals/api/types/

# Typings - `cihai._internal.types`

```{eval-rst}
.. automodule:: cihai._internal.types
   :members:
   :undoc-members:
   :show-inheritance:
   :no-value:
```

---

# Internals
Source: https://cihai.git-pull.com/internals/

(internals)=

# Internals

:::{danger}
**No stability guarantee.** Internal APIs are **not** covered by version
policies. They can break or be removed between any minor versions without
notice.

If you need an internal API stabilized please [file an issue](https://github.com/cihai/cihai/issues).
:::

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Config Reader
:link: api/config_reader
:link-type: doc
Internal configuration file loading and expansion.
:::

:::{grid-item-card} Types
:link: api/types
:link-type: doc
Internal type aliases and protocols.
:::

::::

```{toctree}
:hidden:

api/index
```

---

# Code Style
Source: https://cihai.git-pull.com/project/code-style/

# Code Style

cihai follows consistent coding standards across all repositories in the
cihai organization.

## Formatting and linting

[ruff](https://ruff.rs) handles formatting, import sorting, and linting in a
single tool.

```console
$ uv run ruff check .
```

```console
$ uv run ruff format .
```

Auto-fix safe lint violations:

```console
$ uv run ruff check . --fix --show-fixes
```

## Type checking

[mypy](http://mypy-lang.org/) with `strict = true` is used for static type
checking.

```console
$ uv run mypy .
```

## Docstrings

Use [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html)
docstrings with reStructuredText markup.

## Imports

- Use `from __future__ import annotations` at the top of every module.
- Use namespace imports for the standard library: `import pathlib` rather than
  `from pathlib import Path`.
- Use `import typing as t` and access members via `t.NamedTuple`, etc.
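A small module following the docstring and import conventions above might look like this. `FileInfo` and `describe` are made-up names for illustration and do not exist in cihai.

```python
from __future__ import annotations

import pathlib
import typing as t


class FileInfo(t.NamedTuple):
    """Name and suffix of a file path."""

    name: str
    suffix: str


def describe(path: pathlib.Path) -> FileInfo:
    """Summarize a path without touching the filesystem.

    Parameters
    ----------
    path : pathlib.Path
        Path to summarize.

    Returns
    -------
    FileInfo
        The final path component and its suffix.
    """
    return FileInfo(name=path.name, suffix=path.suffix)


print(describe(pathlib.Path("data/Unihan.zip")))
```

Namespace imports keep the origin of each name visible at the call site (`pathlib.Path`, `t.NamedTuple`), which is the point of the convention.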

---

# Contributing
Source: https://cihai.git-pull.com/project/contributing/

(contributing)=

(developing)=

(workflow)=

# Contributing

As an open source project, all cihai projects accept contributions through GitHub, GitLab and Codeberg. Below you will find
resources on the internals of the project.

:::{note}

This guide applies to all cihai projects, not just the cihai repo.

:::

Cihai projects use standard conventions and patterns based on best practices in Python.

To be efficient at debugging, developing, testing, documenting, etc., it helps to familiarize yourself with the tools used, independently if needed.

`<cihai-project>` can be assumed to be an existing or future cihai project, including
[cihai](https://github.com/cihai/cihai),
[cihai-cli](https://github.com/cihai/cihai-cli),
[unihan-etl](https://github.com/cihai/unihan-etl),
[unihan-db](https://github.com/cihai/unihan-db).
See [GitHub](https://github.com/cihai), [GitLab](https://gitlab.com/cihai) and
[Codeberg](https://codeberg.org/cihai).

## Development environment

[uv] is required for development.

```console
$ git clone https://github.com/cihai/<cihai-project>.git
```

```console
$ cd <cihai-project>
```

So if `<cihai-project>` is [cihai]:

```console
$ git clone https://github.com/cihai/cihai.git
```

```console
$ cd cihai
```

## Install dependencies

```console
$ uv sync --all-extras --dev
```

Justfile commands prefixed with `watch-` will watch files and rerun.

## Tests

[pytest] is used for tests.

```console
$ uv run py.test
```

### Rerun on file change

via [pytest-watcher] (works out of the box):

```console
$ just start
```

via [`entr(1)`] (requires installation):

```console
$ just watch-test
```

[pytest-watcher]: https://github.com/olzhasar/pytest-watcher

### Manual (just the command, please)

```console
$ uv run py.test
```

or:

```console
$ just test
```

### pytest options

_For filename / test names within, examples will be for [cihai], if using a
different cihai project check the filename and test names accordingly_:

`PYTEST_ADDOPTS` can be set in the commands below. For more
information read [docs.pytest.com] for the latest documentation.

[docs.pytest.com]: https://docs.pytest.org/

Verbose:

```console
$ env PYTEST_ADDOPTS="--verbose" just start
```

Pick a file:

```console
$ env PYTEST_ADDOPTS="tests/test_cihai.py" just start
```

Run `tests/test_cihai.py` with output capturing disabled, stopping on the first error:

```console
$ env PYTEST_ADDOPTS="-s -x -vv tests/test_cihai.py" just start
```

Run only `test_cihai_version()` in `tests/test_cihai.py` and stop on first error:

```console
$ env PYTEST_ADDOPTS="-s -x -vv tests/test_cihai.py::test_cihai_version" just start
```

Drop into `pdb` on first error:

```console
$ env PYTEST_ADDOPTS="-x -s --pdb" just start
```

If you have [ipython] installed:

```console
$ env PYTEST_ADDOPTS="--pdbcls=IPython.terminal.debugger:TerminalPdb" just start
```

[ipython]: https://ipython.org/

```console
$ just test
```

You probably didn't see anything but tests scroll by.

If you found a problem or are trying to write a test, you can file an
issue on the tracker for the relevant cihai project.

(test-specific-tests)=

### Manual invocation

Test only a file:

```console
$ py.test tests/test_cihai.py
```

will test the `tests/test_cihai.py` tests.

```console
$ py.test tests/test_cihai.py::test_cihai_version
```

tests `test_cihai_version()` inside of `tests/test_cihai.py`.

Multiple can be separated by spaces:

```console
$ py.test tests/test_{conversion,exc}.py tests/test_config.py::test_configurator
```

## Documentation

[sphinx-autobuild] will automatically build the docs, watch for file changes and launch a server.

From the project root: `just start-docs`. From inside `docs/`: `just start`.

[sphinx-autobuild]: https://github.com/executablebooks/sphinx-autobuild

### Manual documentation (the hard way)

`cd docs/` and `just html` to build. `just serve` to start http server.

Helpers: `just build-docs`, `just serve-docs`

Rebuild docs on file change: `just watch-docs` (requires [entr(1)])

Rebuild docs and run server via one terminal: `just dev-docs`

### View documentation locally

To find the URL of the preview server, read the terminal output. The URL may vary
depending on the project! An example of what to look for:

```console
[I 220816 14:43:41 server:335] Serving on http://127.0.0.1:8035
```

## Formatting / Linting

### ruff

The project uses [ruff] to handle formatting, sorting imports, and linting.

````{tab} Command

uv:

```console
$ uv run ruff check .
```

If you set up manually:

```console
$ ruff check .
```

````

````{tab} just

```console
$ just ruff
```

````

````{tab} Watch

```console
$ just watch-ruff
```

requires [`entr(1)`].

````

````{tab} Fix files

uv:

```console
$ uv run ruff check . --fix
```

If you set up manually:

```console
$ ruff check . --fix
```

````

#### ruff format

[ruff format] is used for formatting.

````{tab} Command

uv:

```console
$ uv run ruff format .
```

If you set up manually:

```console
$ ruff format .
```

````

````{tab} just

```console
$ just ruff-format
```

````

### mypy

[mypy] is used for static type checking.

````{tab} Command

uv:

```console
$ uv run mypy .
```

If you set up manually:

```console
$ mypy .
```

````

````{tab} just

```console
$ just mypy
```

````

````{tab} Watch

```console
$ just watch-mypy
```

requires [`entr(1)`].
````

## Releasing

Since this software is used in production projects, we don't release breaking changes
until there's a major feature release.

Choose what the next version is. Assuming it's version 0.9.0, it could be:

- 0.9.0post0: postrelease, if there was a packaging issue
- 0.9.1: bugfix / security / tweak
- 0.10.0: breaking changes, new features

Let's assume we pick 0.9.1

`CHANGES`: Ensure any PRs merged since the last release are mentioned. Give a thank you to the
contributor. Set the header with the new version and the date. Leave the "current" header and
_Insert changes/features/fixes for next release here_ at the top:

```markdown
## package-name 0.10.x (unreleased)

- _Insert changes/features/fixes for next release here_

## package-name 0.9.1 (2020-10-12)

- :issue:`1`: Fix bug
```

`package_name/__init__.py` and `__about__.py` - Set version

```console
$ git commit -m 'Tag v0.9.1'
```

```console
$ git push
```

Important: Create and push the tag. Make sure the version is correct and the
`pyproject.toml` and `__about__.py` match the version being deployed.

```console
$ git tag v0.9.1
```

```console
$ git push --tags
```

### Automated deployment

CI will automatically push to the PyPI index when a tag is pushed.

### Manual deployment

[uv] handles virtualenv creation, package requirements, versioning,
building, and publishing. Therefore there is no setup.py or requirements files.

Update `__version__` in `__about__.py` and `pyproject.toml`:

    git commit -m 'build(cihai): Tag v0.1.1'
    git tag v0.1.1
    git push
    git push --tags

GitHub Actions will detect the new git tag, and in its own workflow run `uv
build` and push to PyPI.

[uv]: https://github.com/astral-sh/uv
[entr(1)]: http://eradman.com/entrproject/
[`entr(1)`]: http://eradman.com/entrproject/
[ruff]: https://ruff.rs
[mypy]: http://mypy-lang.org/

---

# Project
Source: https://cihai.git-pull.com/project/

(project)=

# Project

Information for contributors and maintainers.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Contributing
:link: contributing
:link-type: doc
Development setup, running tests, submitting PRs.
:::

:::{grid-item-card} Code Style
:link: code-style
:link-type: doc
Ruff, mypy, NumPy docstrings, import conventions.
:::

:::{grid-item-card} Releasing
:link: releasing
:link-type: doc
Release checklist and version policy.
:::

::::

```{toctree}
:hidden:

contributing
code-style
releasing
```

---

# Releasing
Source: https://cihai.git-pull.com/project/releasing/

# Releasing

Since cihai is used in production projects, breaking changes are deferred
until a major feature release.

## Version numbering

Given a current version of `0.36.0`:

- **0.36.0post0** -- post-release, packaging fix only
- **0.36.1** -- bugfix / security / tweak
- **0.37.0** -- new features or breaking changes
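The numbering scheme above is mechanical enough to sketch in a few lines. `next_versions` is a hypothetical helper for illustration only; it assumes a plain `major.minor.patch` input and is not part of cihai's release tooling.

```python
import typing as t


def next_versions(current: str) -> t.Dict[str, str]:
    """Return the candidate next versions for a 'major.minor.patch' string."""
    major, minor, patch = (int(part) for part in current.split("."))
    return {
        # Packaging fix only: same code, re-released.
        "post": f"{current}post0",
        # Bugfix / security / tweak.
        "bugfix": f"{major}.{minor}.{patch + 1}",
        # New features or breaking changes.
        "feature": f"{major}.{minor + 1}.0",
    }


for kind, version in next_versions("0.36.0").items():
    print(kind, version)
```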

## Release checklist

1. Update `CHANGES` -- ensure every merged PR since the last tag is listed.
   Set the header to the new version and today's date. Keep the *unreleased*
   placeholder at the top.

2. Bump the version in `pyproject.toml` and `src/cihai/__about__.py`.

3. Commit and tag:

```console
$ git commit -m 'Tag v0.36.1'
```

```console
$ git tag v0.36.1
```

```console
$ git push && git push --tags
```

## Automated deployment

GitHub Actions detects the new tag and runs `uv build` followed by a push
to PyPI.

## Manual deployment

If CI is unavailable:

```console
$ uv build
```

```console
$ uv publish
```

---

# Quickstart
Source: https://cihai.git-pull.com/quickstart/

(quickstart)=

# Quickstart

cihai is designed to work out-of-the-box without configuration.

## Installation

```console
$ pip install --user cihai
```

(developmental-releases)=

### Developmental releases

New versions of cihai are published to PyPI as alpha, beta, or release candidates.
Identifiers like `a1`, `b1`, and `rc1` mark alpha, beta, and release candidates, respectively.

- [pip]\:

  ```console
  $ pip install --user --upgrade --pre cihai
  ```

- [pipx]\:

  ```console
  $ pipx run --pip-args '\--pre' --spec 'cihai' python -c "import cihai; print(cihai.__version__)"
  ```

- [uv]\:

  ```console
  $ uv add cihai --prerelease allow
  ```

- [uvx]\:

  ```console
  $ uvx --from 'cihai' --prerelease allow python -c "import cihai; print(cihai.__version__)"
  ```

(configuration)=

## Configuration

By default, cihai requires no configuration. The default file locations follow the
{term}`XDG Base Directory` specification for the user's system, and SQLite is used to store, seek,
and retrieve data.

You can override cihai's default storage and file directories via a config file.

The default configuration is at {attr}`cihai.constants.DEFAULT_CONFIG`.

Database configuration accepts any SQLAlchemy {sqlalchemy:ref}`database_urls`. If you're using a DB
other than SQLite, such as Postgres, be sure to install the requisite driver, such as
[psycopg][psycopg].

[xdg directories]: https://specifications.freedesktop.org/basedir-spec/basedir-spec-0.6.html

### Advanced Config

cihai is designed to allow you to incrementally override settings to your liking.

Internally, the config is parsed through {func}`cihai.config.expand_config`. This will replace
environment variables, XDG variables and tildes. You can also enter absolute paths.

Environmental variables require a dollar sign added to them, e.g. `${ENVVAR}`. XDG variables such as
_user_cache_dir_, _user_config_dir_, _user_data_dir_, _user_log_dir_, _site_config_dir_,
_site_data_dir_ use curly brackets only, e.g. `{site_config_dir}`. Tildes are expanded to the
user's home directory.

```{code-block} yaml

database:
  url: '${DATABASE_URL}'
dirs:
  data: '{user_data_dir}/mydata'
  cache: '~/cache/cihai'
  logs: '$ENVVAR/logs'

```

In the example above, Heroku's
[DATABASE_URL](https://devcenter.heroku.com/articles/heroku-postgresql#establish-primary-db) is
replaced as an environmental variable. The XDG variable for _user_data_dir_ is combined with
_mydata/_, so the data is stored in a subdirectory. The environmental variable _$ENVVAR_ is also
replaced.
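The three substitutions described above can be approximated with the standard library. This is a simplified sketch and not cihai's actual implementation: `expand_value` is a hypothetical helper, and the directory mapping passed to it is a hard-coded placeholder, where a real setup would derive the XDG locations from the user's system.

```python
import os
import re

# Matches {name} placeholders, but not the ${NAME} form used for env vars.
XDG_PLACEHOLDER = re.compile(r"(?<!\$)\{(\w+)\}")


def expand_value(value: str, xdg_dirs: dict) -> str:
    """Expand XDG placeholders, environment variables, and tildes in a string."""
    # 1. {user_data_dir}-style placeholders; unknown names are left untouched.
    value = XDG_PLACEHOLDER.sub(
        lambda m: xdg_dirs.get(m.group(1), m.group(0)), value
    )
    # 2. $ENVVAR and ${ENVVAR}; unset variables are left untouched.
    value = os.path.expandvars(value)
    # 3. A leading ~ becomes the user's home directory.
    return os.path.expanduser(value)


# Placeholder mapping for illustration only.
dirs = {"user_data_dir": "/home/user/.local/share/cihai"}
print(expand_value("{user_data_dir}/mydata", dirs))
```

Handling the curly-bracket placeholders with a targeted pattern, rather than `str.format`, keeps `${DATABASE_URL}` from being mistaken for an XDG variable.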

You may point to a custom config with the `-c` argument, `$ cihai -c path/to/config.yaml`.

You can also override bootstrapping settings. The "unihan_options" dictionary in Cihai's
configuration will be passed right to {ref}`unihan-etl:index`'s {class}`unihan_etl.core.Packager`
`option` param, which is then merged on top of unihan-etl's default settings:

```{code-block} yaml

unihan_options:
   source: 'https://custom-mirror.com/Unihan.zip'  # local paths work too
   work_dir: '/path/to/unzip/files'
   zip_path: '/path/to/store/Unihan.zip'
   fields: ['kDefinition']  # and / or:
   input_files: ['Unihan_Readings.txt']

```

[psycopg]: http://initd.org/psycopg/
[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pypa.github.io/pipx/docs/
[uv]: https://docs.astral.sh/uv/
[uvx]: https://docs.astral.sh/uv/guides/tools/

---

# Examples
Source: https://cihai.git-pull.com/topics/examples/

(examples)=

# Examples

## Basic usage

_examples/basic_usage.py_:

```{literalinclude} ../../examples/basic_usage.py
:language: python

```

## Character variants

_examples/variants.py_:

```{literalinclude} ../../examples/variants.py
:language: python

```

_examples/variant_ts_difficulties.py_:

```{literalinclude} ../../examples/variant_ts_difficulties.py
:language: python

```

---

# Extending cihai
Source: https://cihai.git-pull.com/topics/extending/

(extend)=

# Extending cihai

Use cihai's abstractions and your dataset's users get easy configuration and SQL access, while your
dataset joins a growing list of CJKV information sources.

## Creating a new dataset

Expand cihai's knowledge! Create a {class}`cihai.extend.Dataset`.

You can also make your dataset available in open source so other cihai users can use it! If you do,
bring it up on the [issue tracker]!

_examples/dataset.py_:

```{literalinclude} ../../examples/dataset.py
:language: python

```

In addition, view our reference implementation of UNIHAN, which is incorporated as a dataset. See
{class}`cihai.data.unihan.dataset.Unihan`.

[issue tracker]: https://github.com/cihai/cihai/issues

## Plugins: Adding features to a dataset

Extend a dataset with custom behavior to avoid repetition. Create a
{class}`cihai.extend.DatasetPlugin`.

See our reference implementation of {class}`cihai.data.unihan.dataset.UnihanVariants`

Datasets can be augmented with computed methods.

These utilize a dataset to pull information out, but are frequently used / generic enough to write a
plugin for.

An example of this would be the
[suggestion to add variant lookups for UNIHAN](https://github.com/cihai/cihai/pull/131).

## Combining datasets

Combining datasets is usually considered general library usage. But if your usage is common and
saves repetition, it's worth considering making it into a reusable extension and open sourcing
it.

Using the library to mix and match data from various sources is what cihai is meant to do! If you
have a way you're using cihai that you think would be helpful, definitely create an issue, a gist,
GitHub repo, etc! License it permissively please (MIT, BSD, ISC, etc!)

---

# Features
Source: https://cihai.git-pull.com/topics/features/

(features)=

# Features

- Handling CJK Variants

  cihai builds upon [UNIHAN]: "thousands of years worth of
  writing have produced thousands of pairs which can be used more-or-less interchangeably." For more
  information, see "Unification Rules" on page 679 of _The Unicode Standard_
  ([.pdf](http://www.unicode.org/versions/Unicode9.0.0/ch18.pdf)).

- Extensible

  cihai will be able to pull remote CJK datasets.

  In addition, the handling of variants will create new ways to discover and interpret CJK
  characters while using these datasets.

- Python API and CLI application

  Cihai can be used as a Python {ref}`API` as well as a command line application via `$ cihai`.

- Asian encoding swiss army knife

  Functions under the hood such as {ref}`cihai.conversion <cihai.conversion>` are tested across
  python implementations to handle a growing assortment of Asian encodings.

[unihan]: http://unicode.org/charts/unihan.html
[variants]: http://www.unicode.org/reports/tr38/tr38-21.html#N10211

---

# Topics
Source: https://cihai.git-pull.com/topics/

# Topics

Explore cihai's capabilities and underlying concepts at a high level, with detailed explanations to help you understand its design and usage.

::::{grid} 1 1 2 2
:gutter: 2 2 3 3

:::{grid-item-card} Features
:link: features
:link-type: doc
CJK variants, extensibility, encoding utilities.
:::

:::{grid-item-card} Examples
:link: examples
:link-type: doc
Annotated code samples for common tasks.
:::

:::{grid-item-card} Extending
:link: extending
:link-type: doc
Create datasets, plugins, and combine data sources.
:::

:::{grid-item-card} Troubleshooting
:link: troubleshooting
:link-type: doc
Common issues and their solutions.
:::

::::

```{toctree}
:hidden:

features
examples
extending
troubleshooting
```

---

# Troubleshooting
Source: https://cihai.git-pull.com/topics/troubleshooting/

(troubleshooting)=

# Troubleshooting

## Python 2.7 and UCS

Note, to get this working on python 2.7, you must have python built with _UCS4_ via
`--enable-unicode=ucs4`. You can test for UCS4 with:

```{code-block} python

>>> import sys
>>> sys.maxunicode > 0xffff
True

```

Most packaged and included Python distributions will already be built with UCS4 (such as Ubuntu's
system Python). On Python 3.3 and greater, this distinction no longer exists, so no action is needed.

---