Extending cihai

Build your dataset on cihai’s abstraction and its users get easy configuration and SQL access, while the dataset itself joins a growing list of CJKV information sources.

Creating a new dataset

Expand cihai’s knowledge! Create a cihai.extend.Dataset.

You can also make your dataset available in open source so other cihai users can use it! If you do, bring it up on the issue tracker!

examples/dataset.py:

#!/usr/bin/env python
"""Example of a custom dataset for cihai."""

import logging
import typing as t

from cihai.core import Cihai
from cihai.extend import Dataset

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")


data = {}  # any data source, internal, a file, on the internet, in a database...


class MyDataset(Dataset):
    """Hardcoded example dataset for Cihai."""

    def bootstrap(self) -> None:  # run automatically by .add_dataset, if it exists
        """Initialize hard-coded dataset."""
        # Use this to setup your dataset, check if updates are needed, etc.
        data.update({"好": "Good", "好好": "Hello"})

    def givemedata(self, key: str) -> str:
        """Return data via direct key match."""
        return data[key]

    def search(self, needle: str) -> t.Dict[str, object]:
        """Return key-value mapping of entries whose key contains needle."""
        return {k: v for k, v in data.items() if needle in k}

    def backwards(self, needle: str) -> t.List[str]:
        """Reverse lookup."""
        return [k for k, v in data.items() if needle in v]


def run() -> None:
    """Run hard-coded example dataset."""
    c = Cihai(unihan=False)

    c.add_dataset(MyDataset, namespace="moo")
    my_dataset = MyDataset()
    my_dataset.bootstrap()

    # logging needs a %s placeholder for each extra argument it is given
    log.info("Definitions exactly for 好: %s", my_dataset.givemedata("好"))

    log.info("Definitions matching with 好: %s", ", ".join(my_dataset.search("好")))

    log.info("Reverse definition with Good: %s", ", ".join(my_dataset.backwards("Good")))


if __name__ == "__main__":
    run()

In addition, see our reference implementation of UNIHAN, which is incorporated as a dataset: cihai.data.unihan.dataset.Unihan.

Plugins: Adding features to a dataset

Extend a dataset with custom behavior to avoid repetition. Create a cihai.extend.DatasetPlugin.

See our reference implementation, cihai.data.unihan.dataset.UnihanVariants.

Datasets can be augmented with computed methods. These use a dataset to pull information out, and are used frequently enough, or are generic enough, to be worth writing as a plugin.

An example of this would be the suggestion to add variant lookups for UNIHAN.
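To make the idea of a computed method concrete, here is a minimal sketch of a variant lookup layered on top of a dataset. It deliberately does not import cihai: SketchDataset and VariantsPlugin are hypothetical stand-ins for subclasses of cihai.extend.Dataset and cihai.extend.DatasetPlugin, the field names are invented for illustration, and passing the dataset to the plugin's constructor is an assumption of this sketch rather than how cihai wires plugins.

```python
"""Sketch of a plugin adding a computed lookup to a dataset (hypothetical names)."""

import typing as t


class SketchDataset:
    """Stand-in for a cihai.extend.Dataset subclass; not cihai's own class."""

    def __init__(self) -> None:
        self.data: t.Dict[str, t.Dict[str, str]] = {}

    def bootstrap(self) -> None:
        """Load hard-coded sample rows: character -> fields."""
        self.data.update(
            {
                "国": {"definition": "country", "traditional": "國"},
                "國": {"definition": "country", "simplified": "国"},
                "好": {"definition": "good"},
            }
        )

    def lookup(self, char: str) -> t.Dict[str, str]:
        """Return the fields stored for char, or an empty dict."""
        return self.data.get(char, {})


class VariantsPlugin:
    """Stand-in for a cihai.extend.DatasetPlugin subclass."""

    # Invented field names; a real dataset would define its own.
    variant_fields = ("traditional", "simplified")

    def __init__(self, dataset: SketchDataset) -> None:
        # Assumption of this sketch: the plugin is handed its dataset directly.
        self.dataset = dataset

    def variants_of(self, char: str) -> t.List[str]:
        """Computed method: collect every variant recorded for char."""
        row = self.dataset.lookup(char)
        return [row[field] for field in self.variant_fields if field in row]


dataset = SketchDataset()
dataset.bootstrap()
variants = VariantsPlugin(dataset)
print(variants.variants_of("国"))
```

The dataset stays a plain store of rows, while the plugin holds the logic that would otherwise be repeated by every caller who needs variants.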

Combining datasets

Combining datasets is usually considered general library usage. But if your usage is common and saves others from repetition, it’s worth considering making it into a reusable extension and open sourcing it.
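As a rough illustration of what combining can look like, the sketch below merges lookups from two independent sources. It is plain Python and makes no claims about cihai's API: the two dictionaries stand in for bootstrapped datasets, and combined_lookup is a hypothetical helper.

```python
"""Sketch of combining lookups from several data sources (hypothetical names)."""

import typing as t

# Two stand-in sources; in practice each could be a separate dataset.
definitions = {"好": "good", "国": "country"}
readings = {"好": "hǎo", "水": "shuǐ"}


def combined_lookup(
    char: str, **sources: t.Mapping[str, str]
) -> t.Dict[str, str]:
    """Collect what each named source knows about char.

    Sources that have no entry for char are left out of the result.
    """
    return {
        name: source[char] for name, source in sources.items() if char in source
    }


print(combined_lookup("好", definition=definitions, reading=readings))
```

A helper like this is the kind of glue that, once several projects need it, may be worth packaging as its own extension.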

Using the library to mix and match data from various sources is what cihai is meant to do! If you’re using cihai in a way you think would be helpful to others, definitely create an issue, a gist, a GitHub repo, etc.! Please license it permissively (MIT, BSD, ISC, etc.)!