Extending cihai

Build on cihai’s abstractions and your dataset’s users get easy configuration and SQL access, while your dataset joins a growing list of CJKV information sources.

Creating a new dataset

Expand cihai’s knowledge! Create a cihai.extend.Dataset.

You can also make your dataset available in open source so other cihai users can use it! If you do, bring it up on the issue tracker!

examples/dataset.py:

#!/usr/bin/env python
"""Example of a custom dataset for cihai."""

from __future__ import annotations

import logging

from cihai.core import Cihai
from cihai.extend import Dataset

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")


data = {}  # any data source, internal, a file, on the internet, in a database...


class MyDataset(Dataset):
    """Hardcoded example dataset for Cihai."""

    def bootstrap(self) -> None:  # run automatically by .add_dataset, if defined
        """Initialize hard-coded dataset."""
        # Use this to setup your dataset, check if updates are needed, etc.
        data.update({"好": "Good", "好好": "Hello"})

    def givemedata(self, key: str) -> str:
        """Return data via direct key match."""
        return data[key]

    def search(self, needle: str) -> dict[str, object]:
        """Return key-value mapping of keys matching a subset of value."""
        return {k: v for k, v in data.items() if needle in k}

    def backwards(self, needle: str) -> list[str]:
        """Reverse lookup."""
        return [k for k, v in data.items() if needle in v]


def run() -> None:
    """Run hard-coded example dataset."""
    c = Cihai(unihan=False)

    c.add_dataset(MyDataset, namespace="moo")
    my_dataset = MyDataset()
    my_dataset.bootstrap()

    log.info("Definitions exactly for 好", my_dataset.givemedata("好"))

    log.info("Definitions matching with 你好:", ", ".join(my_dataset.search("好")))

    log.info("Reverse definition with Good:", ", ".join(my_dataset.backwards("Good")))


if __name__ == "__main__":
    run()

In addition, see our reference implementation of UNIHAN, which is incorporated as a dataset: cihai.data.unihan.dataset.Unihan.

Plugins: Adding features to a dataset

Extend a dataset with custom behavior to avoid repetition. Create a cihai.extend.DatasetPlugin.

See our reference implementation of cihai.data.unihan.dataset.UnihanVariants

Datasets can be augmented with computed methods.

These use a dataset to pull information out, but are used frequently enough, or are generic enough, to be worth writing as a plugin.

An example of this would be the suggestion to add variant lookups for UNIHAN.
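The relationship between a dataset and a plugin can be sketched in plain Python. This is only an illustration of the pattern, not cihai’s implementation: SketchDataset, SketchPlugin, ReversePlugin, and this add_plugin signature are hypothetical stand-ins, so check cihai.extend.Dataset and cihai.extend.DatasetPlugin for the real interfaces.

```python
"""Stand-in sketch of the dataset/plugin pattern; these are not cihai classes."""

from __future__ import annotations


class SketchPlugin:
    """Hypothetical plugin base: keeps a reference to the dataset it extends."""

    def __init__(self, dataset: SketchDataset) -> None:
        self.dataset = dataset


class SketchDataset:
    """Hypothetical dataset holding a simple character-to-definition mapping."""

    def __init__(self, data: dict[str, str]) -> None:
        self.data = data
        self.plugins: dict[str, SketchPlugin] = {}

    def add_plugin(self, plugin_cls: type[SketchPlugin], namespace: str) -> None:
        """Instantiate the plugin with this dataset and register it by name."""
        self.plugins[namespace] = plugin_cls(self)


class ReversePlugin(SketchPlugin):
    """A computed method kept out of the dataset itself so it can be reused."""

    def reverse(self, needle: str) -> list[str]:
        """Return the keys whose definition contains the needle."""
        return [k for k, v in self.dataset.data.items() if needle in v]


if __name__ == "__main__":
    dataset = SketchDataset({"好": "Good", "好好": "Hello"})
    dataset.add_plugin(ReversePlugin, namespace="reverse")
    print(dataset.plugins["reverse"].reverse("Good"))
```

The plugin only reads from the dataset it was attached to, which is what lets the same computed lookup be shared across projects instead of being rewritten each time.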

Combining datasets

Combining datasets is usually considered general library usage. But if your usage is common and saves others from repetition, it’s worth considering turning it into a reusable extension and open sourcing it.
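As a rough idea of what combining sources looks like, here is a minimal sketch that merges two lookups into one record. The dictionaries and the combined_lookup helper are hypothetical; in practice each source could be a cihai dataset, a file, or a database query.

```python
"""Sketch of combining two data sources; all names here are hypothetical."""

from __future__ import annotations

# Stand-in sources. Real ones could be cihai datasets, files, or databases.
DEFINITIONS = {"好": "good", "水": "water"}
READINGS = {"好": "hǎo", "水": "shuǐ", "火": "huǒ"}


def combined_lookup(char: str) -> dict[str, str | None]:
    """Merge what each source knows about a character into one record.

    A source that has no entry for the character contributes None.
    """
    return {
        "char": char,
        "definition": DEFINITIONS.get(char),
        "reading": READINGS.get(char),
    }


if __name__ == "__main__":
    print(combined_lookup("好"))
    print(combined_lookup("火"))
```

If a helper like this keeps reappearing in your projects, that is a good sign it belongs in a reusable extension.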

Using the library to mix and match data from various sources is what cihai is meant to do! If you’re using cihai in a way that you think would be helpful to others, definitely create an issue, a gist, a GitHub repo, etc.! Please license it permissively (MIT, BSD, ISC, etc.)!