Chonkie lets you plug in your own embeddings handler: subclass BaseEmbeddings and implement the required methods. It’s quite simple!

Example

First, we create a child class of BaseEmbeddings and implement each of the methods below.

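from typing import List

import numpy as np

from chonkie.embeddings import BaseEmbeddings


class CustomEmbeddings(BaseEmbeddings):

    @property
    def dimension(self) -> int:
        # Length of each embedding vector
        ...

    def embed(self, text: str) -> np.ndarray:
        # Embed a single text
        ...

    def embed_batch(self, texts: List[str]) -> List[np.ndarray]:
        # Embed several texts, one vector per text
        ...

    def count_tokens(self, text: str) -> int:
        # Number of tokens in a single text
        ...

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        # Number of tokens in each text
        ...

    def get_tokenizer_or_token_counter(self):
        # Tokenizer or token-counting callable used by this handler
        ...

    @classmethod
    def is_available(cls) -> bool:
        # Whether the handler's dependencies are installed
        ...

    def __repr__(self) -> str:
        ...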
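To make the skeleton more concrete, here is a minimal sketch of what the method bodies might look like. It is a toy, not a real embedding model: the class name ToyHashEmbeddings, the hashing scheme, and the whitespace tokenization are all illustrative assumptions. To stay dependency-free it does not subclass BaseEmbeddings and returns plain lists of floats, whereas a real handler would subclass BaseEmbeddings and return np.ndarray.

```python
import hashlib
import math
from typing import Callable, List


class ToyHashEmbeddings:
    """Hypothetical stand-in that mirrors the method names of the skeleton."""

    def __init__(self, dimension: int = 8) -> None:
        self._dimension = dimension

    @property
    def dimension(self) -> int:
        return self._dimension

    def count_tokens(self, text: str) -> int:
        # Whitespace tokenization, purely for illustration
        return len(text.split())

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        return [self.count_tokens(t) for t in texts]

    def embed(self, text: str) -> List[float]:
        # Hash each token into one of `dimension` buckets and count the hits
        vector = [0.0] * self._dimension
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[digest[0] % self._dimension] += 1.0
        # L2-normalize so vectors can be compared with a dot product
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        return [self.embed(t) for t in texts]

    def get_tokenizer_or_token_counter(self) -> Callable[[str], int]:
        # Assumption: returning a token-counting callable is acceptable here
        return self.count_tokens

    @classmethod
    def is_available(cls) -> bool:
        # Only the standard library is needed, so this is always available
        return True

    def __repr__(self) -> str:
        return f"ToyHashEmbeddings(dimension={self._dimension})"


toy = ToyHashEmbeddings(dimension=8)
vec = toy.embed("hello world hello")
```

The same bodies would slot into the CustomEmbeddings skeleton above once the return values are wrapped in NumPy arrays.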

At this point we have a custom embeddings handler, which we can instantiate directly:

embeddings = CustomEmbeddings()

But let’s say we want to use this together with the AutoEmbeddings class, for the sake of convenience. We can do this by registering it with the EmbeddingsRegistry.

from chonkie.embeddings import EmbeddingsRegistry

# Register with the embeddings registry
EmbeddingsRegistry.register(
    "custom",
    CustomEmbeddings,
    pattern=r"^custom/|^model-name", 
    valid_types=["CustomEmbeddings"]
)

Now we can use our custom embeddings handler with the AutoEmbeddings class.

from chonkie.embeddings import AutoEmbeddings

embeddings = AutoEmbeddings.get_embeddings("custom/my-custom-embeddings")
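The identifier above reaches the custom handler because it matches the pattern passed at registration. The sketch below checks a few identifiers against that same pattern using only Python's re module. It assumes the registry tests identifiers with an ordinary regex match from the start of the string; the helper routes_to_custom and the sample identifiers are hypothetical.

```python
import re

# Same pattern as in the registration call: either alternative
# must match at the start of the identifier.
PATTERN = re.compile(r"^custom/|^model-name")


def routes_to_custom(identifier: str) -> bool:
    """Return True if the identifier matches the registered pattern."""
    return PATTERN.match(identifier) is not None


candidates = [
    "custom/my-custom-embeddings",
    "model-name-v2",
    "my/custom/model",
    "all-MiniLM-L6-v2",
]
results = {name: routes_to_custom(name) for name in candidates}

for name, matched in results.items():
    print(f"{name}: {matched}")
```

Only the first two identifiers match, since the pattern is anchored with ^ and neither alternative appears at the start of the other two.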

Finally, we can use our custom embeddings handler in the same way we would use any other embeddings handler.

from chonkie import SemanticChunker

# `text` is the document string you want to chunk
chunker = SemanticChunker(embedding_model=embeddings, similarity_threshold=0.7)
chunks = chunker(text)