so i found myself making another RAG bot (for the 2342148th time) while explaining to my juniors why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i used the bloated software library X or the extremely feature-less library Y. WHY CAN I NOT HAVE SOMETHING JUST RIGHT, UGH?

Can i just install, import and chunk, without having to worry about dependencies, bloat, speed or other factors?

Well, with chonkie you can! (chonkie boi is a gud boi)

Feature-rich

All the CHONKs you’d ever need for your RAG applications

Easy to use

Install, Import, CHONK - it’s that simple!

Lightning Fast

CHONK at the speed of light! zooooom

Wide Support

Supports all your favorite tokenizer, model and API CHONKs

Lightweight

No bloat, just CHONK - only 9.7MB base installation

Cute Mascot

psst it’s a pygmy hippo btw! Moto Moto approved

Quick Start

Get started with Chonkie in three simple steps: Install, Import and CHONK!

Installation

pip install chonkie

Want more features? Install with:

pip install chonkie[all]

Chonkie follows a special approach to dependencies, keeping the base installation lightweight while allowing you to add extra features as and when needed. Please check the Installation page for more details.
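For example, if you only need one optional feature set, you can install just that extra on top of the base package. The extra name semantic below is illustrative only; the actual extra names are listed on the Installation page:

pip install chonkie[semantic]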

Usage

Here’s a basic example to get you started:

# First import the chunker you want from Chonkie 
from chonkie import TokenChunker

# Initialize the chunker
chunker = TokenChunker() # defaults to using GPT2 tokenizer

# Here's some text to chunk
text = """
Woah! Chonkie, the chunking library is so cool! I love the tiny hippo hehe.
"""

# Chunk some text
chunks = chunker(text)

# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

Documentation

Ready to learn more about Chonkie?