Chonkie provides multiple chunking strategies to handle different text processing needs. Each chunker in Chonkie is designed to follow the same core principles outlined in the concepts page.

Availability

Different chunkers are available depending on your installation:

Chunker            Default   'embeddings'   'all'
TokenChunker       ✅         ✅              ✅
WordChunker        ✅         ✅              ✅
SentenceChunker    ✅         ✅              ✅
RecursiveChunker   ✅         ✅              ✅
SemanticChunker    ❌         ✅              ✅
SDPMChunker        ❌         ✅              ✅
LateChunker        ❌         ❌              ✅
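
For example, to enable the embedding-based chunkers you would install the corresponding extra. A minimal sketch, assuming the extras names shown in the table above:

# Default install: token, word, sentence, and recursive chunkers
pip install chonkie

# Adds SemanticChunker and SDPMChunker
pip install "chonkie[embeddings]"

# Everything, including LateChunker
pip install "chonkie[all]"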

Universal Tokenizer Support

All chunkers accept any tokenizer through their tokenizer or tokenizer_or_token_counter argument, including tokenizers from tiktoken, Hugging Face tokenizers, and transformers. You can also pass a token_counter function, which the chunker will use to count the tokens in the text.

from tokenizers import Tokenizer
from chonkie import TokenChunker

# Load a pretrained GPT-2 tokenizer from the Hugging Face tokenizers library
tokenizer = Tokenizer.from_pretrained("gpt2")
chunker = TokenChunker(tokenizer=tokenizer)

chunks = chunker("Hello, world!")
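
The same pattern works with other tokenizer libraries. A minimal sketch using a tiktoken encoding, assuming tiktoken is installed:

import tiktoken
from chonkie import TokenChunker

# tiktoken encodings can be passed in the same way
tokenizer = tiktoken.get_encoding("gpt2")
chunker = TokenChunker(tokenizer=tokenizer)

chunks = chunker("Hello, world!")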

If you would rather supply your own token_counter function, you can do:

from chonkie import SentenceChunker

# Define your own token counter function
def token_counter(text):
    return len(text.split())

# Pass the token counter function to the chunker
chunker = SentenceChunker(tokenizer_or_token_counter=token_counter)

chunks = chunker("Hello, world!")
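
Note that with a custom counter like this, chunk sizes are measured in whatever unit your function returns — here, whitespace-separated words rather than model tokens.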

Furthermore, you can pass "character" or "word" as the tokenizer_or_token_counter argument to count the number of characters or words in the text.

from chonkie import SentenceChunker

# Count the number of characters in the text
chunker = SentenceChunker(tokenizer_or_token_counter="character")

chunks = chunker("Hello, world!")
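
Similarly, passing "word" counts whitespace-separated words:

from chonkie import SentenceChunker

# Count the number of words in the text
chunker = SentenceChunker(tokenizer_or_token_counter="word")

chunks = chunker("Hello, world!")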

Common Interface

All chunkers share a consistent interface. You can directly call the chunker on a string or a list of strings.

# Direct calling
chunks = chunker(text)  # or chunker([text1, text2])

# Single text chunking
chunks = chunker.chunk(text)

# Batch processing
chunks = chunker.chunk_batch(texts, show_progress_bar=True)
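
Every call returns a list of Chunk objects. A quick sketch of inspecting them, assuming Chonkie's standard Chunk fields (text, character offsets, and token count):

for chunk in chunker.chunk(text):
    print(chunk.text)         # the chunk's text
    print(chunk.start_index)  # start offset in the original text
    print(chunk.end_index)    # end offset in the original text
    print(chunk.token_count)  # number of tokens in the chunk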

F.A.Q.