Chunkers Overview
Overview of the different chunkers available in Chonkie
Chonkie provides multiple chunking strategies to handle different text processing needs. Each chunker in Chonkie is designed to follow the same core principles outlined in the concepts page.
TokenChunker
Splits text into fixed-size token chunks. Best for maintaining consistent chunk sizes and working with token-based models.
WordChunker
Splits text while preserving word boundaries. Ideal when you need human-readable chunks without breaking words.
SentenceChunker
Splits text at sentence boundaries. Perfect for maintaining semantic completeness at the sentence level.
RecursiveChunker
Recursively chunks documents into smaller chunks. Best for long documents with well-defined structure.
SemanticChunker
Groups content based on semantic similarity. Best for preserving context and topical coherence.
SDPMChunker
Chunks using the Semantic Double-Pass Merging (SDPM) algorithm. Best for maintaining topical coherence when the text has frequent breaks.
LateChunker
Chunks using the Late Chunking algorithm. Best for higher recall in your RAG applications.
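To make the difference between the fixed-size and boundary-aware strategies above more concrete, here is a minimal pure-Python sketch. It is not Chonkie's implementation: the functions token_chunks and sentence_chunks are hypothetical names, whitespace-separated words stand in for real tokens, and the sentence splitter is a simple regular expression.

```python
import re
from typing import List


def token_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Fixed-size chunks over whitespace 'tokens' (a stand-in for a real tokenizer)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def sentence_chunks(text: str, chunk_size: int) -> List[str]:
    """Pack whole sentences into chunks of roughly chunk_size whitespace tokens.

    A single sentence longer than chunk_size is kept whole rather than split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > chunk_size:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Chunking splits text. Each chunk stays small. Sentences can stay whole."
print(token_chunks(text, 4))     # windows may cut through a sentence
print(sentence_chunks(text, 4))  # every chunk ends on a sentence boundary
```

The fixed-size version gives predictable chunk lengths, while the sentence-based version trades some size consistency for chunks that read as complete units.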
Availability
Different chunkers are available depending on your installation:
Chunker | Default | embeddings | all |
---|---|---|---|
TokenChunker | ✅ | ✅ | ✅ |
WordChunker | ✅ | ✅ | ✅ |
SentenceChunker | ✅ | ✅ | ✅ |
RecursiveChunker | ✅ | ✅ | ✅ |
SemanticChunker | ❌ | ✅ | ✅ |
SDPMChunker | ❌ | ✅ | ✅ |
LateChunker | ❌ | ✅ | ✅ |
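If you are unsure which chunkers your installation exposes, a small check along these lines may help. This is a sketch that assumes each chunker is a top-level attribute of the chonkie package; available_chunkers is a hypothetical helper, not part of the library. Note that a name resolving successfully does not guarantee its optional dependencies are installed.

```python
import importlib

CHUNKER_NAMES = [
    "TokenChunker", "WordChunker", "SentenceChunker", "RecursiveChunker",
    "SemanticChunker", "SDPMChunker", "LateChunker",
]


def available_chunkers(package: str = "chonkie") -> dict:
    """Map each chunker name to True if it can be resolved from the package."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        # The package itself is missing, so nothing is available.
        return {name: False for name in CHUNKER_NAMES}
    status = {}
    for name in CHUNKER_NAMES:
        try:
            getattr(module, name)
            status[name] = True
        except Exception:
            # Assumption: a missing optional dependency surfaces as an error here.
            status[name] = False
    return status


availability = available_chunkers()
for name, ok in availability.items():
    print(f"{name}: {'available' if ok else 'not available'}")
```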
Universal Tokenizer Support
All chunkers accept any tokenizer through their tokenizer or tokenizer_or_token_counter argument, including tiktoken, huggingface/tokenizers, and transformers.
You can also pass a token_counter function to the chunker. The chunker then calls this function to count the tokens in the text.
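As an illustration, a token counter is just a callable that takes a string and returns an integer. The sketch below defines a toy counter (count_tokens is a hypothetical name, and its regular expression is only a rough approximation of real tokenization). The commented-out line shows how it might be handed to a chunker, assuming the argument name mentioned above.

```python
import re


def count_tokens(text: str) -> int:
    """Toy token counter: counts words and standalone punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


print(count_tokens("Hello, world!"))  # 4: "Hello", ",", "world", "!"

# Hypothetical usage, assuming the argument name from the text above:
# chunker = TokenChunker(tokenizer_or_token_counter=count_tokens)
```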
Furthermore, you can pass character or word as the tokenizer_or_token_counter argument to count characters or words in the text instead of tokens.
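One way to picture how a single argument can accept either a keyword or a function is a small resolver like the one below. This is only a sketch of the idea, not Chonkie's internals: resolve_counter is a hypothetical helper, and "word" is assumed to mean whitespace-separated words.

```python
from typing import Callable, Union

Counter = Callable[[str], int]


def resolve_counter(spec: Union[str, Counter]) -> Counter:
    """Turn 'character', 'word', or a custom callable into a counting function."""
    if callable(spec):
        return spec  # custom token_counter function, used as-is
    if spec == "character":
        return len  # one unit per character
    if spec == "word":
        return lambda text: len(text.split())  # one unit per whitespace-separated word
    raise ValueError(f"unsupported counter spec: {spec!r}")


print(resolve_counter("character")("chunk me"))  # 8
print(resolve_counter("word")("chunk me"))       # 2
```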
Common Interface
All chunkers share a consistent interface. You can directly call the chunker on a string or a list of strings.
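The calling convention described above can be sketched with a toy class. ToyChunker is not a Chonkie class, and the method names chunk and chunk_batch are assumptions made for illustration. The point is only that calling the object dispatches on whether it receives one string or a list of strings.

```python
from typing import List, Sequence, Union


class ToyChunker:
    """Illustrative chunker that groups whitespace-separated words."""

    def __init__(self, chunk_size: int = 3) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> List[str]:
        """Split one text into groups of chunk_size words."""
        words = text.split()
        return [
            " ".join(words[i:i + self.chunk_size])
            for i in range(0, len(words), self.chunk_size)
        ]

    def chunk_batch(self, texts: Sequence[str]) -> List[List[str]]:
        """Chunk several texts, returning one list of chunks per text."""
        return [self.chunk(text) for text in texts]

    def __call__(self, text_or_texts: Union[str, Sequence[str]]):
        # A single string yields a flat list; a list of strings yields nested lists.
        if isinstance(text_or_texts, str):
            return self.chunk(text_or_texts)
        return self.chunk_batch(text_or_texts)


chunker = ToyChunker(chunk_size=2)
print(chunker("a b c"))         # ['a b', 'c']
print(chunker(["a b c", "d"]))  # [['a b', 'c'], ['d']]
```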