SemanticChunker
Split text into chunks based on semantic similarity
The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. This approach is particularly useful for RAG applications where context preservation is crucial.
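To make the idea concrete, here is a minimal, standard-library-only sketch of similarity-based grouping. It is not Chonkie's implementation: the `embed`, `cosine`, and `semantic_chunks` helpers are hypothetical names, and the bag-of-words "embedding" is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real chunker would use an embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Group consecutive sentences, starting a new chunk when the similarity
    to the previous sentence drops below the threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Stock prices fell sharply.",
    "Stock prices recovered later.",
]
for chunk in semantic_chunks(sentences, threshold=0.3):
    print(chunk)
```

With this toy data, the two sentences about cats end up in one chunk and the two about stock prices in another, because the similarity between the second and third sentences is zero.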
Installation
SemanticChunker requires additional dependencies for its semantic capabilities. You can install them with the semantic extra: pip install "chonkie[semantic]"
Initialization
Parameters
Model identifier or embedding model instance
Mode for grouping sentences, either “cumulative” or “window”
A value in the range [0, 1] is used directly as the similarity threshold at which sentences are considered similar. A value in the range (1, 100] is interpreted as a percentile threshold. When set to “auto”, the threshold is calculated automatically.
Maximum tokens per chunk
Number of sentences to consider for similarity threshold calculation
Minimum number of sentences per chunk
Minimum number of characters per sentence
Minimum tokens per chunk
Step size for similarity threshold calculation
Delimiters to split sentences on
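The three ways of specifying the similarity threshold can be sketched as follows. This is an illustration only, not the library's code: `resolve_threshold` is a hypothetical helper, the nearest-rank percentile is an assumed method, and the "auto" calculation is deliberately left out because it is not described here.

```python
import math
from typing import Sequence, Union


def resolve_threshold(value: Union[float, str], similarities: Sequence[float]) -> float:
    """Turn a threshold setting into a concrete similarity cutoff (hypothetical helper)."""
    if value == "auto":
        # The automatic calculation is not described in this page, so it is not guessed at.
        raise NotImplementedError("'auto' threshold calculation is not sketched here")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("threshold must be a number or 'auto'")
    if 0 <= value <= 1:
        # Values in [0, 1] are used directly as the similarity threshold.
        return float(value)
    if 1 < value <= 100:
        # Values in (1, 100] are treated as a percentile of the observed
        # similarities (nearest-rank method, an assumption of this sketch).
        ordered = sorted(similarities)
        if not ordered:
            raise ValueError("a percentile threshold needs at least one similarity")
        rank = max(1, math.ceil(value * len(ordered) / 100))
        return ordered[rank - 1]
    raise ValueError("threshold must be in [0, 100] or 'auto'")


observed = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(resolve_threshold(0.5, observed))  # used as-is
print(resolve_threshold(90, observed))   # 90th percentile of the observed similarities
```

A percentile threshold adapts to the text: the same setting yields a different cutoff depending on how similar the sentences in a given document are to each other.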
Usage
Single Text Chunking
Batch Chunking
Supported Embeddings
SemanticChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.
Return Type
SemanticChunker returns SemanticChunk objects, which use slots for optimized storage.
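For readers unfamiliar with slots, the sketch below shows what the mechanism does in plain Python. `SketchChunk` and its field names are hypothetical and are not the actual SemanticChunk definition; the point is only that a class declaring `__slots__` stores a fixed set of attributes without a per-instance `__dict__`.

```python
class SketchChunk:
    """Hypothetical stand-in for a chunk object; the field names are illustrative."""

    # Declaring __slots__ fixes the attribute set and avoids a per-instance __dict__.
    __slots__ = ("text", "start_index", "end_index", "token_count")

    def __init__(self, text: str, start_index: int, end_index: int, token_count: int):
        self.text = text
        self.start_index = start_index
        self.end_index = end_index
        self.token_count = token_count


chunk = SketchChunk(text="Cats sleep.", start_index=0, end_index=11, token_count=3)
print(hasattr(chunk, "__dict__"))  # False: no per-instance dictionary is allocated

try:
    chunk.label = "animals"  # not listed in __slots__
except AttributeError as error:
    print("rejected:", error)
```

The trade-off is flexibility for memory: instances are smaller, which matters when a document produces many chunks, but arbitrary attributes can no longer be attached to them.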