SemanticChunker
Split text into chunks based on semantic similarity
The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. This approach is particularly useful for RAG applications where context preservation is crucial.
Installation
SemanticChunker requires additional dependencies for semantic capabilities. You can install it with:
Initialization
Here’s how to initialize the SemanticChunker with default parameters. In most cases, you can use the default parameters without any issues.
Parameters
Model identifier or embedding model instance
Mode for grouping sentences, either “cumulative” or “window”. Window mode groups sentences based on similarity within a window of sentences. Cumulative mode dynamically adjusts the window size based on the similarity of the sentences, but takes longer to process.
Threshold used to decide whether sentences are similar. A value in the range [0, 1] is used directly as the similarity threshold. A value in the range (1, 100] is interpreted as a percentile threshold. When set to “auto”, the threshold is calculated automatically.
Maximum tokens per chunk
Number of sentences to consider for similarity threshold calculation
Minimum number of sentences per chunk
Minimum number of characters per sentence
Minimum tokens per chunk
Step size for similarity threshold calculation
Delimiters to split sentences on
Return type, either “chunks” or “texts”
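To make the threshold and grouping parameters above more concrete, here is a minimal, self-contained sketch. It is not Chonkie's implementation: `resolve_threshold` and `group_by_similarity` are hypothetical helpers, the percentile uses simple linear interpolation, and the "auto" setting is not covered. It assumes you already have one similarity score for each pair of adjacent sentences.

```python
from typing import List, Sequence


def resolve_threshold(value: float, similarities: Sequence[float]) -> float:
    """Turn a numeric threshold setting into an absolute similarity cutoff.

    Hypothetical helper: values in [0, 1] are used as-is, values in (1, 100]
    are read as a percentile of the observed similarity scores.
    """
    if 0 <= value <= 1:
        return float(value)
    if 1 < value <= 100:
        ordered = sorted(similarities)
        if not ordered:
            raise ValueError("need at least one similarity score for a percentile")
        # Linear interpolation between the two closest ranks.
        rank = (value / 100) * (len(ordered) - 1)
        lower = int(rank)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = rank - lower
        return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
    raise ValueError("threshold must be in [0, 1] or (1, 100]")


def group_by_similarity(
    sentences: Sequence[str],
    similarities: Sequence[float],
    threshold: float,
    min_sentences: int = 1,
) -> List[List[str]]:
    """Group sentences, splitting where adjacent similarity drops below threshold.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1].
    A split only happens once the current group holds at least min_sentences.
    """
    if len(similarities) != max(len(sentences) - 1, 0):
        raise ValueError("expected one similarity per adjacent sentence pair")
    groups: List[List[str]] = []
    current: List[str] = []
    for index, sentence in enumerate(sentences):
        current.append(sentence)
        if index == len(sentences) - 1:
            break
        if similarities[index] < threshold and len(current) >= min_sentences:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.8]
    cutoff = resolve_threshold(0.5, scores)
    print(group_by_similarity(["a", "b", "c", "d"], scores, cutoff))
```

The sketch ignores the token limits (maximum and minimum tokens per chunk) and the windowed similarity calculation, which a real chunker would also need to enforce.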
Methods
The SemanticChunker has the following methods:
__call__
The __call__ method allows you to call the chunker like a function. It uses the .chunk or .chunk_batch method internally, depending on the arguments passed.
Arguments:
Text to chunk.
Whether to show a progress bar.
Returns:
Result of the chunking process.
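The dispatch described for __call__ can be pictured with a small stand-in class. This is a sketch of the pattern only: `ToyChunker` is hypothetical, its paragraph-splitting logic is a placeholder for real semantic chunking, and the `show_progress` parameter name is an assumption rather than the library's confirmed signature.

```python
from typing import List, Sequence, Union


class ToyChunker:
    """Hypothetical stand-in that mimics the call-dispatch pattern."""

    def chunk(self, text: str) -> List[str]:
        # Placeholder logic: split on blank lines instead of semantic similarity.
        return [part.strip() for part in text.split("\n\n") if part.strip()]

    def chunk_batch(
        self, texts: Sequence[str], show_progress: bool = False
    ) -> List[List[str]]:
        # show_progress is accepted but unused in this sketch.
        return [self.chunk(text) for text in texts]

    def __call__(
        self, text: Union[str, Sequence[str]], show_progress: bool = False
    ) -> Union[List[str], List[List[str]]]:
        # A single string goes to chunk; a list or tuple goes to chunk_batch.
        if isinstance(text, str):
            return self.chunk(text)
        if isinstance(text, (list, tuple)):
            return self.chunk_batch(list(text), show_progress)
        raise TypeError("expected a string or a list of strings")


if __name__ == "__main__":
    chunker = ToyChunker()
    print(chunker("first paragraph\n\nsecond paragraph"))
    print(chunker(["one\n\ntwo", "three"]))
```

Calling the object with a single string returns one list of chunks, while calling it with a list returns one list of chunks per input text.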
.chunk
The .chunk method splits a single text into chunks.
Arguments:
Text to chunk.
Returns:
Result of the chunking process.
.chunk_batch
The .chunk_batch method splits each text in a batch into chunks.
Arguments:
List of texts to chunk.
Whether to show a progress bar.
Returns:
Result of the chunking process.
Usage Examples
Supported Embeddings
SemanticChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.
Associated Return Type
SemanticChunker returns SemanticChunk objects: