The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. This approach is particularly useful for RAG applications where context preservation is crucial.

Installation

SemanticChunker requires additional dependencies for semantic capabilities. You can install it with:

pip install "chonkie[semantic]"
For installation instructions, see the Installation Guide.

Initialization

Here’s how to initialize the SemanticChunker with default parameters. In most cases, you can use the default parameters without any issues.

from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    mode="window",                               # Mode for grouping sentences, either "cumulative" or "window"
    threshold="auto",                            # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=512,                              # Maximum tokens per chunk
    similarity_window=1,                         # Number of sentences to consider while windowing for similarity
    min_sentences=1,                             # Minimum number of sentences per chunk
    min_characters_per_sentence=12,              # Minimum number of characters per sentence
    min_chunk_size=2,                            # Minimum tokens per chunk
    threshold_step=0.01,                         # Step size for similarity threshold calculation
    delim=['.', '!', '?', '\n'],                 # Delimiters to split sentences on
    return_type="chunks"                         # Return type, either "chunks" or "texts"
)

Parameters

embedding_model
Union[str, BaseEmbeddings]
default:"minishlab/potion-base-8M"

Model identifier or embedding model instance

mode
Optional[str]
default:"window"

Mode for grouping sentences, either “cumulative” or “window”. Window mode groups sentences by similarity within a sliding window of sentences. Cumulative mode dynamically adjusts the window size based on sentence similarity, but takes longer to process.
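The difference between the two modes can be sketched with plain cosine similarity over toy sentence vectors. This is purely illustrative pure-Python code, not Chonkie's internal implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def group_window(vecs, threshold):
    """Window-style grouping: start a new group when a sentence's
    similarity to the previous sentence drops below the threshold."""
    groups = [[0]]
    for i in range(1, len(vecs)):
        if cosine(vecs[i], vecs[i - 1]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def group_cumulative(vecs, threshold):
    """Cumulative-style grouping: compare each sentence against the mean
    of the current group, so the group grows while the topic stays coherent."""
    groups, current = [], [0]
    for i in range(1, len(vecs)):
        mean = [sum(col) / len(current) for col in zip(*(vecs[j] for j in current))]
        if cosine(vecs[i], mean) >= threshold:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups

# Two "topics": x-leaning vectors, then y-leaning vectors
vecs = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]
print(group_window(vecs, 0.5))      # [[0, 1], [2, 3]]
print(group_cumulative(vecs, 0.5))  # [[0, 1], [2, 3]]
```

On this toy input the two modes agree; on longer, drifting topics the cumulative comparison against the group mean is more stable but costs extra computation per sentence.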

threshold
Union[float, int, str]
default:"auto"

When in the range [0,1], denotes the similarity threshold to consider sentences similar. When in the range (1,100], interprets the given value as a percentile threshold. When set to “auto”, the threshold is automatically calculated.
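The three threshold forms can be sketched with a hypothetical helper (`resolve_threshold` is an illustration of the semantics described above, not a function in Chonkie):

```python
import math

def resolve_threshold(value, pairwise_similarities):
    """Interpret a threshold value as the docs describe (illustrative).

    - float in [0, 1]    -> used directly as a similarity cutoff
    - number in (1, 100] -> treated as a percentile of observed similarities
    - "auto"             -> placeholder for the chunker's automatic tuning
    """
    if value == "auto":
        raise NotImplementedError("auto-tuning is internal to the chunker")
    if 0 <= value <= 1:
        return float(value)
    if 1 < value <= 100:
        sims = sorted(pairwise_similarities)
        # Nearest-rank percentile over the observed similarities
        k = max(0, math.ceil(value / 100 * len(sims)) - 1)
        return sims[k]
    raise ValueError("threshold must be in [0, 1], (1, 100], or 'auto'")

sims = [0.2, 0.4, 0.6, 0.8]
print(resolve_threshold(0.7, sims))  # 0.7
print(resolve_threshold(50, sims))   # 0.4  (50th percentile, nearest rank)
```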

chunk_size
int
default:"512"

Maximum tokens per chunk

similarity_window
Optional[int]
default:"1"

Number of sentences to consider for similarity threshold calculation

min_sentences
Optional[int]
default:"1"

Minimum number of sentences per chunk

min_characters_per_sentence
Optional[int]
default:"12"

Minimum number of characters per sentence

min_chunk_size
Optional[int]
default:"2"

Minimum tokens per chunk

threshold_step
Optional[float]
default:"0.01"

Step size for similarity threshold calculation

delim
Union[str, List[str]]
default:"['.', '!', '?', '\\n']"

Delimiters to split sentences on

return_type
Optional[str]
default:"chunks"

Return type, either “chunks” or “texts”

Methods

The SemanticChunker has the following methods:

__call__

The __call__ method allows you to call the chunker like a function, which uses the .chunk or .chunk_batch method internally, depending on the arguments passed.

Arguments:

text
Union[str, List[str]]
default:"None"

Text to chunk.

show_progress_bar
bool
default:"True"

Whether to show a progress bar.

Returns:

Result
Union[List[SemanticChunk], List[str], List[List[SemanticChunk]], List[List[str]]]
default:"None"

Result of the chunking process.
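The dispatch behavior described above can be sketched as follows. This is a toy class illustrating the pattern (string splitting stands in for real semantic chunking), not Chonkie's source:

```python
from typing import List, Union

class ChunkerSketch:
    """Toy chunker showing how __call__ routes to .chunk or .chunk_batch."""

    def chunk(self, text: str) -> List[str]:
        # Stand-in for semantic chunking: split on sentence boundaries
        return text.split(". ")

    def chunk_batch(self, texts: List[str], show_progress_bar: bool = True) -> List[List[str]]:
        return [self.chunk(t) for t in texts]

    def __call__(self, text: Union[str, List[str]], show_progress_bar: bool = True):
        # A single string goes to .chunk; a list of strings to .chunk_batch
        if isinstance(text, str):
            return self.chunk(text)
        return self.chunk_batch(text, show_progress_bar)

chunker = ChunkerSketch()
print(chunker("One sentence. Another sentence"))  # ['One sentence', 'Another sentence']
print(chunker(["A. B", "C"]))                     # [['A', 'B'], ['C']]
```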

.chunk

The .chunk method chunks a single text into chunks.

Arguments:

text
str
default:"None"

Text to chunk.

Returns:

Result
Union[List[SemanticChunk], List[str]]
default:"None"

Result of the chunking process.

.chunk_batch

The .chunk_batch method chunks a batch of texts into chunks.

Arguments:

texts
List[str]
default:"None"

List of texts to chunk.

show_progress_bar
bool
default:"True"

Whether to show a progress bar.

Returns:

Result
Union[List[List[SemanticChunk]], List[List[str]]]
default:"None"

Result of the chunking process.

Usage Examples

Supported Embeddings

SemanticChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.

Associated Return Type

SemanticChunker returns SemanticChunk objects:

@dataclass
class SemanticSentence(Sentence):
    text: str
    start_index: int
    end_index: int
    token_count: int
    embedding: Optional[np.ndarray]  # Sentence embedding vector

@dataclass
class SemanticChunk(SentenceChunk):
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]