The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk. This approach is particularly useful for RAG applications where context preservation is crucial.

Installation

SemanticChunker requires additional dependencies for semantic capabilities. You can install it with:

pip install "chonkie[semantic]"

For installation instructions, see the Installation Guide.

Initialization

from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold in [0, 1], a percentile in (1, 100], or "auto"
    chunk_size=512,                              # Maximum tokens per chunk
    min_sentences=1                              # Minimum sentences per chunk
)

# Using a custom embedding model
from chonkie.embeddings import BaseEmbeddings

class CustomEmbeddings(BaseEmbeddings):
    # Implement required methods...
    pass

custom_embeddings = CustomEmbeddings()
chunker = SemanticChunker(
    embedding_model=custom_embeddings,
    threshold=0.5,
    chunk_size=512
)

Parameters

embedding_model
Union[str, BaseEmbeddings]
default: "minishlab/potion-base-8M"

Model identifier or embedding model instance

mode
Optional[str]
default: "window"

Mode for grouping sentences, either “cumulative” or “window”

threshold
Union[float, int, str]
default: "auto"

When in the range [0, 1], the value is used directly as the similarity threshold for considering sentences similar. When in the range (1, 100], it is interpreted as a percentile threshold. When set to "auto", the threshold is calculated automatically. See the example after this parameter list.

chunk_size
int
default: "512"

Maximum tokens per chunk

similarity_window
int
default: "1"

Number of sentences to consider for similarity threshold calculation

min_sentences
int
default: "1"

Minimum number of sentences per chunk

min_characters_per_sentence
int
default: "12"

Minimum number of characters per sentence

min_chunk_size
Optional[int]
default: "None"

Minimum tokens per chunk

threshold_step
Optional[float]
default: "0.01"

Step size for similarity threshold calculation

delim
Union[str, List[str]]
default: "['.', '!', '?', '\n']"

Delimiters to split sentences on
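
The threshold and mode parameters can be combined in a few documented ways. The snippet below is a brief illustrative sketch; the specific values are examples, not recommendations.

from chonkie import SemanticChunker

# Automatic threshold calculation (the default)
auto_chunker = SemanticChunker(threshold="auto")

# Fixed similarity threshold in [0, 1]
fixed_chunker = SemanticChunker(threshold=0.7)

# Percentile threshold in (1, 100]
percentile_chunker = SemanticChunker(threshold=80)

# Cumulative grouping of sentences instead of the default sliding window
cumulative_chunker = SemanticChunker(mode="cumulative", threshold="auto")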

Usage

Single Text Chunking

text = """First paragraph about a specific topic.
Second paragraph continuing the same topic.
Third paragraph switching to a different topic.
Fourth paragraph expanding on the new topic."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Batch Chunking

texts = [
    "First document about topic A...",
    "Second document about topic B..."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Supported Embeddings

SemanticChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.
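
As an illustration, a model identifier string from another provider can be passed directly as embedding_model. The identifier below is only an example and assumes the corresponding provider and its extra dependencies are installed; check the Embeddings Overview for what your installation supports.

from chonkie import SemanticChunker

# Example identifier only; assumes the sentence-transformers provider and its
# dependencies are available in your environment.
chunker = SemanticChunker(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold="auto",
    chunk_size=512,
)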

Return Type

SemanticChunker returns SemanticChunk objects. Each chunk's sentences are SemanticSentence objects, which use __slots__ for optimized memory usage:

@dataclass
class SemanticSentence(Sentence):
    text: str
    start_index: int
    end_index: int
    token_count: int
    embedding: Optional[np.ndarray]  # Sentence embedding vector
    
    __slots__ = ['embedding']  # Optimized memory usage

@dataclass
class SemanticChunk(SentenceChunk):
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]  # Sentences in the chunk, each carrying its embedding
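
Because these are plain dataclass fields, chunk metadata and the per-sentence embeddings can be read directly from the chunker's output. A minimal sketch:

chunks = chunker.chunk(text)

for chunk in chunks:
    # Character offsets of the chunk within the original text
    print(chunk.start_index, chunk.end_index, chunk.token_count)
    for sentence in chunk.sentences:
        # The embedding is Optional; when present it is a NumPy vector
        if sentence.embedding is not None:
            print(sentence.text, sentence.embedding.shape)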