The SDPMChunker extends semantic chunking with a double-pass merging approach: it first groups content by semantic similarity, then merges similar groups within a skip window, connecting related content that is not necessarily consecutive in the text. This makes it particularly useful for documents with recurring themes or related concepts spread throughout the text.
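The double-pass idea can be illustrated with a small, self-contained sketch (plain Python over toy embedding vectors, not Chonkie's actual implementation): the first pass groups consecutive sentences whose similarity meets a threshold; the second pass looks ahead up to the skip window and merges through any later group whose centroid is similar enough.

```python
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def centroid(embeddings: List[List[float]], group: List[int]) -> List[float]:
    # Mean vector of a group's sentence embeddings.
    dims = len(embeddings[0])
    return [sum(embeddings[i][d] for i in group) / len(group) for d in range(dims)]

def first_pass(embeddings: List[List[float]], threshold: float) -> List[List[int]]:
    # Group consecutive sentences whose similarity to the previous
    # sentence meets the threshold.
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def second_pass(embeddings, groups, threshold, skip_window=1):
    # Look past the adjacent group, up to skip_window further ahead;
    # if a similar group is found, merge everything in between so the
    # resulting chunk stays contiguous.
    merged = []
    i = 0
    while i < len(groups):
        current = list(groups[i])
        last = min(i + 1 + skip_window, len(groups) - 1)
        match = None
        for j in range(i + 1, last + 1):
            if cosine(centroid(embeddings, current),
                      centroid(embeddings, groups[j])) >= threshold:
                match = j
        if match is not None:
            for j in range(i + 1, match + 1):
                current.extend(groups[j])
            i = match + 1
        else:
            i += 1
        merged.append(current)
    return merged
```

For example, with sentence embeddings following an A-A-B-A topic pattern, the first pass yields three groups, and the second pass (skip_window=1) reunites the two A groups, absorbing the intervening B sentence into one contiguous chunk.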

Installation

SDPMChunker requires additional dependencies for semantic capabilities. You can install it with:

pip install "chonkie[semantic]"
For installation instructions, see the Installation Guide.

Initialization

from chonkie import SDPMChunker

# Basic initialization with default parameters
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    similarity_threshold=0.5,                    # Similarity threshold (0-1)
    chunk_size=512,                              # Maximum tokens per chunk
    initial_sentences=1,                         # Initial sentences per chunk
    skip_window=1                                # Number of chunks to skip when looking for similarities
)

# Using a custom embedding model
from chonkie.embeddings import BaseEmbeddings

class CustomEmbeddings(BaseEmbeddings):
    # Implement required methods...
    pass

custom_embeddings = CustomEmbeddings()
chunker = SDPMChunker(
    embedding_model=custom_embeddings,
    similarity_threshold=0.5,
    chunk_size=512
)

Parameters

embedding_model
Union[str, BaseEmbeddings]
default: "minishlab/potion-base-8M"

Model identifier or embedding model instance

similarity_threshold
Optional[float]
default: "None"

Minimum similarity score (0-1) to consider sentences similar

similarity_percentile
Optional[float]
default: "0.8"

Percentile-based threshold (0-1) for similarity

chunk_size
int
default: "512"

Maximum tokens per chunk

min_chunk_size
Optional[int]
default: "None"

Minimum tokens per chunk

initial_sentences
int
default: "1"

Number of sentences to start each chunk with

skip_window
int
default: "1"

Number of chunks to skip when looking for similarities
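When similarity_percentile is set instead of a fixed similarity_threshold, the cutoff is derived from the distribution of similarity scores in the document itself, so it adapts to each text. A rough, stdlib-only illustration of the idea (this is a sketch of percentile-based thresholding in general, not Chonkie's internal code):

```python
def percentile(values, p):
    # Linear-interpolation percentile; p is a fraction in [0, 1].
    s = sorted(values)
    if len(s) == 1:
        return s[0]
    k = p * (len(s) - 1)
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# Toy similarities between consecutive sentences in a document.
similarities = [0.91, 0.15, 0.88, 0.40, 0.95]

# With similarity_percentile=0.8, the split threshold is the 80th
# percentile of the observed scores rather than a fixed value.
threshold = percentile(similarities, 0.8)
```

A fixed threshold can over- or under-split documents whose sentences are uniformly close (or far) in embedding space; a percentile-based cutoff sidesteps that by ranking each document's own similarity scores.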

Usage

Single Text Chunking

text = """The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.
Data preprocessing is a crucial step in training."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Batch Chunking

texts = [
    "Document with scattered but related content...",
    "Another document with similar patterns..."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Supported Embeddings

SDPMChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.

Return Type

SDPMChunker returns SemanticChunk objects with optimized storage using slots:

@dataclass
class SemanticSentence(Sentence):
    text: str
    start_index: int
    end_index: int
    token_count: int
    embedding: Optional[np.ndarray]  # Sentence embedding vector
    
    __slots__ = ['embedding']  # Optimized memory usage

@dataclass
class SemanticChunk(SentenceChunk):
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]