The SDPMChunker extends semantic chunking with a double-pass merging approach: it first groups content by semantic similarity, then merges similar groups within a skip window, connecting related content that is not necessarily consecutive in the text. This technique is particularly useful for documents with recurring themes or related concepts that appear far apart.
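The second pass can be illustrated with a toy sketch (not Chonkie's actual implementation; the `share_word` similarity function is a hypothetical stand-in for embedding similarity). A group looks ahead up to `skip_window` positions for a similar group, and everything in between is merged, bridging unrelated material:

```python
def merge_with_skip_window(groups, similar, skip_window=1):
    """Second-pass merge: if a group within `skip_window` positions
    ahead is similar to the current one, merge everything in between
    (toy sketch, not Chonkie's actual implementation)."""
    merged = []
    i = 0
    while i < len(groups):
        current = list(groups[i])
        window_end = min(i + skip_window + 1, len(groups))
        match = next(
            (k for k in range(i + 1, window_end) if similar(groups[i], groups[k])),
            None,
        )
        if match is not None:
            # Bridge any dissimilar groups between i and the match.
            for k in range(i + 1, match + 1):
                current.extend(groups[k])
            i = match + 1
        else:
            i += 1
        merged.append(current)
    return merged

# Hypothetical similarity: two groups are "similar" if they share a word.
def share_word(a, b):
    return bool(set(" ".join(a).split()) & set(" ".join(b).split()))

groups = [["GPUs accelerate training"],
          ["Data quality matters"],
          ["TPUs accelerate inference"]]
# With skip_window=2, group 0 can merge with group 2 despite the
# unrelated group in between.
print(merge_with_skip_window(groups, share_word, skip_window=2))
```

With a larger `skip_window`, more intervening groups can be bridged; with `skip_window=0` the second pass changes nothing.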

Installation

SDPMChunker requires additional dependencies for semantic capabilities. You can install it with:

pip install "chonkie[semantic]"
For installation instructions, see the Installation Guide.

Initialization

from chonkie import SDPMChunker

# Basic initialization
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",  # Default embedding model
    threshold=0.5,                               # Similarity threshold in [0, 1] ("auto" by default)
    chunk_size=512,                              # Maximum tokens per chunk
    min_sentences=1,                             # Minimum sentences per chunk
    skip_window=1                                # Number of chunks to skip when looking for similarities
)

Parameters

embedding_model
Union[str, BaseEmbeddings], default: "minishlab/potion-base-8M"

Model identifier or embedding model instance

mode
Optional[str], default: "window"

Mode for grouping sentences, either "cumulative" or "window"

threshold
Union[float, int, str], default: "auto"

When in the range [0, 1], the value is used directly as the similarity threshold for considering sentences similar. When in the range (1, 100], it is interpreted as a percentile threshold. When set to "auto", the threshold is calculated automatically.
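These three forms can be sketched with a small helper (hypothetical, not Chonkie's internal logic; the median fallback for "auto" is only a stand-in for the automatic calculation):

```python
import statistics

def resolve_threshold(threshold, similarities):
    """Map a threshold spec to a concrete similarity cutoff (illustrative only)."""
    if threshold == "auto":
        # Stand-in: Chonkie computes this automatically; we just take the median.
        return statistics.median(similarities)
    if 0 <= threshold <= 1:
        # Used directly as the similarity cutoff.
        return float(threshold)
    if 1 < threshold <= 100:
        # Interpreted as a percentile of the observed similarities.
        idx = round((threshold / 100) * (len(similarities) - 1))
        return sorted(similarities)[idx]
    raise ValueError(f"invalid threshold: {threshold!r}")

sims = [0.2, 0.4, 0.6, 0.8]
print(resolve_threshold(0.5, sims))   # -> 0.5 (used directly)
print(resolve_threshold(75, sims))    # -> 0.6 (75th-percentile similarity)
```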

chunk_size
int, default: 512

Maximum tokens per chunk

similarity_window
int, default: 1

Number of sentences to consider for similarity threshold calculation

min_sentences
int, default: 1

Minimum number of sentences per chunk

min_chunk_size
Optional[int], default: None

Minimum tokens per chunk

min_characters_per_sentence
int, default: 12

Minimum number of characters per sentence

threshold_step
Optional[float], default: 0.01

Step size for threshold calculation

delim
Union[str, List[str]], default: "['.', '!', '?', '\\n']"

Delimiters to split sentences on

skip_window
int, default: 1

Number of chunks to skip when looking for similarities

Usage

Single Text Chunking

text = """The neural network processes input data through layers.
Training data is essential for model performance.
GPUs accelerate neural network computations significantly.
Quality training data improves model accuracy.
TPUs provide specialized hardware for deep learning.
Data preprocessing is a crucial step in training."""

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Batch Chunking

texts = [
    "Document with scattered but related content...",
    "Another document with similar patterns..."
]
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Supported Embeddings

SDPMChunker supports multiple embedding providers through Chonkie’s embedding system. See the Embeddings Overview for more information.

Return Type

SDPMChunker returns SemanticChunk objects with optimized storage using slots:

@dataclass
class SemanticSentence(Sentence):
    text: str
    start_index: int
    end_index: int
    token_count: int
    embedding: Optional[np.ndarray]  # Sentence embedding vector
    
    __slots__ = ['embedding']  # Optimized memory usage

@dataclass
class SemanticChunk(SentenceChunk):
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence]