The Semantic Chunker uses embeddings to identify natural break points in text based on semantic meaning, creating chunks where topic transitions occur.
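To make the idea of a semantic break point concrete, here is a minimal, self-contained sketch. It is not the service's implementation: the `split_on_topic_shift` helper, the `min_similarity` cutoff, and the hand-made 2-D vectors are all hypothetical stand-ins for a real embedding model and for whatever boundary rule the API applies.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_on_topic_shift(sentences, vectors, min_similarity=0.5):
    """Group consecutive sentences, starting a new group whenever the
    similarity between neighbouring sentence vectors falls below the cutoff.
    (Illustrative rule only; `min_similarity` is not an API parameter.)"""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < min_similarity:
            groups.append([])  # topic transition: open a new chunk
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]

sentences = [
    "Cats purr when content.",
    "Kittens sleep most of the day.",
    "GPUs accelerate matrix math.",
    "Tensor cores speed up training.",
]
# Hand-made 2-D vectors standing in for real sentence embeddings.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

chunks = split_on_topic_shift(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

With these toy vectors the two cat sentences land in one chunk and the two hardware sentences in another, because similarity drops sharply between the second and third sentence.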
Examples
Text Input
from chonkie.cloud import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    chunk_size=512,
)

text = "Your text here..."
chunks = chunker.chunk(text)
File Input
from chonkie.cloud import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    chunk_size=512,
)

# Chunk from file (opened in binary mode for upload)
with open("document.txt", "rb") as f:
    chunks = chunker.chunk(file=f)
Request
Parameters
text
The text to chunk. Can be a single string or an array of strings for batch processing. Either text or file is required.
file
File to chunk. Use multipart/form-data encoding. Either text or file is required.
embedding_model
string
default:"minishlab/potion-base-8M"
The embedding model used to measure semantic similarity between segments of text.
Tokenizer to use for counting tokens.
chunk_size
Target number of tokens per chunk (soft limit).
Threshold for semantic similarity (0-1). Lower values create more chunks.
Minimum number of sentences per chunk.
Response
Returns
Array of Chunk objects with semantically coherent text segments.
Starting character position in the original text.
Ending character position in the original text.
Number of tokens in the chunk.
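Because each chunk carries character positions into the original text, a client can sanity-check a response by slicing. The sketch below is an illustration under stated assumptions: the `Chunk` dataclass and its field names (`text`, `start_index`, `end_index`, `token_count`) are hypothetical stand-ins for the response objects, the end position is assumed to be exclusive, and the token counts are placeholder values.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical stand-in for a returned chunk; field names are assumptions."""
    text: str
    start_index: int   # starting character position in the original text
    end_index: int     # ending character position (assumed exclusive)
    token_count: int   # number of tokens in the chunk

def offsets_match(original, chunks):
    """Return True if every chunk's text equals the slice its offsets describe."""
    return all(original[c.start_index:c.end_index] == c.text for c in chunks)

original = "First topic. Second topic."
chunks = [
    Chunk("First topic. ", 0, 13, 3),    # token counts here are placeholders
    Chunk("Second topic.", 13, 26, 3),
]

print(offsets_match(original, chunks))
```

If the real API treats the end position as inclusive, the slice would need `c.end_index + 1` instead.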