{
  "text": "<string>",
  "start_index": 123,
  "end_index": 123,
  "token_count": 123
}
The Semantic Chunker uses embeddings to find natural break points in text, splitting it into chunks at the places where the topic changes.
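To make the idea of an embedding-based break point concrete, here is a minimal sketch, not the service's actual algorithm. It assumes one embedding per sentence and starts a new chunk whenever the cosine similarity between neighbouring sentences falls below a cutoff. The names `cosine`, `split_on_topic_shift`, and `min_similarity` are hypothetical, the vectors are hand-made toy values rather than output of a real embedding model, and `min_similarity` is not meant to model the API's `threshold` parameter.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_topic_shift(sentences, vectors, min_similarity=0.8):
    """Group consecutive sentences, opening a new group when neighbours are dissimilar.

    Illustrative assumption only: the hosted chunker may use a different rule.
    """
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # A drop in similarity between adjacent sentences marks a topic transition.
        if cosine(vectors[i - 1], vectors[i]) < min_similarity:
            groups.append([])
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


sentences = ["Cats purr.", "Cats nap often.", "Stocks fell today."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]  # toy embeddings, not real model output
chunks = split_on_topic_shift(sentences, vectors)
print(chunks)
```

In this toy input the first two sentences point in nearly the same direction and stay together, while the third is almost orthogonal and opens a new chunk.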

Examples

Text Input

from chonkie.cloud import SemanticChunker

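# Configure the chunker; both values shown are the documented defaults.
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    chunk_size=512,
)

text = "Your text here..."
# Returns an array of Chunk objects (see Response).
chunks = chunker.chunk(text)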

File Input

from chonkie.cloud import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    chunk_size=512,
)

# Open the file in binary mode and pass the handle as `file` instead of `text`.
with open("document.txt", "rb") as f:
    chunks = chunker.chunk(file=f)

Request

Parameters

text
string | string[]
The text to chunk. Can be a single string or an array of strings for batch processing. Either text or file is required.
file
file
File to chunk. Use multipart/form-data encoding. Either text or file is required.
embedding_model
string
default:"minishlab/potion-base-8M"
Embedding model used to measure semantic similarity.
tokenizer
string
default:"gpt2"
Tokenizer to use for counting tokens.
chunk_size
integer
default:"512"
Target number of tokens per chunk (soft limit).
threshold
float
default:"0.8"
Semantic similarity threshold, from 0 to 1. Lower values create more chunks.
min_sentences_per_chunk
integer
default:"1"
Minimum number of sentences per chunk.
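The parameters above can be collected and sanity-checked before a request is sent. The following is a sketch under stated assumptions: `build_chunk_params` is a hypothetical helper, not part of the Chonkie client, and the checks that `chunk_size` and `min_sentences_per_chunk` are at least 1 are assumptions rather than documented constraints. The defaults and the "either text or file" rule come from the parameter list.

```python
def build_chunk_params(
    text=None,
    has_file=False,
    *,
    embedding_model="minishlab/potion-base-8M",
    tokenizer="gpt2",
    chunk_size=512,
    threshold=0.8,
    min_sentences_per_chunk=1,
):
    """Hypothetical helper: validate inputs and return the non-file request fields."""
    # Documented rule: either text or file is required.
    if text is None and not has_file:
        raise ValueError("either text or file is required")
    # Documented range for threshold is 0-1.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumed (not documented) lower bounds for the integer parameters.
    if chunk_size < 1 or min_sentences_per_chunk < 1:
        raise ValueError("chunk_size and min_sentences_per_chunk must be >= 1")

    params = {
        "embedding_model": embedding_model,
        "tokenizer": tokenizer,
        "chunk_size": chunk_size,
        "threshold": threshold,
        "min_sentences_per_chunk": min_sentences_per_chunk,
    }
    if text is not None:
        # A single string or a list of strings (batch processing).
        params["text"] = text
    return params


params = build_chunk_params(
    text=["First document.", "Second document."],
    threshold=0.7,
)
print(params)
```

A file upload is left out of the returned dictionary because, per the parameter list, the file travels as multipart/form-data rather than as a plain field.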

Response

Returns

An array of Chunk objects, each holding a semantically coherent segment of the input text.
text
string
The chunk text content.
start_index
integer
Starting character position in the original text.
end_index
integer
Ending character position in the original text.
token_count
integer
Number of tokens in the chunk.
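As a sketch of how these fields might be used, the snippet below checks each chunk's offsets against the original text and sums the token counts. It works on plain dictionaries shaped like the response fields. The sample values, including the token counts, are made up for illustration, `offsets_match` is a hypothetical helper, and treating `end_index` as exclusive (Python slice style) is an assumption the reference does not confirm.

```python
original = "Cats purr. Stocks fell."

# Made-up response data shaped like the documented Chunk fields.
chunks = [
    {"text": "Cats purr. ", "start_index": 0, "end_index": 11, "token_count": 4},
    {"text": "Stocks fell.", "start_index": 11, "end_index": 23, "token_count": 3},
]


def offsets_match(original, chunks):
    """Return True if every chunk's text equals the slice its indices point at.

    Assumes end_index is exclusive, which is not confirmed by the reference.
    """
    return all(
        original[c["start_index"]:c["end_index"]] == c["text"] for c in chunks
    )


total_tokens = sum(c["token_count"] for c in chunks)
print(offsets_match(original, chunks), total_tokens)
```

If the check fails against real responses, the exclusive-end assumption is the first thing to revisit.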