The Token Chunker splits text into chunks based on token count, ensuring each chunk stays within specified token limits.
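Conceptually, the chunker encodes the text into tokens and slices the token sequence into fixed-size windows, stepping forward by chunk_size minus chunk_overlap so that consecutive chunks share chunk_overlap tokens. A minimal sketch of that idea, with a whitespace tokenizer standing in for a real one (an illustration, not chonkie's implementation):

def token_windows(tokens, chunk_size, chunk_overlap):
    # Stride between window starts; consecutive windows
    # share chunk_overlap tokens.
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
    return windows

# "Tokenize" by whitespace purely for illustration.
print(token_windows("a b c d e f g h i j".split(), chunk_size=4, chunk_overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]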
Examples
Text Input
from chonkie.cloud import TokenChunker

chunker = TokenChunker(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=128
)

text = "Your text here..."
chunks = chunker.chunk(text)
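Each element of chunks describes one chunk. Assuming the fields listed under Returns are exposed as attributes named start_index, end_index, and token_count (names assumed here, not given on this page), the output can be inspected like this:

# Attribute names are assumptions based on the Returns section below.
for chunk in chunks:
    print(chunk.token_count, chunk.start_index, chunk.end_index)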
File Input

from chonkie.cloud import TokenChunker

chunker = TokenChunker(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=128
)

# Chunk from file
with open("document.txt", "rb") as f:
    chunks = chunker.chunk(file=f)
Request
Parameters
text
    The text to chunk. Can be a single string or an array of strings for batch processing (see the sketch after this list). Either text or file is required.

file
    File to chunk. Use multipart/form-data encoding. Either text or file is required.

tokenizer
    Tokenizer to use for counting tokens. Options: "gpt2", "character", "word", or any Hugging Face tokenizer.

chunk_size
    Maximum number of tokens per chunk.

chunk_overlap
    Number of tokens to overlap between consecutive chunks.
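Because text also accepts an array of strings, a batch call can be sketched as below; whether the result comes back as one flat list or as one list of chunks per input is not specified on this page, so treat the return shape as an assumption.

# Batch sketch: passing a list of strings; return shape assumed.
docs = ["First document...", "Second document..."]
batch_chunks = chunker.chunk(docs)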
Response
Returns
Array of Chunk objects, each containing:

- Starting character position in the original text.
- Ending character position in the original text.
- Number of tokens in the chunk.
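The offsets are character positions into the original input, so a chunk's substring can be recovered by slicing. Assuming the attributes are named start_index and end_index (as above, an assumption):

# start_index/end_index are assumed names; the offsets index the
# original string, so slicing recovers the chunk's text.
first = chunks[0]
print(text[first.start_index:first.end_index])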