TokenChunker
Split text into fixed-size token chunks with configurable overlap
The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within the specified token limit. It is ideal for preparing text for models with token limits, or for consistent chunking across different texts.
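Conceptually, the chunker slides a fixed-size window over the token sequence, stepping forward by the chunk size minus the overlap. A minimal sketch of that loop (whitespace-separated words stand in for real tokens here; the actual chunker counts tokens with its configured tokenizer):

```typescript
// Illustrative sketch: fixed-size windows with overlap over a token array.
// The real TokenChunker tokenizes with its configured tokenizer; here we
// use whitespace-separated words as stand-in "tokens".
function windowTokens(tokens: string[], chunkSize: number, overlap: number): string[][] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const step = chunkSize - overlap;
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

const tokens = "a b c d e f g h".split(" ");
// Windows of 4 tokens, each sharing 1 token with the previous window.
console.log(windowTokens(tokens, 4, 1));
```

Each window starts `chunkSize - overlap` tokens after the previous one, so consecutive chunks share exactly `overlap` tokens of context.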
Installation
TokenChunker is included in the base installation of Chonkie TS. No additional dependencies are required.
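Assuming the package is published under the name `chonkie` (check the project's README for the exact package name), installation is a single command:

```shell
npm install chonkie
```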
Initialization
Parameters
`tokenizer`: Tokenizer to use. Can be a string identifier (a model name such as `Xenova/gpt2`) or a `Tokenizer` instance. Defaults to the `Xenova/gpt2` tokenizer.
`chunkSize`: Maximum number of tokens per chunk.
`chunkOverlap`: Number or percentage of overlapping tokens between consecutive chunks. Can be an absolute token count (e.g., `16`) or a decimal between 0 and 1 (e.g., `0.1` for 10% overlap).
`returnType`: Whether to return chunks as `Chunk` objects (with metadata) or as plain text strings.
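The two forms of the overlap parameter reduce to a single token count before chunking. A sketch of that normalization (the helper name is illustrative, not part of the library API):

```typescript
// Normalize chunkOverlap: a value strictly between 0 and 1 is read as a
// fraction of chunkSize; anything >= 1 is taken as an absolute token count.
// (Helper name and rounding behavior are assumptions for illustration.)
function resolveOverlap(chunkOverlap: number, chunkSize: number): number {
  if (chunkOverlap < 0) throw new Error("chunkOverlap must be non-negative");
  return chunkOverlap < 1
    ? Math.floor(chunkOverlap * chunkSize) // fractional form
    : Math.floor(chunkOverlap);            // absolute form
}

console.log(resolveOverlap(16, 512));  // absolute: 16 tokens
console.log(resolveOverlap(0.1, 512)); // fractional: 10% of 512 = 51 tokens
```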
Usage
Single Text Chunking
Batch Chunking
Using as a Callable
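The three usage patterns above can be sketched with a minimal stand-in chunker. The method names mirror the patterns described here, but the implementation is a hypothetical simplification (whitespace tokenization, plain strings) rather than the library's actual code:

```typescript
// Minimal stand-in for a TokenChunker-like object, for illustration only:
// splits on whitespace instead of real tokens and returns plain strings.
type CallableChunker = {
  (text: string): string[];
  chunk(text: string): string[];
  chunkBatch(texts: string[]): string[][];
};

function makeChunker(chunkSize: number, chunkOverlap: number): CallableChunker {
  const step = chunkSize - chunkOverlap;
  const chunk = (text: string): string[] => {
    const tokens = text.split(/\s+/).filter(Boolean);
    const out: string[] = [];
    for (let i = 0; i < tokens.length; i += step) {
      out.push(tokens.slice(i, i + chunkSize).join(" "));
      if (i + chunkSize >= tokens.length) break;
    }
    return out;
  };
  // Attach .chunk and .chunkBatch to a plain function so the result
  // can also be invoked directly.
  return Object.assign((text: string) => chunk(text), {
    chunk,
    chunkBatch: (texts: string[]) => texts.map(chunk),
  });
}

const chunker = makeChunker(4, 1);
console.log(chunker.chunk("one two three four five six")); // single text
console.log(chunker.chunkBatch(["a b c d e", "f g"]));     // batch
console.log(chunker("one two three four five six"));       // callable form
```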
Return Type
TokenChunker returns chunks as `Chunk` objects by default. Each chunk includes metadata:
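A sketch of that shape (the field names here are assumptions based on typical chunk metadata; confirm the exact names in the TypeScript API Reference):

```typescript
// Illustrative shape of a chunk with metadata (field names are assumptions).
interface Chunk {
  text: string;        // the chunk's text content
  startIndex: number;  // character offset where the chunk starts in the source
  endIndex: number;    // character offset where the chunk ends
  tokenCount: number;  // number of tokens in the chunk
}

const example: Chunk = {
  text: "Some chunked text",
  startIndex: 0,
  endIndex: 17,
  tokenCount: 4,
};
console.log(`[${example.startIndex}, ${example.endIndex}) ${example.tokenCount} tokens`);
```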
If `returnType` is set to `'texts'`, only the chunked text strings are returned.
For more details, see the TypeScript API Reference.