TokenChunker
Split text into fixed-size token chunks with configurable overlap
The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within the specified token limit.
Installation
TokenChunker is included in the base installation of Chonkie. No additional dependencies are required.
Initialization
Here’s how to initialize the TokenChunker with the default parameters.
See the Usage Examples section for more ways to use the chunker.
Parameters
Tokenizer to use. Can be a string identifier or a tokenizer instance. Passing "character" uses the character tokenizer, which counts each character as one token. Passing "word" uses the word tokenizer, which counts each word as one token.
Maximum number of tokens per chunk.
Number of overlapping tokens between consecutive chunks, given either as an integer count or as a float. A float is interpreted as a fraction of the chunk size (for example, 0.25 means 25% of the chunk size).
Return type of the chunker: "chunks" or "texts".
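To make the interaction between the chunk size and the overlap concrete, here is a minimal pure-Python sketch. It is not Chonkie's implementation: `resolve_overlap` and `chunk_tokens` are hypothetical helper names, and rounding a float overlap down to a whole number of tokens is an assumption.

```python
from typing import List, Sequence, Union


def resolve_overlap(chunk_size: int, overlap: Union[int, float]) -> int:
    """Convert an int or float overlap into a token count.

    Assumption: a float is a fraction of chunk_size, rounded down.
    """
    if isinstance(overlap, float):
        overlap = int(chunk_size * overlap)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    return overlap


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int, overlap: Union[int, float] = 0
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window after the first repeats the last `overlap` tokens of the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    step = chunk_size - resolve_overlap(chunk_size, overlap)
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Word-level tokens: each word counts as one token, overlap given as a fraction.
text = "the quick brown fox jumps over the lazy dog"
word_chunks = chunk_tokens(text.split(), chunk_size=4, overlap=0.5)

# Character-level tokens: each character counts as one token, overlap given as a count.
char_chunks = chunk_tokens(list("abcdefgh"), chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 0.5, each window starts 2 tokens after the previous one, so consecutive chunks share 2 tokens.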
Methods
__call__
The __call__ method allows you to call the chunker like a function, which uses the .chunk or .chunk_batch method internally, depending on the arguments passed.
Arguments:
Text to chunk.
Batch size for chunking.
Whether to show a progress bar.
Returns:
List of chunks or texts.
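The dispatch described above can be sketched as follows. `SketchChunker` is a hypothetical stand-in, not Chonkie's class; it assumes a single string is routed to `.chunk` and any other sequence to `.chunk_batch`, and it omits the batch size and progress bar arguments to stay short.

```python
from typing import List, Sequence, Union


class SketchChunker:
    """Hypothetical chunker used only to illustrate __call__ dispatch."""

    def __init__(self, chunk_size: int = 5) -> None:
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> List[str]:
        # Character-level split without overlap, to keep the sketch minimal.
        return [
            text[i:i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]

    def chunk_batch(self, texts: Sequence[str]) -> List[List[str]]:
        # One list of chunks per input text, in input order.
        return [self.chunk(t) for t in texts]

    def __call__(
        self, text: Union[str, Sequence[str]]
    ) -> Union[List[str], List[List[str]]]:
        # A single string goes to .chunk; anything else is treated as a batch.
        if isinstance(text, str):
            return self.chunk(text)
        return self.chunk_batch(text)


chunker = SketchChunker(chunk_size=5)
single = chunker("hello world")          # routed to .chunk
batch = chunker(["hello world", "hi"])   # routed to .chunk_batch
```

Note that the result shape differs: a single text yields one flat list, while a batch yields one list per input text.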
.chunk
The .chunk method splits a single text into chunks.
Arguments:
Text to chunk.
Returns:
List of chunk objects or texts.
.chunk_batch
The .chunk_batch method splits each text in a batch into chunks.
Arguments:
List of texts to chunk.
Batch size for chunking.
Whether to show a progress bar.
Returns:
List of chunk objects or texts.
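As a rough illustration of what a batch size means here, the sketch below processes the input texts in groups and keeps the results in input order. `iter_batches` and `chunk_batch_sketch` are hypothetical names, the per-text chunking function is passed in as a parameter, and nothing here reflects how Chonkie schedules batches internally.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def chunk_batch_sketch(
    texts: Sequence[str],
    chunk_one: Callable[[str], List[str]],
    batch_size: int = 2,
) -> List[List[str]]:
    """Chunk every text, one batch at a time, preserving input order."""
    results: List[List[str]] = []
    for batch in iter_batches(texts, batch_size):
        results.extend(chunk_one(t) for t in batch)
    return results


docs = ["alpha beta gamma", "delta", "epsilon zeta"]
# str.split stands in for a real chunking function (one chunk per word).
results = chunk_batch_sketch(docs, str.split, batch_size=2)
```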
Usage Examples
Associated Return Types
TokenChunker returns chunks as Chunk objects with the following attributes:
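The sketch below shows the kind of record a chunk object typically carries. `SketchChunk` is a hypothetical dataclass, and its field names (`text`, `start_index`, `end_index`, `token_count`) are assumptions for illustration; check the `Chunk` class in your installed Chonkie version for the actual attributes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchChunk:
    """Hypothetical chunk record; field names are assumptions."""

    text: str          # the chunk's text
    start_index: int   # offset of the chunk's first character in the source text
    end_index: int     # offset one past the chunk's last character
    token_count: int   # number of tokens in the chunk


def chunk_characters(text: str, chunk_size: int, overlap: int = 0) -> List[SketchChunk]:
    """Character-level chunking (one character = one token) that records offsets."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[SketchChunk] = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(SketchChunk(text[start:end], start, end, end - start))
        if end == len(text):
            break  # the last chunk already reached the end of the text
    return chunks


source = "Chunking keeps offsets."
chunks = chunk_characters(source, chunk_size=10, overlap=2)

# The offsets let you recover each chunk from the original text.
recovered = [source[c.start_index:c.end_index] for c in chunks]
```

Storing offsets alongside the text makes it possible to map a chunk back to its position in the source document, which is useful for highlighting or citation.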