WordChunker
Split text into chunks while maintaining word boundaries
The WordChunker splits text into chunks while preserving word boundaries, so no word is cut in the middle and every chunk stays readable.
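To make the idea of preserving word boundaries concrete, here is a minimal pure-Python sketch. It is not Chonkie's implementation; the function name split_on_word_boundaries and the character-based budget max_chars are assumptions chosen for illustration.

```python
import re


def split_on_word_boundaries(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    Hypothetical helper for illustration only, not part of Chonkie.
    """
    # Each piece is one word plus the whitespace after it, so joining the
    # pieces (and therefore the chunks) reproduces the input exactly.
    pieces = re.findall(r"\S+\s*|\s+", text)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Close the current chunk when this word would exceed the budget.
        # A single word longer than max_chars is kept whole, never split.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


sample = "The quick brown fox jumps over the lazy dog"
chunks = split_on_word_boundaries(sample, max_chars=16)
for chunk in chunks:
    print(repr(chunk))
```

Because every chunk ends at a word boundary, concatenating the chunks gives back the original text.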
Installation
WordChunker is included in the base installation of Chonkie. No additional dependencies are required.
Initialization
Here’s how to initialize the WordChunker with the default parameters.
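A minimal sketch of default initialization follows. It assumes the chonkie package is installed and exposes WordChunker at the top level; the import is guarded so the snippet still runs where the package is absent. No keyword arguments are passed, since their exact names are not given here.

```python
try:
    from chonkie import WordChunker
except ImportError:
    # chonkie is not installed in this environment; skip the demonstration.
    WordChunker = None

chunker = None
if WordChunker is not None:
    # Default parameters: no arguments are passed to the constructor.
    chunker = WordChunker()
    # Calling the chunker like a function chunks a single text.
    for chunk in chunker("Chonkie keeps every word intact while chunking."):
        print(chunk)
```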
See the Usage Examples section for more ways to use the chunker.
Parameters
Tokenizer to use. Can be a string identifier or a tokenizer instance. If a callable is passed, it will be used as the token counter. If “character” is passed, the character tokenizer will be used. If “word” is passed, the word tokenizer will be used.
Maximum number of tokens per chunk
Number of overlapping tokens between chunks
Return type of the chunker; “chunks” or “texts”
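The interplay between the token limit, the overlap, and a callable token counter can be sketched in plain Python. This is an illustration under stated assumptions, not the library's algorithm: the names chunk_words, max_tokens, overlap_tokens, and count_tokens are hypothetical, and the default counter treats each word as one token.

```python
from typing import Callable


def chunk_words(
    text: str,
    max_tokens: int,
    overlap_tokens: int = 0,
    count_tokens: Callable[[str], int] = lambda word: 1,
) -> list[str]:
    """Group whole words into chunks, repeating trailing words as overlap.

    Hypothetical helper; parameter names are assumptions for illustration.
    """
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    words = text.split()
    chunks: list[str] = []
    start = 0
    while start < len(words):
        end, used = start, 0
        # Take whole words until the budget is spent (always at least one).
        while end < len(words) and (
            end == start or used + count_tokens(words[end]) <= max_tokens
        ):
            used += count_tokens(words[end])
            end += 1
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back over trailing words worth at most overlap_tokens tokens,
        # but always advance by at least one word so the loop terminates.
        next_start, carried = end, 0
        while (
            next_start > start + 1
            and carried + count_tokens(words[next_start - 1]) <= overlap_tokens
        ):
            carried += count_tokens(words[next_start - 1])
            next_start -= 1
        start = next_start
    return chunks


text = "one two three four five six seven"
by_words = chunk_words(text, max_tokens=3, overlap_tokens=1)
# Passing len as the counter measures each word in characters instead
# (the joining spaces are not counted in this sketch).
by_chars = chunk_words(text, max_tokens=10, count_tokens=len)
print(by_words)
print(by_chars)
```

With an overlap of one token, the last word of each chunk reappears as the first word of the next, which keeps some context across chunk borders.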
Methods
The WordChunker class provides the following methods.
__call__
The __call__ method lets you call the chunker like a function. Internally it uses the .chunk or .chunk_batch method, depending on the arguments passed.
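The dispatch described above can be sketched with a small stand-in class. DispatchingChunker and its words_per_chunk parameter are hypothetical and only mirror the described behavior (a single string goes to chunk, a list goes to chunk_batch); the progress-bar argument is left out.

```python
class DispatchingChunker:
    """Hypothetical stand-in showing __call__ dispatching by input type."""

    def __init__(self, words_per_chunk: int = 3) -> None:
        self.words_per_chunk = words_per_chunk

    def chunk(self, text: str) -> list[str]:
        # Split one text into groups of whole words.
        words = text.split()
        step = self.words_per_chunk
        return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

    def chunk_batch(self, texts: list[str]) -> list[list[str]]:
        # Chunk each text independently and keep one result list per text.
        return [self.chunk(text) for text in texts]

    def __call__(self, text):
        # A single string is chunked directly; anything else is treated
        # as a batch of texts.
        if isinstance(text, str):
            return self.chunk(text)
        return self.chunk_batch(list(text))


chunker = DispatchingChunker(words_per_chunk=2)
single = chunker("alpha beta gamma")
batch = chunker(["alpha beta gamma", "delta"])
print(single)
print(batch)
```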
Arguments:
Text to chunk.
Whether to show a progress bar (only works if text is a list).
Returns:
Result of the chunking process.
.chunk
The .chunk method splits a single text into chunks.
Arguments:
Text to chunk.
Returns:
Result of the chunking process.
.chunk_batch
The .chunk_batch method splits each text in a batch into chunks.
Arguments:
List of texts to chunk.
Whether to show a progress bar.
Returns:
Result of the chunking process.
Usage Examples
Associated Return Types
WordChunker returns chunks as Chunk objects with the following attributes:
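As a rough sketch of what such a chunk object might carry, the record below assumes a text field, character offsets into the source, and a token count. The class name ChunkRecord, the helper to_records, and all field names are illustrative assumptions, not confirmed by this page; check the library's Chunk type for the actual attributes.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """Hypothetical chunk record; field names are assumptions."""

    text: str          # the chunk's text
    start_index: int   # offset of the first character in the source
    end_index: int     # offset one past the last character in the source
    token_count: int   # here: number of whitespace-separated words


def to_records(source: str, chunk_texts: list[str]) -> list[ChunkRecord]:
    """Attach source offsets and a word count to each chunk text."""
    records: list[ChunkRecord] = []
    cursor = 0
    for piece in chunk_texts:
        # Search from the cursor so repeated words map to the right position.
        start = source.index(piece, cursor)
        end = start + len(piece)
        records.append(ChunkRecord(piece, start, end, len(piece.split())))
        cursor = end
    return records


source = "The quick brown fox"
records = to_records(source, ["The quick ", "brown fox"])
for record in records:
    print(record)
```

Keeping offsets alongside the text makes it possible to map any chunk back to its exact position in the original document.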