The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within specified token limits.

Installation

TokenChunker is included in the base installation of Chonkie. No additional dependencies are required.

For installation instructions, see the Installation Guide.

Initialization

The following initializes the TokenChunker with its default parameters. See the Usage Examples section for more ways to use the chunker.

from chonkie import TokenChunker

# Basic initialization with default parameters
chunker = TokenChunker(
    tokenizer="gpt2",      # Supports string identifiers
    chunk_size=512,        # Maximum tokens per chunk
    chunk_overlap=0,       # Overlap between chunks
    return_type="chunks"   # Return type of the chunker; "chunks" or "texts"
)

Parameters

tokenizer
Union[str, tokenizers.Tokenizer, tiktoken.Encoding, transformers.PreTrainedTokenizer, Chonkie.Tokenizer]
default:"gpt2"

Tokenizer to use. Can be a string identifier or a tokenizer instance. Passing “character” selects the character tokenizer, which counts each character as 1 token. Passing “word” selects the word tokenizer, which counts each word as 1 token.

chunk_size
int
default:"512"

Maximum number of tokens per chunk.

chunk_overlap
Union[int, float]
default:"0"

Number of tokens shared between consecutive chunks. An int is taken as a token count. A float is interpreted as a fraction of the chunk size (for example, 0.25 means 25% of chunk_size).

return_type
Literal['chunks', 'texts']
default:"chunks"

Return type of the chunker; “chunks” or “texts”.
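To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of sliding-window chunking over a list of word tokens. It is not Chonkie's implementation: the helper names resolve_overlap and chunk_tokens are hypothetical, and the rule that a float overlap is truncated to an integer token count is an assumption made for illustration.

```python
from typing import List, Union


def resolve_overlap(chunk_size: int, chunk_overlap: Union[int, float]) -> int:
    """Turn an int or fractional overlap into a token count (assumed rule)."""
    if isinstance(chunk_overlap, float):
        # Assumption: a float is a fraction of chunk_size, truncated to an int.
        return int(chunk_overlap * chunk_size)
    return chunk_overlap


def chunk_tokens(
    tokens: List[str],
    chunk_size: int,
    chunk_overlap: Union[int, float] = 0,
) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = resolve_overlap(chunk_size, chunk_overlap)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


# Word-level tokens stand in for a real tokenizer here.
words = "the quick brown fox jumps over the lazy dog".split()
windows = chunk_tokens(words, chunk_size=4, chunk_overlap=0.5)
for window in windows:
    print(window)
```

With chunk_size=4 and a fractional overlap of 0.5, each window in this sketch starts two tokens after the previous one, so the last two tokens of one window are the first two of the next.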

Methods

__call__

The __call__ method lets you call the chunker like a function. Internally it uses the .chunk or .chunk_batch method, depending on whether a single text or a list of texts is passed.

Arguments:

text
Union[str, List[str]]
default:"None"

Text to chunk.

batch_size
int
default:"1"

Batch size for chunking.

show_progress_bar
bool
default:"True"

Whether to show a progress bar.

Returns:

Result
Union[List[Chunk], List[str], List[List[Chunk]], List[List[str]]]
default:"None"

List of chunks or texts for a single input text, or a list of such lists (one per text) for a list of input texts.
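The dispatch described for __call__ can be sketched with a small stand-in class. WordChunkerSketch is hypothetical and is not part of Chonkie. It splits on whitespace and returns plain strings, and it only illustrates how one entry point might route a str to .chunk and a list to .chunk_batch, which is why the result is either a flat list or a list of lists.

```python
from typing import List, Union


class WordChunkerSketch:
    """Hypothetical stand-in used only to illustrate __call__ dispatch."""

    def __init__(self, chunk_size: int = 3) -> None:
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> List[str]:
        # Group whitespace-separated words into runs of chunk_size words.
        words = text.split()
        return [
            " ".join(words[i:i + self.chunk_size])
            for i in range(0, len(words), self.chunk_size)
        ]

    def chunk_batch(self, texts: List[str]) -> List[List[str]]:
        # One result list per input text, in input order.
        return [self.chunk(text) for text in texts]

    def __call__(
        self, text: Union[str, List[str]]
    ) -> Union[List[str], List[List[str]]]:
        # A single string goes to chunk; anything else is treated as a batch.
        if isinstance(text, str):
            return self.chunk(text)
        return self.chunk_batch(text)


sketch = WordChunkerSketch(chunk_size=3)
single = sketch("one two three four five")
batch = sketch(["one two three four", "five six"])
print(single)
print(batch)
```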

.chunk

The .chunk method splits a single text into chunks.

Arguments:

text
str
default:"None"

Text to chunk.

Returns:

Result
Union[List[Chunk], List[str]]
default:"None"

List of chunk objects or texts.

.chunk_batch

The .chunk_batch method splits each text in a batch into chunks.

Arguments:

texts
List[str]
default:"None"

List of texts to chunk.

batch_size
int
default:"1"

Batch size for chunking.

show_progress_bar
bool
default:"True"

Whether to show a progress bar.

Returns:

Result
Union[List[List[Chunk]], List[List[str]]]
default:"None"

One list of chunk objects or texts per input text.

Usage Examples

Associated Return Types

TokenChunker returns chunks as Chunk objects with the following attributes:

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text
    token_count: int    # Number of tokens in chunk
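As an illustration of how these attributes relate to the source text, the sketch below builds Chunk objects with a character-level splitter, where each character counts as one token. The chunk_characters helper is hypothetical, and treating end_index as exclusive (so that slicing the original text with the two indices recovers the chunk text) is an assumption of this sketch, not a documented guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting position in original text
    end_index: int      # Ending position in original text (exclusive here)
    token_count: int    # Number of tokens in chunk


def chunk_characters(text: str, chunk_size: int) -> List[Chunk]:
    """Split text into pieces of at most chunk_size characters."""
    chunks: List[Chunk] = []
    for start in range(0, len(text), chunk_size):
        piece = text[start:start + chunk_size]
        chunks.append(
            Chunk(
                text=piece,
                start_index=start,
                end_index=start + len(piece),
                token_count=len(piece),  # one character = one token
            )
        )
    return chunks


source = "Chonkie chunks text."
chunks = chunk_characters(source, chunk_size=8)
for chunk in chunks:
    # Slicing the source with the stored indices recovers the chunk text.
    assert source[chunk.start_index:chunk.end_index] == chunk.text
    print(repr(chunk.text), chunk.start_index, chunk.end_index, chunk.token_count)
```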