The TokenChunker splits text into chunks based on token count, ensuring each chunk stays within specified token limits.
API Reference
To use the TokenChunker via the API, check out the API reference documentation.
Installation
TokenChunker is included in the base installation of Chonkie. No additional dependencies are required.
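Assuming a standard pip setup (Chonkie is distributed on PyPI under the package name `chonkie`), the base installation is:

```shell
pip install chonkie
```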
Initialization
```python
from chonkie import TokenChunker

# Basic initialization with default parameters
chunker = TokenChunker(
    tokenizer="gpt2",   # Supports string identifiers
    chunk_size=512,     # Maximum tokens per chunk
    chunk_overlap=128   # Overlap between chunks
)

# Using a custom tokenizer
from tokenizers import Tokenizer

custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
chunker = TokenChunker(
    tokenizer=custom_tokenizer,
    chunk_size=512,
    chunk_overlap=128
)
```
Parameters
- `tokenizer` (`Union[str, Any]`, default: `"gpt2"`): Tokenizer to use. Can be a string identifier or a tokenizer instance.
- `chunk_size` (`int`, default: `512`): Maximum number of tokens per chunk.
- `chunk_overlap` (`Union[int, float]`, default: `0`): Number or percentage of overlapping tokens between chunks.
- `return_type` (`Literal['chunks', 'texts']`, default: `"chunks"`): Whether to return chunks as `Chunk` objects or plain text strings.
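The two sizing parameters interact as a sliding window: each new chunk begins `chunk_size - overlap` tokens after the previous one, and a float `chunk_overlap` is treated as a fraction of `chunk_size`. The following is an illustrative sketch of that behavior in plain Python, not Chonkie's actual implementation (`sliding_windows` is a hypothetical helper):

```python
from typing import List, Union

def sliding_windows(tokens: List[str], chunk_size: int,
                    chunk_overlap: Union[int, float]) -> List[List[str]]:
    """Split a token list into overlapping windows of at most chunk_size.

    A float chunk_overlap is interpreted as a fraction of chunk_size,
    mirroring the int-or-float parameter described above.
    """
    overlap = (int(chunk_overlap * chunk_size)
               if isinstance(chunk_overlap, float) else chunk_overlap)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows
```

With 10 tokens, `chunk_size=4`, and `chunk_overlap=2`, this yields windows starting at offsets 0, 2, 4, and 6, each sharing two tokens with its neighbor; passing `chunk_overlap=0.5` produces the same result, since 0.5 × 4 = 2.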
Single Text Chunking
```python
text = "Some long text that needs to be chunked into smaller pieces..."

chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
```
Batch Chunking
```python
texts = [
    "First document to chunk...",
    "Second document to chunk..."
]

batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
```
Using as a Callable
```python
# Single text
chunks = chunker("Text to chunk...")

# Multiple texts
batch_chunks = chunker(["Text 1...", "Text 2..."])
```
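The callable pattern above typically amounts to `__call__` dispatching on the input type. A minimal self-contained sketch of that idea (a toy stand-in, not Chonkie's code; the whitespace split stands in for real tokenization):

```python
from typing import List, Union

class CallableChunker:
    """Toy chunker showing how __call__ can dispatch to chunk/chunk_batch."""

    def chunk(self, text: str) -> List[str]:
        # Stand-in for real chunking: split on whitespace.
        return text.split()

    def chunk_batch(self, texts: List[str]) -> List[List[str]]:
        return [self.chunk(t) for t in texts]

    def __call__(self, text: Union[str, List[str]]):
        # A single string goes to chunk(); a list goes to chunk_batch().
        if isinstance(text, str):
            return self.chunk(text)
        return self.chunk_batch(text)
```

Calling the instance with a string then behaves like `chunk`, and calling it with a list behaves like `chunk_batch`.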
Supported Tokenizers
TokenChunker supports multiple tokenizer backends:
- TikToken (recommended)

  ```python
  import tiktoken
  tokenizer = tiktoken.get_encoding("gpt2")
  ```

- AutoTikTokenizer

  ```python
  from autotiktokenizer import AutoTikTokenizer
  tokenizer = AutoTikTokenizer.from_pretrained("gpt2")
  ```

- Hugging Face Tokenizers

  ```python
  from tokenizers import Tokenizer
  tokenizer = Tokenizer.from_pretrained("gpt2")
  ```

- Transformers

  ```python
  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  ```
Return Type
TokenChunker returns chunks as `Chunk` objects. Each `Chunk` includes a `Context` field for additional metadata alongside its other attributes:

```python
@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str                   # The chunk text
    start_index: int            # Starting position in original text
    end_index: int              # Ending position in original text
    token_count: int            # Number of tokens in chunk
    context: Optional[Context]  # Contextual information (if any)
```
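Because each `Chunk` records its offsets into the original string, slicing the source with `chunk.start_index` and `chunk.end_index` reproduces `chunk.text`, which is useful when mapping chunks back to the source document. A self-contained sketch (the `= None` default on `context` and the `token_count` value here are illustrative, not taken from the library):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Context:
    text: str
    token_count: int
    start_index: Optional[int] = None
    end_index: Optional[int] = None

@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    context: Optional[Context] = None

source = "Some long text that needs to be chunked..."
# token_count is a placeholder value for this example.
chunk = Chunk(text=source[0:14], start_index=0, end_index=14, token_count=4)

# Slicing the source with the stored offsets recovers the chunk verbatim.
assert source[chunk.start_index:chunk.end_index] == chunk.text
```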