SIMD-accelerated text chunking at 100+ GB/s throughput
FastChunker uses memchunk for SIMD-accelerated boundary detection, reaching chunking speeds of 100+ GB/s. Unlike other chunkers, it limits chunks by byte size rather than token count: tokenization is skipped entirely, and that tradeoff is what enables this level of performance in high-throughput pipelines.
```python
from chonkie import FastChunker

# Split at sentence boundaries
chunker = FastChunker(
    chunk_size=70,
    delimiters=".!?\n",
)

text = """Machine learning has transformed technology.
It enables computers to learn from data.
Neural networks power many modern applications.
The field continues to evolve rapidly."""

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Text: {chunk.text}")
    print(f"Bytes: {len(chunk.text)}")
```
Pattern-Based Chunking (SentencePiece)
```python
from chonkie import FastChunker

# Split at metaspace boundaries (common in SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=10,
    pattern="▁",   # Metaspace character
    prefix=True,   # Keep ▁ at start of next chunk
)

text = "Hello▁World▁this▁is▁a▁test▁sentence"
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
```
Handling Consecutive Delimiters
```python
from chonkie import FastChunker

# Split at the START of consecutive whitespace runs
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    consecutive=True,
)

# Multiple spaces between words
text = """First paragraph with multiple sentences.  This is still the first paragraph.

Second paragraph starts here.  More content in the second paragraph."""

chunks = chunker.chunk(text)

# Without consecutive=True: a boundary may land in the middle of a run of spaces
# With consecutive=True: boundaries land at the START of each run of spaces
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```
Forward Fallback Search
```python
from chonkie import FastChunker

# Search forward if no delimiter is found in the backward window
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    forward_fallback=True,
)

text = "verylongword short"
chunks = chunker.chunk(text)

# Without forward_fallback: hard split at byte 10
# With forward_fallback: finds the space after "verylongword"
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```
Batch Processing
```python
from chonkie import FastChunker

chunker = FastChunker(chunk_size=2048)

documents = [
    "First document content here...",
    "Second document with different content...",
    "Third document for processing...",
]

# Process all documents
batch_results = chunker.chunk_batch(documents)

for doc_idx, doc_chunks in enumerate(batch_results):
    print(f"\nDocument {doc_idx + 1}: {len(doc_chunks)} chunks")
    for chunk in doc_chunks:
        print(f"  - {chunk.text[:30]}... ({len(chunk.text)} bytes)")
```
High-Throughput Pipeline
```python
from chonkie import FastChunker
import time

# Configure for maximum throughput
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n",
)

# Read a large file
with open("large_file.txt", "r") as f:
    large_text = f.read()

# Benchmark chunking speed
start = time.perf_counter()
chunks = chunker.chunk(large_text)
elapsed = time.perf_counter() - start

mb_size = len(large_text) / (1024 * 1024)
throughput = mb_size / elapsed

print(f"Processed {mb_size:.1f} MB in {elapsed*1000:.1f}ms")
print(f"Throughput: {throughput:.1f} MB/s")
print(f"Chunks: {len(chunks)}")
```
FastChunker returns a list of Chunk objects with the following structure:

```python
@dataclass
class Chunk:
    text: str          # The chunk text
    start_index: int   # Starting byte position in original text
    end_index: int     # Ending byte position in original text
    token_count: int   # Always 0 (not computed for speed)
```
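Because start_index and end_index are byte positions, a chunk can be mapped back onto the original text by slicing its UTF-8 encoding. A minimal sketch (the exact offset semantics are inferred from the field descriptions above, so treat this as illustrative):

```python
from chonkie import FastChunker

original = "Machine learning has transformed technology. It evolves rapidly."
chunker = FastChunker(chunk_size=48, delimiters=".")
chunks = chunker.chunk(original)

raw = original.encode("utf-8")
for chunk in chunks:
    # Recover each chunk by slicing the byte representation of the source text
    recovered = raw[chunk.start_index:chunk.end_index].decode("utf-8")
    print(f"[{chunk.start_index}:{chunk.end_index}] -> {recovered!r}")
```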
The token_count field is always 0 in FastChunker output.
If you need token counts, compute them separately with a tokenizer of your choice, or use a different chunker.
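For instance, token counts can be filled in after chunking with whatever tokenizer the rest of the pipeline already uses. A hedged sketch using the Hugging Face tokenizers package (the gpt2 vocabulary and the file name are illustrative assumptions, not part of FastChunker):

```python
from chonkie import FastChunker
from tokenizers import Tokenizer  # any tokenizer works; this one is illustrative

chunker = FastChunker(chunk_size=2048, delimiters=".!?\n")
tokenizer = Tokenizer.from_pretrained("gpt2")  # assumed vocabulary choice

with open("large_file.txt", "r") as f:
    chunks = chunker.chunk(f.read())

# Compute token counts after the fact; FastChunker itself leaves token_count at 0
token_counts = [len(tokenizer.encode(chunk.text).ids) for chunk in chunks]

print(f"{sum(token_counts)} tokens across {len(chunks)} chunks")
```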