FastChunker uses memchunk for SIMD-accelerated boundary detection, enabling chunking speeds of 100+ GB/s.
Unlike other chunkers, FastChunker enforces byte size limits instead of token counts. This tradeoff enables extreme performance in high-throughput pipelines.

Installation

FastChunker requires the memchunk library:
pip install chonkie[fast]

Initialization

from chonkie import FastChunker

# Basic initialization with default parameters
chunker = FastChunker(
    chunk_size=4096,      # Target size in BYTES (not tokens)
    delimiters="\n.?",    # Split at newlines, periods, question marks
)

# Split at paragraph boundaries
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n\n",
)

# Pattern-based splitting (e.g., for SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=4096,
    pattern="▁",          # Metaspace character
    prefix=True,          # Keep pattern at start of next chunk
)

Parameters

chunk_size (int, default: 4096)
  Target chunk size in bytes (not tokens).

delimiters (str, default: "\n.?")
  Single-byte delimiter characters to split on.

pattern (str, default: None)
  Multi-byte pattern to split on (overrides delimiters if set).

prefix (bool, default: False)
  If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk.

consecutive (bool, default: False)
  If True, split at the START of consecutive delimiter runs instead of the middle.

forward_fallback (bool, default: False)
  If True, search forward for a delimiter when none is found in the backward search window.
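
Of the flags above, forward_fallback is the only one without a dedicated example later in this page, so here is an illustrative sketch. It assumes only the keyword arguments listed above; the comments about where the cut lands are an assumption about how the backward search window behaves and may differ in practice.

from chonkie import FastChunker

# Hypothetical input: a long run with no delimiter until well past the byte limit
text = "x" * 200 + ". The remainder of the document continues here."

# Without forward_fallback, a window with no delimiter is presumably cut at the byte limit.
strict = FastChunker(chunk_size=64, delimiters=".")

# With forward_fallback=True, the chunker searches forward for the next delimiter
# instead of cutting mid-run when the backward search finds nothing.
lenient = FastChunker(chunk_size=64, delimiters=".", forward_fallback=True)

for name, chunker in [("strict", strict), ("forward_fallback", lenient)]:
    sizes = [len(c.text) for c in chunker.chunk(text)]
    print(f"{name}: {sizes}")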

Basic Usage

from chonkie import FastChunker

# Initialize the chunker
chunker = FastChunker(
    chunk_size=1024,
    delimiters=". \n",
)

# Chunk your text
text = "Your long document text here..."
chunks = chunker.chunk(text)

# Access chunk information
for chunk in chunks:
    print(f"Chunk: {chunk.text[:50]}...")
    print(f"Bytes: {len(chunk.text)}")
    print(f"Position: {chunk.start_index}-{chunk.end_index}")

Examples

from chonkie import FastChunker

# Split at sentence boundaries
chunker = FastChunker(
    chunk_size=70,
    delimiters=".!?\n",
)

text = """Machine learning has transformed technology.
It enables computers to learn from data.
Neural networks power many modern applications.
The field continues to evolve rapidly."""

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Text: {chunk.text}")
    print(f"Bytes: {len(chunk.text)}")

from chonkie import FastChunker

# Split at metaspace boundaries (common in SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=10,
    pattern="▁",      # Metaspace character
    prefix=True,      # Keep ▁ at start of next chunk
)

text = "Hello▁World▁this▁is▁a▁test▁sentence"
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")

from chonkie import FastChunker

# Split at START of consecutive whitespace runs
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    consecutive=True,
)

text = """First           paragraph with multiple sentences.
This is still the first paragraph.

Second paragraph starts here.
More content         in the second paragraph."""  # Multiple spaces between words
chunks = chunker.chunk(text)

# Without consecutive=True: might split in middle of "   "
# With consecutive=True: splits at START of "   "
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")

from chonkie import FastChunker

chunker = FastChunker(chunk_size=2048)

documents = [
    "First document content here...",
    "Second document with different content...",
    "Third document for processing...",
]

# Process all documents
batch_results = chunker.chunk_batch(documents)

for doc_idx, doc_chunks in enumerate(batch_results):
    print(f"\nDocument {doc_idx + 1}: {len(doc_chunks)} chunks")
    for chunk in doc_chunks:
        print(f"  - {chunk.text[:30]}... ({len(chunk.text)} bytes)")

from chonkie import FastChunker
import time

# Configure for maximum throughput
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n",
)

# Read a large file
with open("large_file.txt", "r") as f:
    large_text = f.read()

# Benchmark chunking speed
start = time.perf_counter()
chunks = chunker.chunk(large_text)
elapsed = time.perf_counter() - start

mb_size = len(large_text) / (1024 * 1024)
throughput = mb_size / elapsed

print(f"Processed {mb_size:.1f} MB in {elapsed*1000:.1f}ms")
print(f"Throughput: {throughput:.1f} MB/s")
print(f"Chunks: {len(chunks)}")

Comparison with Other Chunkers

Feature              FastChunker                 TokenChunker             SentenceChunker
Size unit            Bytes                       Tokens                   Tokens
Tokenizer required   No                          Yes                      Yes
token_count          Always 0                    Computed                 Computed
Speed                ~100+ GB/s                  Tokenizer-bound          Tokenizer-bound
Best for             High-throughput pipelines   Token-precise chunking   Semantic boundaries

When to Use FastChunker

Use FastChunker when:
  • Processing large volumes of text (>100KB documents)
  • Building high-throughput pipelines
  • Byte-level precision is acceptable
  • You don’t need exact token counts
Use other chunkers when:
  • You need precise token counts for LLM context limits
  • Working with small documents (< 1KB)
  • Complex semantic boundaries are required
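
To make that guidance concrete, the sketch below routes documents by size. It assumes chonkie's TokenChunker accepts a chunk_size in tokens with a default tokenizer, as suggested by the comparison table above; treat the exact constructor arguments as an assumption and adjust for your setup.

from chonkie import FastChunker, TokenChunker

# Byte-level chunking for large documents, token-precise chunking for small ones
fast = FastChunker(chunk_size=4096, delimiters="\n.?")
precise = TokenChunker(chunk_size=512)  # assumption: token-based chunk_size, default tokenizer

def chunk_document(text: str):
    # >100 KB of text: throughput matters more than exact token counts
    if len(text.encode("utf-8")) > 100_000:
        return fast.chunk(text)
    return precise.chunk(text)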

Return Type

FastChunker returns chunks as Chunk objects:
@dataclass
class Chunk:
    text: str           # The chunk text
    start_index: int    # Starting byte position in original text
    end_index: int      # Ending byte position in original text
    token_count: int    # Always 0 (not computed for speed)
The token_count field is always 0 in FastChunker output. If you need token counts, run a tokenizer over the chunks separately or choose a different chunker.
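
For example, here is a minimal sketch that computes token counts after chunking using the tiktoken library (an assumption; any tokenizer with an encode method works the same way):

from chonkie import FastChunker
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunker = FastChunker(chunk_size=4096, delimiters="\n.?")

chunks = chunker.chunk("Your long document text here...")

# token_count is always 0 on FastChunker output, so compute it separately
for chunk in chunks:
    n_tokens = len(enc.encode(chunk.text))
    print(f"{len(chunk.text)} chars -> {n_tokens} tokens")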