SIMD-accelerated text chunking at 100+ GB/s throughput
FastChunker uses memchunk for SIMD-accelerated boundary detection, reaching chunking speeds of 100+ GB/s. Unlike other chunkers, it limits chunks by byte size rather than token count: tokenization is skipped entirely, and that tradeoff is what enables this level of performance in high-throughput pipelines.
```python
from chonkie import FastChunker

# Split at sentence boundaries
chunker = FastChunker(
    chunk_size=70,
    delimiters=".!?\n",
)

text = """Machine learning has transformed technology.
It enables computers to learn from data.
Neural networks power many modern applications.
The field continues to evolve rapidly."""

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Text: {chunk.text}")
    print(f"Bytes: {len(chunk.text)}")
```
Pattern-Based Chunking (SentencePiece)
```python
from chonkie import FastChunker

# Split at metaspace boundaries (common in SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=10,
    pattern="▁",   # Metaspace character
    prefix=True,   # Keep ▁ at start of next chunk
)

text = "Hello▁World▁this▁is▁a▁test▁sentence"
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
```
Handling Consecutive Delimiters
```python
from chonkie import FastChunker

# Split at the START of consecutive whitespace runs
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    consecutive=True,
)

# Multiple spaces between words
text = """First paragraph with multiple sentences.  This is still the first paragraph.

Second paragraph starts here.  More content in the second paragraph."""

chunks = chunker.chunk(text)

# Without consecutive=True: a boundary may land in the middle of a run of spaces
# With consecutive=True: boundaries land at the START of each run of spaces
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```
Forward Fallback Search
```python
from chonkie import FastChunker

# Search forward if no delimiter is found in the backward window
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    forward_fallback=True,
)

text = "verylongword short"
chunks = chunker.chunk(text)

# Without forward_fallback: hard split at byte 10
# With forward_fallback: finds the space after "verylongword"
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```
Batch Processing
```python
from chonkie import FastChunker

chunker = FastChunker(chunk_size=2048)

documents = [
    "First document content here...",
    "Second document with different content...",
    "Third document for processing...",
]

# Process all documents
batch_results = chunker.chunk_batch(documents)

for doc_idx, doc_chunks in enumerate(batch_results):
    print(f"\nDocument {doc_idx + 1}: {len(doc_chunks)} chunks")
    for chunk in doc_chunks:
        print(f"  - {chunk.text[:30]}... ({len(chunk.text)} bytes)")
```
High-Throughput Pipeline
```python
from chonkie import FastChunker
import time

# Configure for maximum throughput
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n",
)

# Read a large file
with open("large_file.txt", "r") as f:
    large_text = f.read()

# Benchmark chunking speed
start = time.perf_counter()
chunks = chunker.chunk(large_text)
elapsed = time.perf_counter() - start

mb_size = len(large_text) / (1024 * 1024)
throughput = mb_size / elapsed

print(f"Processed {mb_size:.1f} MB in {elapsed*1000:.1f}ms")
print(f"Throughput: {throughput:.1f} MB/s")
print(f"Chunks: {len(chunks)}")
```
FastChunker returns a list of Chunk objects with the following structure:

```python
@dataclass
class Chunk:
    text: str          # The chunk text
    start_index: int   # Starting byte position in original text
    end_index: int     # Ending byte position in original text
    token_count: int   # Always 0 (not computed for speed)
```
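Because start_index and end_index are byte positions, a chunk can be mapped back onto the original text by slicing its UTF-8 encoding. A minimal sketch (the exact offset semantics are inferred from the field descriptions above, so treat this as illustrative):

```python
from chonkie import FastChunker

original = "Machine learning has transformed technology. It evolves rapidly."
chunker = FastChunker(chunk_size=48, delimiters=".")
chunks = chunker.chunk(original)

raw = original.encode("utf-8")
for chunk in chunks:
    # Recover each chunk by slicing the byte representation of the source text
    recovered = raw[chunk.start_index:chunk.end_index].decode("utf-8")
    print(f"[{chunk.start_index}:{chunk.end_index}] -> {recovered!r}")
```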
The token_count field is always 0 in FastChunker output.
If you need token counts, compute them separately with a tokenizer of your choice, or use a different chunker.
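For instance, token counts can be filled in after chunking with whatever tokenizer the rest of the pipeline already uses. A hedged sketch using the Hugging Face tokenizers package (the gpt2 vocabulary and the file name are illustrative assumptions, not part of FastChunker):

```python
from chonkie import FastChunker
from tokenizers import Tokenizer  # any tokenizer works; this one is illustrative

chunker = FastChunker(chunk_size=2048, delimiters=".!?\n")
tokenizer = Tokenizer.from_pretrained("gpt2")  # assumed vocabulary choice

with open("large_file.txt", "r") as f:
    chunks = chunker.chunk(f.read())

# Compute token counts after the fact; FastChunker itself leaves token_count at 0
token_counts = [len(tokenizer.encode(chunk.text).ids) for chunk in chunks]

print(f"{sum(token_counts)} tokens across {len(chunks)} chunks")
```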