# Fast Chunker

> SIMD-accelerated text chunking at 100+ GB/s throughput

The `FastChunker` uses [chonkie-core](https://github.com/chonkie-inc/chunk) for SIMD-accelerated boundary detection, enabling chunking speeds of 100+ GB/s.

Unlike other chunkers, FastChunker uses **byte size** limits instead of token counts. This tradeoff enables extreme performance for high-throughput pipelines.

## Initialization

```python Basic initialization with default parameters
from chonkie import FastChunker

chunker = FastChunker(
    chunk_size=4096,     # Target size in BYTES (not tokens)
    delimiters="\n.?",   # Split at newlines, periods, question marks
)
```

```python Split at paragraph boundaries
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n\n",
)
```

```python Pattern-based splitting (e.g., for SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=4096,
    pattern="▁",    # Metaspace character
    prefix=True,    # Keep pattern at start of next chunk
)
```

```javascript Basic initialization with default parameters
import { FastChunker } from "@chonkiejs/core";

let chunker = await FastChunker.create({
  chunkSize: 4096,      // Target size in BYTES (not tokens)
  delimiters: "\n.?",   // Split at newlines, periods, question marks
});
```

```javascript Split at paragraph boundaries
chunker = await FastChunker.create({
  chunkSize: 8192,
  delimiters: "\n\n",
});
```

```javascript Pattern-based splitting (e.g., for SentencePiece tokenizers)
chunker = await FastChunker.create({
  chunkSize: 4096,
  pattern: "▁",    // Metaspace character
  prefix: true,    // Keep pattern at start of next chunk
});
```

## Parameters

* `chunk_size`: Target chunk size in **bytes** (not tokens)
* `delimiters`: Single-byte delimiter characters to split on
* `pattern`: Multi-byte pattern to split on (overrides `delimiters` if set)
* `prefix`: If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk
* `consecutive`: If True, split at the START of consecutive delimiter runs instead of the middle
* `forward_fallback`: If True, search forward for a delimiter when none is found in the backward search window

## Basic Usage

```python
from chonkie import FastChunker

# Initialize the chunker
chunker = FastChunker(
    chunk_size=1024,
    delimiters=". \n",
)

# Chunk your text
text = "Your long document text here..."
chunks = chunker.chunk(text)

# Access chunk information
for chunk in chunks:
    print(f"Chunk: {chunk.text[:50]}...")
    # len(chunk.text) counts characters; encode to UTF-8 to count bytes
    print(f"Bytes: {len(chunk.text.encode('utf-8'))}")
    print(f"Position: {chunk.start_index}-{chunk.end_index}")
```

```javascript
import { FastChunker } from "@chonkiejs/core";

// Initialize the chunker
const chunker = await FastChunker.create({
  chunkSize: 1024,
  delimiters: ". \n",
});

// Chunk your text
const text = "Your long document text here...";
const chunks = await chunker.chunk(text);

// Access chunk information
for (const chunk of chunks) {
  console.log(`Chunk: ${chunk.text.slice(0, 50)}...`);
  // String.length counts UTF-16 code units; encode to UTF-8 to count bytes
  console.log(`Bytes: ${new TextEncoder().encode(chunk.text).length}`);
  console.log(`Position: ${chunk.startIndex}-${chunk.endIndex}`);
}
```

## Examples

```python
from chonkie import FastChunker

# Split at sentence boundaries
chunker = FastChunker(
    chunk_size=70,
    delimiters=".!?\n",
)

text = """Machine learning has transformed technology.
It enables computers to learn from data.
Neural networks power many modern applications.
The field continues to evolve rapidly."""

chunks = chunker.chunk(text)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Text: {chunk.text}")
    print(f"Bytes: {len(chunk.text.encode('utf-8'))}")
```

```python
from chonkie import FastChunker

# Split at metaspace boundaries (common in SentencePiece tokenizers)
chunker = FastChunker(
    chunk_size=10,
    pattern="▁",    # Metaspace character
    prefix=True,    # Keep ▁ at start of next chunk
)

text = "Hello▁World▁this▁is▁a▁test▁sentence"
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
```

```python
from chonkie import FastChunker

# Split at START of consecutive whitespace runs
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    consecutive=True,
)

text = """First paragraph with multiple sentences.
This is still the first paragraph.

Second paragraph starts here.
More content in the second paragraph."""  # Multiple spaces between words

chunks = chunker.chunk(text)

# Without consecutive=True: might split in the middle of a run of spaces
# With consecutive=True: splits at the START of the run
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```

```python
from chonkie import FastChunker

# Search forward if no delimiter found in backward window
chunker = FastChunker(
    chunk_size=10,
    pattern=" ",
    forward_fallback=True,
)

text = "verylongword short"
chunks = chunker.chunk(text)

# Without forward_fallback: hard split at byte 10
# With forward_fallback: finds space after "verylongword"
for chunk in chunks:
    print(f"Chunk: '{chunk.text}'")
```

```python
from chonkie import FastChunker

chunker = FastChunker(chunk_size=2048)

documents = [
    "First document content here...",
    "Second document with different content...",
    "Third document for processing...",
]

# Process all documents
batch_results = chunker.chunk_batch(documents)

for doc_idx, doc_chunks in enumerate(batch_results):
    print(f"\nDocument {doc_idx + 1}: {len(doc_chunks)} chunks")
    for chunk in doc_chunks:
        # Report the UTF-8 byte size, not the character count
        print(f"  - {chunk.text[:30]}... ({len(chunk.text.encode('utf-8'))} bytes)")
```

```python
from chonkie import FastChunker
import time

# Configure for maximum throughput
chunker = FastChunker(
    chunk_size=8192,
    delimiters="\n",
)

# Read a large file
with open("large_file.txt", "r", encoding="utf-8") as f:
    large_text = f.read()

# Benchmark chunking speed
start = time.perf_counter()
chunks = chunker.chunk(large_text)
elapsed = time.perf_counter() - start

# Measure the input in UTF-8 bytes; len(large_text) would count characters
mb_size = len(large_text.encode("utf-8")) / (1024 * 1024)
throughput = mb_size / elapsed

print(f"Processed {mb_size:.1f} MB in {elapsed*1000:.1f}ms")
print(f"Throughput: {throughput:.1f} MB/s")
print(f"Chunks: {len(chunks)}")
```

## Comparison with Other Chunkers

| Feature            | FastChunker               | TokenChunker           | SentenceChunker     |
| ------------------ | ------------------------- | ---------------------- | ------------------- |
| Size unit          | Bytes                     | Tokens                 | Tokens              |
| Tokenizer required | No                        | Yes                    | Yes                 |
| `token_count`      | Always 0                  | Computed               | Computed            |
| Speed              | ~100+ GB/s                | Tokenizer-bound        | Tokenizer-bound     |
| Best for           | High-throughput pipelines | Token-precise chunking | Semantic boundaries |

## When to Use FastChunker

**Use FastChunker when:**

* Processing large volumes of text (documents over 100 KB)
* Building high-throughput pipelines
* Byte-based size limits are acceptable
* You don't need exact token counts

**Use other chunkers when:**

* You need precise token counts for LLM context limits
* Working with small documents (under 1 KB)
* Complex semantic boundaries are required

## Return Type

FastChunker returns chunks as `Chunk` objects:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Chunk:
    text: str                      # The chunk text
    start_index: int               # Starting character position in original text
    end_index: int                 # Ending character position in original text
    token_count: int               # Always 0 (not computed for speed)
    context: Optional[str] = None  # Optional overlap context text
    embedding: Union[list[float], "np.ndarray", None] = None  # Optional embedding vector
```

The `token_count` field is always 0 in FastChunker output. If you need token counts, run a tokenizer over the chunk text separately or choose a different chunker.
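
Because `token_count` is always 0, sizes and token counts have to be computed from the chunk text after chunking. The sketch below is one way to do that, not part of the library: `whitespace_token_count` and `annotate_chunks` are hypothetical helper names, and the whitespace split is only a stand-in for whatever tokenizer your model actually uses. It also shows why byte size should be measured on the UTF-8 encoding rather than with `len()` on the string.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def annotate_chunks(
    texts: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> list[dict]:
    """Pair each chunk text with its UTF-8 byte size and a token count."""
    return [
        {
            "text": t,
            "chars": len(t),                  # character count
            "bytes": len(t.encode("utf-8")),  # byte count (what chunk_size limits)
            "tokens": count_tokens(t),        # computed separately from chunking
        }
        for t in texts
    ]


# In practice these would come from [chunk.text for chunk in chunker.chunk(doc)]
chunk_texts = [
    "Machine learning has transformed technology.",
    "naïve café",
]

rows = annotate_chunks(chunk_texts)
for row in rows:
    print(f"{row['bytes']} bytes, {row['chars']} chars, {row['tokens']} tokens")
```

To use a real tokenizer, pass any callable that maps a string to an integer count as `count_tokens`.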