# Fast Chunker

> SIMD-accelerated text chunking at 100+ GB/s throughput

The `FastChunker` uses [chonkie-core](https://github.com/chonkie-inc/chunk) for SIMD-accelerated boundary detection, enabling chunking speeds of 100+ GB/s.

<Warning>
  Unlike other chunkers, FastChunker uses **byte size** limits instead of token counts.
  This tradeoff enables extreme performance for high-throughput pipelines.
</Warning>
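
Byte limits and character counts diverge as soon as the text contains non-ASCII characters. The sketch below illustrates the difference, assuming the text is encoded as UTF-8; the helper name `utf8_len` is made up for this example and is not part of Chonkie.

```python
def utf8_len(text: str) -> int:
    """Return the size of `text` in bytes when encoded as UTF-8 (assumed encoding)."""
    return len(text.encode("utf-8"))


# ASCII, accented Latin, CJK, and the SentencePiece metaspace character
samples = ["hello", "h\u00e9llo", "日本語", "\u2581"]

for s in samples:
    # len(s) counts characters; utf8_len(s) counts bytes
    print(f"{s!r}: {len(s)} chars, {utf8_len(s)} bytes")
```

A `chunk_size` of 4096 therefore holds 4096 ASCII characters but far fewer characters of CJK text, which takes three bytes per character in UTF-8.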

## Initialization

<Tabs>
  <Tab title="Python">
    <CodeGroup>
      ```python Basic initialization with default parameters theme={"system"}
      from chonkie import FastChunker

      chunker = FastChunker(
          chunk_size=4096,      # Target size in BYTES (not tokens)
          delimiters="\n.?",    # Split at newlines, periods, question marks
      )
      ```

      ```python Split at paragraph boundaries theme={"system"}
      chunker = FastChunker(
          chunk_size=8192,
          pattern="\n\n",       # Multi-byte pattern; delimiters match single bytes only
      )
      ```

      ```python Pattern-based splitting (e.g., for SentencePiece tokenizers) theme={"system"}
      chunker = FastChunker(
          chunk_size=4096,
          pattern="▁",          # Metaspace character
          prefix=True,          # Keep pattern at start of next chunk
      )
      ```
    </CodeGroup>
  </Tab>

  <Tab title="JavaScript">
    <CodeGroup>
      ```javascript Basic initialization with default parameters theme={"system"}
      import { FastChunker } from "@chonkiejs/core";

      let chunker = await FastChunker.create({
        chunkSize: 4096,      // Target size in BYTES (not tokens)
        delimiters: "\n.?",   // Split at newlines, periods, question marks
      });
      ```

      ```javascript Split at paragraph boundaries theme={"system"}
      chunker = await FastChunker.create({
        chunkSize: 8192,
        pattern: "\n\n",      // Multi-byte pattern; delimiters match single bytes only
      });
      ```

      ```javascript Pattern-based splitting (e.g., for SentencePiece tokenizers) theme={"system"}
      chunker = await FastChunker.create({
        chunkSize: 4096,
        pattern: "▁",         // Metaspace character
        prefix: true,         // Keep pattern at start of next chunk
      });
      ```
    </CodeGroup>
  </Tab>
</Tabs>

## Parameters

<ParamField path="chunk_size" type="int" default="4096">
  Target chunk size in **bytes** (not tokens)
</ParamField>

<ParamField path="delimiters" type="str" default="\n.?">
  Single-byte delimiter characters to split on; each character in the string is treated as a separate delimiter
</ParamField>

<ParamField path="pattern" type="str" default="None">
  Multi-byte pattern to split on (overrides delimiters if set)
</ParamField>

<ParamField path="prefix" type="bool" default="False">
  If True, keep the delimiter/pattern at the start of the next chunk instead of the end of the current chunk
</ParamField>

<ParamField path="consecutive" type="bool" default="False">
  If True, split at the START of consecutive delimiter runs instead of the middle
</ParamField>

<ParamField path="forward_fallback" type="bool" default="False">
  If True, search forward for a delimiter when none is found in the backward search window
</ParamField>
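
To make the interaction between `chunk_size`, `delimiters`, `prefix`, and `forward_fallback` concrete, here is a plain-Python sketch of one way such a splitter could behave. It is an illustration only, not the chonkie-core implementation (which uses SIMD search). The function name `sketch_split`, the exact search window, and the omission of `consecutive` are all assumptions made for this example.

```python
def _last_delim(data: bytes, delims: bytes, lo: int, hi: int) -> int:
    """Index of the last delimiter byte in data[lo:hi], or -1 if none."""
    return max((data.rfind(bytes([d]), lo, hi) for d in delims), default=-1)


def _first_delim(data: bytes, delims: bytes, lo: int) -> int:
    """Index of the first delimiter byte at or after lo, or -1 if none."""
    hits = [p for p in (data.find(bytes([d]), lo) for d in delims) if p != -1]
    return min(hits, default=-1)


def sketch_split(text, chunk_size, delimiters="\n.?", prefix=False,
                 forward_fallback=False):
    """Hypothetical byte-based splitter; returns a list of byte strings."""
    data = text.encode("utf-8")          # assumption: UTF-8 input
    delims = delimiters.encode("utf-8")  # each byte is one delimiter
    chunks, start = [], 0
    while start < len(data):
        target = start + chunk_size
        if target >= len(data):          # remainder fits in one chunk
            chunks.append(data[start:])
            break
        # Backward search: last delimiter inside the current window.
        # With prefix=True, skip the window's first byte to guarantee progress.
        lo = start + 1 if prefix else start
        pos = _last_delim(data, delims, lo, target)
        if pos == -1 and forward_fallback:
            # Nothing in the window: look past the target instead.
            pos = _first_delim(data, delims, target)
            if pos == -1:                # no delimiter left at all
                chunks.append(data[start:])
                break
        if pos == -1:
            end = target                 # hard split at the byte limit
        else:
            end = pos if prefix else pos + 1
        chunks.append(data[start:end])
        start = end
    return chunks


print(sketch_split("verylongword short", 10, " "))
# [b'verylongwo', b'rd short']
print(sketch_split("verylongword short", 10, " ", forward_fallback=True))
# [b'verylongword ', b'short']
print(sketch_split("Hello World this is", 10, " ", prefix=True))
# [b'Hello', b' World', b' this is']
```

Note that a chunk produced by the forward search can exceed `chunk_size`, and a hard split can land inside a multi-byte character, which is why the sketch returns raw bytes instead of decoded strings.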

## Basic Usage

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    from chonkie import FastChunker

    # Initialize the chunker
    chunker = FastChunker(
        chunk_size=1024,
        delimiters=". \n",
    )

    # Chunk your text
    text = "Your long document text here..."
    chunks = chunker.chunk(text)

    # Access chunk information
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:50]}...")
        print(f"Bytes: {len(chunk.text.encode('utf-8'))}")
        print(f"Position: {chunk.start_index}-{chunk.end_index}")
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={"system"}
    import { FastChunker } from "@chonkiejs/core";

    // Initialize the chunker
    const chunker = await FastChunker.create({
      chunkSize: 1024,
      delimiters: ". \n",
    });

    // Chunk your text
    const text = "Your long document text here...";
    const chunks = await chunker.chunk(text);

    // Access chunk information
    for (const chunk of chunks) {
      console.log(`Chunk: ${chunk.text.slice(0, 50)}...`);
      console.log(`Bytes: ${new TextEncoder().encode(chunk.text).length}`);
      console.log(`Position: ${chunk.startIndex}-${chunk.endIndex}`);
    }
    ```
  </Tab>
</Tabs>

## Examples

<AccordionGroup>
  <Accordion title="Sentence-Based Chunking">
    ```python theme={"system"}
    from chonkie import FastChunker

    # Split at sentence boundaries
    chunker = FastChunker(
        chunk_size=70,
        delimiters=".!?\n",
    )

    text = """Machine learning has transformed technology.
    It enables computers to learn from data.
    Neural networks power many modern applications.
    The field continues to evolve rapidly."""

    chunks = chunker.chunk(text)

    for i, chunk in enumerate(chunks):
        print(f"\n--- Chunk {i+1} ---")
        print(f"Text: {chunk.text}")
        print(f"Bytes: {len(chunk.text.encode('utf-8'))}")
    ```
  </Accordion>

  <Accordion title="Pattern-Based Chunking (SentencePiece)">
    ```python theme={"system"}
    from chonkie import FastChunker

    # Split at metaspace boundaries (common in SentencePiece tokenizers)
    chunker = FastChunker(
        chunk_size=10,
        pattern="▁",      # Metaspace character
        prefix=True,      # Keep ▁ at start of next chunk
    )

    text = "Hello▁World▁this▁is▁a▁test▁sentence"
    chunks = chunker.chunk(text)

    for chunk in chunks:
        print(f"Chunk: {chunk.text}")
    ```
  </Accordion>

  <Accordion title="Handling Consecutive Delimiters">
    ```python theme={"system"}
    from chonkie import FastChunker

    # Split at START of consecutive whitespace runs
    chunker = FastChunker(
        chunk_size=10,
        pattern=" ",
        consecutive=True,
    )

    text = """First           paragraph with multiple sentences.
    This is still the first paragraph.

    Second paragraph starts here.
    More content         in the second paragraph."""  # Multiple spaces between words
    chunks = chunker.chunk(text)

    # Without consecutive=True: may split in the middle of a run of spaces
    # With consecutive=True: splits at the START of the run
    for chunk in chunks:
        print(f"Chunk: '{chunk.text}'")
    ```
  </Accordion>

  <Accordion title="Forward Fallback Search">
    ```python theme={"system"}
    from chonkie import FastChunker

    # Search forward if no delimiter found in backward window
    chunker = FastChunker(
        chunk_size=10,
        pattern=" ",
        forward_fallback=True,
    )

    text = "verylongword short"
    chunks = chunker.chunk(text)

    # Without forward_fallback: hard split at byte 10
    # With forward_fallback: finds space after "verylongword"
    for chunk in chunks:
        print(f"Chunk: '{chunk.text}'")
    ```
  </Accordion>

  <Accordion title="Batch Processing">
    ```python theme={"system"}
    from chonkie import FastChunker

    chunker = FastChunker(chunk_size=2048)

    documents = [
        "First document content here...",
        "Second document with different content...",
        "Third document for processing...",
    ]

    # Process all documents
    batch_results = chunker.chunk_batch(documents)

    for doc_idx, doc_chunks in enumerate(batch_results):
        print(f"\nDocument {doc_idx + 1}: {len(doc_chunks)} chunks")
        for chunk in doc_chunks:
            print(f"  - {chunk.text[:30]}... ({len(chunk.text.encode('utf-8'))} bytes)")
    ```
  </Accordion>

  <Accordion title="High-Throughput Pipeline">
    ```python theme={"system"}
    from chonkie import FastChunker
    import time

    # Configure for maximum throughput
    chunker = FastChunker(
        chunk_size=8192,
        delimiters="\n",
    )

    # Read a large file
    with open("large_file.txt", "r", encoding="utf-8") as f:
        large_text = f.read()

    # Benchmark chunking speed
    start = time.perf_counter()
    chunks = chunker.chunk(large_text)
    elapsed = time.perf_counter() - start

    mb_size = len(large_text.encode("utf-8")) / (1024 * 1024)  # Size in bytes, not characters
    throughput = mb_size / elapsed

    print(f"Processed {mb_size:.1f} MB in {elapsed*1000:.1f}ms")
    print(f"Throughput: {throughput:.1f} MB/s")
    print(f"Chunks: {len(chunks)}")
    ```
  </Accordion>
</AccordionGroup>

## Comparison with Other Chunkers

| Feature            | FastChunker               | TokenChunker           | SentenceChunker     |
| ------------------ | ------------------------- | ---------------------- | ------------------- |
| Size unit          | Bytes                     | Tokens                 | Tokens              |
| Tokenizer required | No                        | Yes                    | Yes                 |
| `token_count`      | Always 0                  | Computed               | Computed            |
| Speed              | 100+ GB/s                 | Tokenizer-bound        | Tokenizer-bound     |
| Best for           | High-throughput pipelines | Token-precise chunking | Semantic boundaries |

## When to Use FastChunker

**Use FastChunker when:**

* Processing large volumes of text (documents larger than 100 KB)
* Building high-throughput pipelines
* Byte-based size limits are acceptable
* You don't need exact token counts

**Use other chunkers when:**

* You need precise token counts for LLM context limits
* Working with small documents (smaller than 1 KB)
* Complex semantic boundaries are required
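
If you only need chunks that land roughly within a token budget, one option is to measure the bytes-per-token ratio on a representative sample and derive a byte-based `chunk_size` from it. The sketch below shows the arithmetic. Both `estimate_chunk_size` and `whitespace_token_count` are hypothetical helpers written for this example, not Chonkie APIs; swap in your real tokenizer's counting function.

```python
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def estimate_chunk_size(sample: str, count_tokens: Callable[[str], int],
                        target_tokens: int) -> int:
    """Estimate a byte budget that corresponds to roughly `target_tokens`."""
    n_bytes = len(sample.encode("utf-8"))
    n_tokens = count_tokens(sample)
    if n_tokens == 0:
        raise ValueError("sample produced no tokens")
    # bytes per token on the sample, scaled to the token budget
    return int(target_tokens * n_bytes / n_tokens)


sample = "one two three four"  # 18 bytes, 4 stand-in tokens
print(estimate_chunk_size(sample, whitespace_token_count, target_tokens=512))
# 2304
```

The result is an estimate, not a guarantee: the ratio varies with language and content, so leave headroom if the token limit is strict.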

## Return Type

FastChunker returns chunks as `Chunk` objects:

```python theme={"system"}
@dataclass
class Chunk:
    text: str                                           # The chunk text
    start_index: int                                    # Starting character position in original text
    end_index: int                                      # Ending character position in original text
    token_count: int                                    # Always 0 (not computed for speed)
    context: Optional[str] = None                       # Optional overlap context text
    embedding: Union[list[float], "np.ndarray", None] = None  # Optional embedding vector
```

<Note>
  The `token_count` field is always 0 in FastChunker output.
  If you need token counts, use the tokenizer separately or choose a different chunker.
</Note>
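
One way to use a tokenizer separately is to count tokens over the chunk texts after chunking and flag any chunk that exceeds your limit. This is a minimal sketch: `over_budget` and `whitespace_token_count` are hypothetical names for this example, and the whitespace counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def over_budget(chunk_texts: Iterable[str], count_tokens: Callable[[str], int],
                max_tokens: int) -> list[int]:
    """Return the indices of chunks whose token count exceeds `max_tokens`."""
    return [i for i, text in enumerate(chunk_texts)
            if count_tokens(text) > max_tokens]


texts = ["one two three", "four five six seven eight", "nine"]
print(over_budget(texts, whitespace_token_count, max_tokens=4))
# [1]
```

With FastChunker output, you would pass `(chunk.text for chunk in chunks)` as `chunk_texts`.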
