> ## Documentation Index
> Fetch the complete documentation index at: https://docs.chonkie.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Building Pipelines

> Build powerful text processing workflows with Chonkie's Pipeline API

Chonkie's Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the **CHOMP architecture**, automatically orchestrating components in the correct order.

## What is CHOMP?

CHOMP (CHOnkie's Multi-step Pipeline) is our standardized architecture for document processing:

```
Fetcher → Chef → Chunker → Refinery → Porter/Handshake
```

<Steps>
  <Step title="Fetcher">
    Retrieve raw data from files, APIs, or databases
  </Step>

  <Step title="Chef">
    Preprocess and transform raw data into Documents
  </Step>

  <Step title="Chunker">
    Split documents into manageable chunks
  </Step>

  <Step title="Refinery (Optional)">
    Post-process and enhance chunks
  </Step>

  <Step title="Porter/Handshake (Optional)">
    Export or store chunks
  </Step>
</Steps>

<Info>
  Pipelines automatically reorder components to follow CHOMP, so you can add them in any order.
</Info>

## Quick Start

### Single File Processing

```python theme={"system"}
from chonkie import Pipeline

# Build and execute pipeline
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access chunks
print(f"Created {len(doc.chunks)} chunks")
for chunk in doc.chunks:
    print(f"Chunk: {chunk.text[:50]}...")
```

### Directory Processing

Process multiple files at once:

```python theme={"system"}
# Process all markdown files in a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")
```

### Direct Text Input

Skip the fetcher and provide text directly:

```python theme={"system"}
# No fetcher needed
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Your text here"))

# Multiple texts
docs = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts=["Text 1", "Text 2", "Text 3"]))
```

### Asynchronous Execution

For high-throughput applications (e.g., web servers, batch processing), use `arun()`:

```python theme={"system"}
import asyncio

async def process_docs():
    pipe = Pipeline().chunk_with("recursive")

    # Run pipeline asynchronously
    doc = await pipe.arun(texts="Async processing is fast!")

    # Process multiple concurrently
    docs = await pipe.arun(texts=["Doc 1", "Doc 2"])

    return docs
```

## Pipeline Methods

### fetch\_from()

Fetch data from a source:

```python theme={"system"}
# Single file
.fetch_from("file", path="document.txt")

# Directory with extension filter
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])
```

### process\_with()

Process data with a chef:

```python theme={"system"}
# Text processing
.process_with("text")

# Markdown processing
.process_with("markdown")

# Table processing
.process_with("table")
```

### chunk\_with()

Chunk documents (required):

```python theme={"system"}
# Recursive chunking
.chunk_with("recursive", chunk_size=512, chunk_overlap=50)

# Semantic chunking
.chunk_with("semantic", threshold=0.8, chunk_size=1024)

# Code chunking
.chunk_with("code", chunk_size=512)
```

### refine\_with()

Refine chunks (optional, can chain multiple):

```python theme={"system"}
# Add overlap context
.refine_with("overlap", context_size=100, method="prefix")

# Add embeddings
.refine_with("embedding", model="text-embedding-3-small")
```

### export\_with()

Export chunks to formats (optional):

```python theme={"system"}
# Export to JSON
.export_with("json", file="chunks.json")

# Export to Hugging Face Datasets
.export_with("datasets", name="my-dataset")
```

### store\_in()

Store in vector databases (optional):

```python theme={"system"}
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")
```

## Advanced Examples

### RAG Knowledge Base

Build a complete RAG ingestion pipeline:

```python theme={"system"}
# Ingest documents into vector database
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              collection_name="knowledge",
              url="http://localhost:6333")
    .run())

print(f"Ingested {len(docs)} documents")
```

### Semantic Search Pipeline

Process documents with embeddings for search:

```python theme={"system"}
# Chunk with embeddings
doc = (Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic",
                threshold=0.8,
                chunk_size=1024,
                similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run())

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")
```

### Code Documentation

Process code with specialized chunking:

```python theme={"system"}
# Chunk Python files
docs = (Pipeline()
    .fetch_from("file", dir="./src", ext=[".py"])
    .chunk_with("code", chunk_size=512)
    .export_with("json", file="code_chunks.json")
    .run())

print(f"Processed {len(docs)} Python files")
```

### Markdown Processing

Handle markdown with table and code awareness:

```python theme={"system"}
# Process markdown documentation
doc = (Pipeline()
    .fetch_from("file", path="README.md")
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access markdown metadata
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Created {len(doc.chunks)} chunks")
```

## Recipe-Based Pipelines

Load pre-configured pipelines from the Chonkie Hub:

```python theme={"system"}
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
```

<Note>
  Recipes are stored in the [chonkie-ai/recipes](https://huggingface.co/datasets/chonkie-ai/recipes) repository.
</Note>

## Best Practices

<AccordionGroup>
  <Accordion title="Always specify chunk_size">
    Explicitly set `chunk_size` for predictable behavior:

    ```python theme={"system"}
    # Good - explicit size
    .chunk_with("recursive", chunk_size=512)

    # Avoid - uses defaults that may change
    .chunk_with("recursive")
    ```
  </Accordion>

  <Accordion title="Match chunkers to content type">
    Choose chunkers appropriate for your content:

    ```python theme={"system"}
    # Code files → Code chunker
    .chunk_with("code")

    # Need semantic similarity → Semantic chunker
    .chunk_with("semantic", threshold=0.8)

    # General text → Recursive chunker
    .chunk_with("recursive")
    ```
  </Accordion>

  <Accordion title="Use refineries for RAG applications">
    Add overlap refineries for better retrieval context:

    ```python theme={"system"}
    .chunk_with("recursive", chunk_size=512)
    .refine_with("overlap", context_size=100)
    ```
  </Accordion>

  <Accordion title="Filter extensions in directory mode">
    Always specify file extensions to avoid unwanted files:

    ```python theme={"system"}
    # Good - filtered
    .fetch_from("file", dir="./docs", ext=[".txt", ".md"])

    # Bad - processes everything including binaries
    .fetch_from("file", dir="./docs")
    ```
  </Accordion>

  <Accordion title="Chain refineries for complex processing">
    Multiple refineries can be chained:

    ```python theme={"system"}
    .chunk_with("recursive", chunk_size=512)
    .refine_with("overlap", context_size=50)
    .refine_with("embedding", model="text-embedding-3-small")
    ```
  </Accordion>
</AccordionGroup>

## Pipeline Validation

Pipelines validate configuration before execution:

✅ **Must have**: At least one chunker
✅ **Must have**: Fetcher OR text input via `run(texts=...)`
❌ **Cannot have**: Multiple chefs (only one allowed)

```python theme={"system"}
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive")

# ✅ Valid - has chunker and input source
Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run()

# ✅ Valid - text input, no fetcher needed
Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world")
```

## Return Values

Pipeline behavior depends on input:

* **Single file/text**: Returns `Document`
* **Multiple files/texts**: Returns `list[Document]`

```python theme={"system"}
# Single file → Document
doc = Pipeline().fetch_from("file", path="doc.txt").chunk_with("recursive").run()
assert isinstance(doc, Document)

# Directory → list[Document]
docs = Pipeline().fetch_from("file", dir="./docs").chunk_with("recursive").run()
assert isinstance(docs, list)

# Multiple texts → list[Document]
docs = Pipeline().chunk_with("recursive").run(texts=["t1", "t2"])
assert isinstance(docs, list)
```

## Error Handling

Pipelines provide clear error messages:

```python theme={"system"}
from pathlib import Path

try:
    doc = Pipeline()
        .fetch_from("file", path="missing.txt")
        .chunk_with("recursive")
        .run()
except FileNotFoundError as e:
    print(f"File not found: {e}")
except ValueError as e:
    print(f"Configuration error: {e}")
except RuntimeError as e:
    print(f"Pipeline execution failed: {e}")
```

## Component Overview

### Available Components

Explore each component type:

<CardGroup cols={2}>
  <Card title="Fetchers" icon="download" href="/oss/fetchers/overview">
    Connect to data sources (files, APIs, databases)
  </Card>

  <Card title="Chefs" icon="hat-chef" href="/oss/chefs/overview">
    Preprocess text, markdown, tables, etc.
  </Card>

  <Card title="Chunkers" icon="scissors" href="/oss/chunkers/overview">
    Split text with various strategies
  </Card>

  <Card title="Refineries" icon="wand-magic-sparkles" href="/oss/refinery/overview">
    Add overlap, embeddings, and more
  </Card>

  <Card title="Porters" icon="file-export" href="/oss/porters/overview">
    Export to JSON, Datasets, etc.
  </Card>

  <Card title="Handshakes" icon="database" href="/oss/handshakes/overview">
    Store in Chroma, Qdrant, Pinecone, etc.
  </Card>
</CardGroup>

## What's Next?

<Steps>
  <Step title="Explore Fetchers">
    Learn how to connect different data sources in [Fetchers](/oss/fetchers/overview)
  </Step>

  <Step title="Choose Your Chunker">
    Find the right chunking strategy in [Chunkers](/oss/chunkers/overview)
  </Step>

  <Step title="Enhance with Refineries">
    Improve chunk quality in [Refineries](/oss/refinery/overview)
  </Step>

  <Step title="Store Your Chunks">
    Ingest into vector databases with [Handshakes](/oss/handshakes/overview)
  </Step>
</Steps>
