Chonkie’s Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the CHOMP architecture, automatically orchestrating components in the correct order.

What is CHOMP?

CHOMP (CHOnkie’s Multi-step Pipeline) is our standardized architecture for document processing:
Fetcher → Chef → Chunker → Refinery → Porter/Handshake
1. Fetcher: Retrieve raw data from files, APIs, or databases
2. Chef: Preprocess and transform raw data into Documents
3. Chunker: Split documents into manageable chunks
4. Refinery (Optional): Post-process and enhance chunks
5. Porter/Handshake (Optional): Export or store chunks
Pipelines automatically reorder components to follow CHOMP, so you can add them in any order.
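To illustrate the idea, here is a minimal, hypothetical sketch of stage-based reordering (this is not Chonkie's actual implementation): each component is tagged with its CHOMP stage, and a stable sort puts the stages into canonical order regardless of the order they were added.

```python
# Hypothetical sketch of CHOMP reordering; not Chonkie's internal code.
# Each component is a (stage, name) pair; stages map to a canonical rank.
CHOMP_ORDER = {
    "fetcher": 0,
    "chef": 1,
    "chunker": 2,
    "refinery": 3,
    "porter": 4,
    "handshake": 4,
}

def reorder(components):
    """Stable-sort components into CHOMP stage order."""
    return sorted(components, key=lambda c: CHOMP_ORDER[c[0]])

# Components added out of order...
added = [("refinery", "overlap"), ("fetcher", "file"), ("chunker", "recursive")]
# ...still run as Fetcher -> Chunker -> Refinery
ordered = reorder(added)
```

Because the sort is stable, two refineries added in sequence keep their relative order, which matters when chaining, say, an overlap refinery before an embedding refinery.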

Quick Start

Single File Processing

from chonkie import Pipeline

# Build and execute pipeline
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access chunks
print(f"Created {len(doc.chunks)} chunks")
for chunk in doc.chunks:
    print(f"Chunk: {chunk.text[:50]}...")

Directory Processing

Process multiple files at once:
# Process all markdown files in a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")

Direct Text Input

Skip the fetcher and provide text directly:
# No fetcher needed
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Your text here"))

# Multiple texts
docs = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts=["Text 1", "Text 2", "Text 3"]))

Asynchronous Execution

For high-throughput applications (e.g., web servers, batch processing), use arun():
import asyncio

async def process_docs():
    pipe = Pipeline().chunk_with("recursive")

    # Run pipeline asynchronously
    doc = await pipe.arun(texts="Async processing is fast!")

    # Process multiple concurrently
    docs = await pipe.arun(texts=["Doc 1", "Doc 2"])

    return docs

Pipeline Methods

fetch_from()

Fetch data from a source:
# Single file
.fetch_from("file", path="document.txt")

# Directory with extension filter
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

process_with()

Process data with a chef:
# Text processing
.process_with("text")

# Markdown processing
.process_with("markdown")

# Table processing
.process_with("table")

chunk_with()

Chunk documents (required):
# Recursive chunking
.chunk_with("recursive", chunk_size=512, chunk_overlap=50)

# Semantic chunking
.chunk_with("semantic", threshold=0.8, chunk_size=1024)

# Code chunking
.chunk_with("code", chunk_size=512)

refine_with()

Refine chunks (optional, can chain multiple):
# Add overlap context
.refine_with("overlap", context_size=100, method="prefix")

# Add embeddings
.refine_with("embedding", model="text-embedding-3-small")

export_with()

Export chunks to formats (optional):
# Export to JSON
.export_with("json", file="chunks.json")

# Export to Hugging Face Datasets
.export_with("datasets", name="my-dataset")

store_in()

Store in vector databases (optional):
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")

Advanced Examples

RAG Knowledge Base

Build a complete RAG ingestion pipeline:
# Ingest documents into vector database
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              collection_name="knowledge",
              url="http://localhost:6333")
    .run())

print(f"Ingested {len(docs)} documents")

Semantic Search Pipeline

Process documents with embeddings for search:
# Chunk with embeddings
doc = (Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic",
                threshold=0.8,
                chunk_size=1024,
                similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run())

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")

Code Documentation

Process code with specialized chunking:
# Chunk Python files
docs = (Pipeline()
    .fetch_from("file", dir="./src", ext=[".py"])
    .chunk_with("code", chunk_size=512)
    .export_with("json", file="code_chunks.json")
    .run())

print(f"Processed {len(docs)} Python files")

Markdown Processing

Handle markdown with table and code awareness:
# Process markdown documentation
doc = (Pipeline()
    .fetch_from("file", path="README.md")
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access markdown metadata
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Created {len(doc.chunks)} chunks")

Recipe-Based Pipelines

Load pre-configured pipelines from the Chonkie Hub:
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
Recipes are stored in the chonkie-ai/recipes repository.

Best Practices

Explicitly set chunk_size for predictable behavior:
# Good - explicit size
.chunk_with("recursive", chunk_size=512)

# Avoid - uses defaults that may change
.chunk_with("recursive")
Choose chunkers appropriate for your content:
# Code files → Code chunker
.chunk_with("code")

# Need semantic similarity → Semantic chunker
.chunk_with("semantic", threshold=0.8)

# General text → Recursive chunker
.chunk_with("recursive")
Add overlap refineries for better retrieval context:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=100)
Always specify file extensions to avoid unwanted files:
# Good - filtered
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

# Bad - processes everything including binaries
.fetch_from("file", dir="./docs")
Multiple refineries can be chained:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=50)
.refine_with("embedding", model="text-embedding-3-small")

Pipeline Validation

Pipelines validate their configuration before execution:
  • Must have: at least one chunker
  • Must have: a fetcher OR text input via run(texts=...)
  • Cannot have: multiple chefs (only one is allowed)
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
# ❌ Invalid - multiple chefs
(Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive"))

# ✅ Valid - has chunker and input source
(Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run())

# ✅ Valid - text input, no fetcher needed
(Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world"))
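The three rules above can be sketched as a small standalone validator. This is an illustrative reimplementation, not Chonkie's actual validation code, and the component representation is an assumption made for the example:

```python
# Hypothetical sketch of the validation rules; not Chonkie's internal code.
# Components are modeled as (stage, name) pairs for illustration.
def validate(components, has_texts=False):
    """Check the three CHOMP configuration rules before execution."""
    kinds = [stage for stage, _ in components]
    # Rule 1: at least one chunker is required
    if kinds.count("chunker") < 1:
        raise ValueError("Pipeline must include at least one chunker")
    # Rule 2: an input source is required (fetcher or direct texts)
    if "fetcher" not in kinds and not has_texts:
        raise ValueError("Provide a fetcher or pass texts to run()")
    # Rule 3: at most one chef is allowed
    if kinds.count("chef") > 1:
        raise ValueError("Only one chef is allowed per pipeline")
    return True

# Valid: a chunker plus direct text input
validate([("chunker", "recursive")], has_texts=True)
```

Running validation eagerly, before any fetching or chunking starts, is what lets misconfigured pipelines fail fast with a clear message instead of partway through a long ingestion job.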

Return Values

Pipeline behavior depends on input:
  • Single file/text: Returns Document
  • Multiple files/texts: Returns list[Document]
# Single file → Document
doc = Pipeline().fetch_from("file", path="doc.txt").chunk_with("recursive").run()
assert isinstance(doc, Document)

# Directory → list[Document]
docs = Pipeline().fetch_from("file", dir="./docs").chunk_with("recursive").run()
assert isinstance(docs, list)

# Multiple texts → list[Document]
docs = Pipeline().chunk_with("recursive").run(texts=["t1", "t2"])
assert isinstance(docs, list)

Error Handling

Pipelines provide clear error messages:
try:
    doc = (Pipeline()
        .fetch_from("file", path="missing.txt")
        .chunk_with("recursive")
        .run())
except FileNotFoundError as e:
    print(f"File not found: {e}")
except ValueError as e:
    print(f"Configuration error: {e}")
except RuntimeError as e:
    print(f"Pipeline execution failed: {e}")

Component Overview

Available Components

Explore each component type:

Fetchers

Connect to data sources (files, APIs, databases)

Chefs

Preprocess text, markdown, tables, etc.

Chunkers

Split text with various strategies

Refineries

Add overlap, embeddings, and more

Porters

Export to JSON, Datasets, etc.

Handshakes

Store in Chroma, Qdrant, Pinecone, etc.

What’s Next?

1. Explore Fetchers: Learn how to connect different data sources in Fetchers
2. Choose Your Chunker: Find the right chunking strategy in Chunkers
3. Enhance with Refineries: Improve chunk quality in Refineries
4. Store Your Chunks: Ingest into vector databases with Handshakes