Chonkie’s Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the CHOMP architecture, automatically orchestrating components in the correct order.

What is CHOMP?

CHOMP (CHOnkie’s Multi-step Pipeline) is our standardized architecture for document processing:
Fetcher → Chef → Chunker → Refinery → Porter/Handshake
1. Fetcher: Retrieve raw data from files, APIs, or databases
2. Chef: Preprocess and transform raw data into Documents
3. Chunker: Split documents into manageable chunks
4. Refinery (Optional): Post-process and enhance chunks
5. Porter/Handshake (Optional): Export or store chunks
Pipelines automatically reorder components to follow CHOMP, so you can add them in any order.
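
For example, the following sketch declares the chunker first and the fetcher last; because components are reordered automatically, it behaves the same as the Quick Start pipeline below:

from chonkie.pipeline import Pipeline

# Declared out of CHOMP order: chunker first, fetcher last.
# The pipeline still executes Fetcher → Chef → Chunker.
doc = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .process_with("text")
    .fetch_from("file", path="document.txt")
    .run())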

Quick Start

Single File Processing

from chonkie.pipeline import Pipeline

# Build and execute pipeline
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access chunks
print(f"Created {len(doc.chunks)} chunks")
for chunk in doc.chunks:
    print(f"Chunk: {chunk.text[:50]}...")

Directory Processing

Process multiple files at once:
# Process all markdown files in a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")

Direct Text Input

Skip the fetcher and provide text directly:
# No fetcher needed
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Your text here"))

# Multiple texts
docs = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts=["Text 1", "Text 2", "Text 3"]))

Pipeline Methods

fetch_from()

Fetch data from a source:
# Single file
.fetch_from("file", path="document.txt")

# Directory with extension filter
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

process_with()

Process data with a chef:
# Text processing
.process_with("text")

# Markdown processing
.process_with("markdown")

# Table processing
.process_with("table")

chunk_with()

Chunk documents (required):
# Recursive chunking
.chunk_with("recursive", chunk_size=512, chunk_overlap=50)

# Semantic chunking
.chunk_with("semantic", threshold=0.8, chunk_size=1024)

# Code chunking
.chunk_with("code", chunk_size=512)

refine_with()

Refine chunks (optional, can chain multiple):
# Add overlap context
.refine_with("overlap", context_size=100, method="prefix")

# Add embeddings
.refine_with("embedding", model="text-embedding-3-small")

export_with()

Export chunks to output formats (optional):
# Export to JSON
.export_with("json", file="chunks.json")

# Export to Hugging Face Datasets
.export_with("datasets", name="my-dataset")

store_in()

Store in vector databases (optional):
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")

Advanced Examples

RAG Knowledge Base

Build a complete RAG ingestion pipeline:
# Ingest documents into vector database
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              collection_name="knowledge",
              url="http://localhost:6333")
    .run())

print(f"Ingested {len(docs)} documents")

Semantic Search Pipeline

Process documents with embeddings for search:
# Chunk with embeddings
doc = (Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic",
                threshold=0.8,
                chunk_size=1024,
                similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run())

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")

Code Documentation

Process code with specialized chunking:
# Chunk Python files
docs = (Pipeline()
    .fetch_from("file", dir="./src", ext=[".py"])
    .chunk_with("code", chunk_size=512)
    .export_with("json", file="code_chunks.json")
    .run())

print(f"Processed {len(docs)} Python files")

Markdown Processing

Handle markdown with table and code awareness:
# Process markdown documentation
doc = (Pipeline()
    .fetch_from("file", path="README.md")
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access markdown metadata
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Created {len(doc.chunks)} chunks")

Recipe-Based Pipelines

Load pre-configured pipelines from the Chonkie Hub:
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
Recipes are stored in the chonkie-ai/recipes repository.

Best Practices

Explicitly set chunk_size for predictable behavior:
# Good - explicit size
.chunk_with("recursive", chunk_size=512)

# Avoid - uses defaults that may change
.chunk_with("recursive")
Choose chunkers appropriate for your content:
# Code files → Code chunker
.chunk_with("code")

# Need semantic similarity → Semantic chunker
.chunk_with("semantic", threshold=0.8)

# General text → Recursive chunker
.chunk_with("recursive")
Add overlap refineries for better retrieval context:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=100)
Always specify file extensions to avoid unwanted files:
# Good - filtered
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

# Bad - processes everything including binaries
.fetch_from("file", dir="./docs")
Multiple refineries can be chained:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=50)
.refine_with("embedding", model="text-embedding-3-small")

Pipeline Validation

Pipelines validate configuration before execution:
  • Must have: at least one chunker
  • Must have: a fetcher OR text input via run(texts=...)
  • Cannot have: multiple chefs (only one allowed)
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
(Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive"))

# ✅ Valid - has chunker and input source
(Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run())

# ✅ Valid - text input, no fetcher needed
(Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world"))

Return Values

Pipeline behavior depends on input:
  • Single file/text: Returns Document
  • Multiple files/texts: Returns List[Document]
# Single file → Document
doc = Pipeline().fetch_from("file", path="doc.txt").chunk_with("recursive").run()
assert isinstance(doc, Document)

# Directory → List[Document]
docs = Pipeline().fetch_from("file", dir="./docs").chunk_with("recursive").run()
assert isinstance(docs, list)

# Multiple texts → List[Document]
docs = Pipeline().chunk_with("recursive").run(texts=["t1", "t2"])
assert isinstance(docs, list)
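
If downstream code should not branch on the return type, a small normalizing helper keeps iteration uniform. The as_documents function below is an illustrative pattern, not part of Chonkie's API:

def as_documents(result):
    # Wrap a single Document in a list so callers can always
    # iterate, however run() was invoked.
    return result if isinstance(result, list) else [result]

result = Pipeline().chunk_with("recursive").run(texts="Hello world")
for doc in as_documents(result):
    print(f"{len(doc.chunks)} chunks")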

Error Handling

Pipelines provide clear error messages:
try:
    doc = (Pipeline()
        .fetch_from("file", path="missing.txt")
        .chunk_with("recursive")
        .run())
except FileNotFoundError as e:
    print(f"File not found: {e}")
except ValueError as e:
    print(f"Configuration error: {e}")
except RuntimeError as e:
    print(f"Pipeline execution failed: {e}")


What’s Next?

1. Explore Fetchers: Learn how to connect different data sources in Fetchers
2. Choose Your Chunker: Find the right chunking strategy in Chunkers
3. Enhance with Refineries: Improve chunk quality in Refineries
4. Store Your Chunks: Ingest into vector databases with Handshakes