Chonkie’s Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the CHOMP architecture, automatically orchestrating components in the correct order.

What is CHOMP?

CHOMP (CHOnkie’s Multi-step Pipeline) is our standardized architecture for document processing:
Fetcher → Chef → Chunker → Refinery → Porter/Handshake
1. Fetcher: Retrieve raw data from files, APIs, or databases
2. Chef: Preprocess and transform raw data into Documents
3. Chunker: Split documents into manageable chunks
4. Refinery (Optional): Post-process and enhance chunks
5. Porter/Handshake (Optional): Export or store chunks
Pipelines automatically reorder components to follow CHOMP, so you can add them in any order.
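To illustrate the idea, here is a minimal, hypothetical sketch of stage-based reordering (this is not Chonkie's actual implementation): each component is tagged with its CHOMP stage, and a stable sort puts the stages into canonical order regardless of the order they were added.

```python
# Hypothetical sketch of CHOMP reordering; not Chonkie's internal code.
# Each component is a (stage, name) pair; stages map to a canonical rank.
CHOMP_ORDER = {
    "fetcher": 0,
    "chef": 1,
    "chunker": 2,
    "refinery": 3,
    "porter": 4,
    "handshake": 4,
}

def reorder(components):
    """Stable-sort components into CHOMP stage order."""
    return sorted(components, key=lambda c: CHOMP_ORDER[c[0]])

# Components added out of order...
added = [("refinery", "overlap"), ("fetcher", "file"), ("chunker", "recursive")]
# ...still run as Fetcher -> Chunker -> Refinery
ordered = reorder(added)
```

Because the sort is stable, two refineries added in sequence keep their relative order, which matters when chaining, say, an overlap refinery before an embedding refinery.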

Quick Start

Single File Processing

from chonkie import Pipeline

# Build and execute pipeline
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access chunks
print(f"Created {len(doc.chunks)} chunks")
for chunk in doc.chunks:
    print(f"Chunk: {chunk.text[:50]}...")

Directory Processing

Process multiple files at once:
# Process all markdown files in a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")

Direct Text Input

Skip the fetcher and provide text directly:
# No fetcher needed
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Your text here"))

# Multiple texts
docs = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts=["Text 1", "Text 2", "Text 3"]))

Asynchronous Execution

For high-throughput applications (e.g., web servers, batch processing), use arun():
import asyncio

async def process_docs():
    pipe = Pipeline().chunk_with("recursive")

    # Run pipeline asynchronously
    doc = await pipe.arun(texts="Async processing is fast!")

    # Process multiple concurrently
    docs = await pipe.arun(texts=["Doc 1", "Doc 2"])

    return docs

Pipeline Methods

fetch_from()

Fetch data from a source:
# Single file
.fetch_from("file", path="document.txt")

# Directory with extension filter
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

process_with()

Process data with a chef:
# Text processing
.process_with("text")

# Markdown processing
.process_with("markdown")

# Table processing
.process_with("table")

chunk_with()

Chunk documents (required):
# Recursive chunking
.chunk_with("recursive", chunk_size=512, chunk_overlap=50)

# Semantic chunking
.chunk_with("semantic", threshold=0.8, chunk_size=1024)

# Code chunking
.chunk_with("code", chunk_size=512)

refine_with()

Refine chunks (optional, can chain multiple):
# Add overlap context
.refine_with("overlap", context_size=100, method="prefix")

# Add embeddings
.refine_with("embedding", model="text-embedding-3-small")

export_with()

Export chunks to formats (optional):
# Export to JSON
.export_with("json", file="chunks.json")

# Export to Hugging Face Datasets
.export_with("datasets", name="my-dataset")

store_in()

Store in vector databases (optional):
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")

Advanced Examples

RAG Knowledge Base

Build a complete RAG ingestion pipeline:
# Ingest documents into vector database
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant",
              collection_name="knowledge",
              url="http://localhost:6333")
    .run())

print(f"Ingested {len(docs)} documents")

Semantic Search Pipeline

Process documents with embeddings for search:
# Chunk with embeddings
doc = (Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic",
                threshold=0.8,
                chunk_size=1024,
                similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run())

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")

Code Documentation

Process code with specialized chunking:
# Chunk Python files
docs = (Pipeline()
    .fetch_from("file", dir="./src", ext=[".py"])
    .chunk_with("code", chunk_size=512)
    .export_with("json", file="code_chunks.json")
    .run())

print(f"Processed {len(docs)} Python files")

Markdown Processing

Handle markdown with table and code awareness:
# Process markdown documentation
doc = (Pipeline()
    .fetch_from("file", path="README.md")
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access markdown metadata
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Created {len(doc.chunks)} chunks")

Recipe-Based Pipelines

Load pre-configured pipelines from the Chonkie Hub:
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
Recipes are stored in the chonkie-ai/recipes repository.

Best Practices

Explicitly set chunk_size for predictable behavior:
# Good - explicit size
.chunk_with("recursive", chunk_size=512)

# Avoid - uses defaults that may change
.chunk_with("recursive")
Choose chunkers appropriate for your content:
# Code files → Code chunker
.chunk_with("code")

# Need semantic similarity → Semantic chunker
.chunk_with("semantic", threshold=0.8)

# General text → Recursive chunker
.chunk_with("recursive")
Add overlap refineries for better retrieval context:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=100)
Always specify file extensions to avoid unwanted files:
# Good - filtered
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

# Bad - processes everything including binaries
.fetch_from("file", dir="./docs")
Multiple refineries can be chained:
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=50)
.refine_with("embedding", model="text-embedding-3-small")

Pipeline Validation

Pipelines validate their configuration before execution:
  • Must have: at least one chunker
  • Must have: a fetcher OR text input via run(texts=...)
  • Cannot have: multiple chefs (only one is allowed)
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
# ❌ Invalid - multiple chefs
(Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive"))

# ✅ Valid - has chunker and input source
(Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run())

# ✅ Valid - text input, no fetcher needed
(Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world"))
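The three rules above can be sketched as a small standalone validator. This is an illustrative reimplementation, not Chonkie's actual validation code, and the component representation is an assumption made for the example:

```python
# Hypothetical sketch of the validation rules; not Chonkie's internal code.
# Components are modeled as (stage, name) pairs for illustration.
def validate(components, has_texts=False):
    """Check the three CHOMP configuration rules before execution."""
    kinds = [stage for stage, _ in components]
    # Rule 1: at least one chunker is required
    if kinds.count("chunker") < 1:
        raise ValueError("Pipeline must include at least one chunker")
    # Rule 2: an input source is required (fetcher or direct texts)
    if "fetcher" not in kinds and not has_texts:
        raise ValueError("Provide a fetcher or pass texts to run()")
    # Rule 3: at most one chef is allowed
    if kinds.count("chef") > 1:
        raise ValueError("Only one chef is allowed per pipeline")
    return True

# Valid: a chunker plus direct text input
validate([("chunker", "recursive")], has_texts=True)
```

Running validation eagerly, before any fetching or chunking starts, is what lets misconfigured pipelines fail fast with a clear message instead of partway through a long ingestion job.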

Return Values

Pipeline behavior depends on input:
  • Single file/text: Returns Document
  • Multiple files/texts: Returns list[Document]
# Single file → Document
doc = Pipeline().fetch_from("file", path="doc.txt").chunk_with("recursive").run()
assert isinstance(doc, Document)

# Directory → list[Document]
docs = Pipeline().fetch_from("file", dir="./docs").chunk_with("recursive").run()
assert isinstance(docs, list)

# Multiple texts → list[Document]
docs = Pipeline().chunk_with("recursive").run(texts=["t1", "t2"])
assert isinstance(docs, list)

Error Handling

Pipelines provide clear error messages:
try:
    doc = (Pipeline()
        .fetch_from("file", path="missing.txt")
        .chunk_with("recursive")
        .run())
except FileNotFoundError as e:
    print(f"File not found: {e}")
except ValueError as e:
    print(f"Configuration error: {e}")
except RuntimeError as e:
    print(f"Pipeline execution failed: {e}")

Component Overview

Available Components

Explore each component type:

Fetchers

Connect to data sources (files, APIs, databases)

Chefs

Preprocess text, markdown, tables, etc.

Chunkers

Split text with various strategies

Refineries

Add overlap, embeddings, and more

Porters

Export to JSON, Datasets, etc.

Handshakes

Store in Chroma, Qdrant, Pinecone, etc.

What’s Next?

1. Explore Fetchers: Learn how to connect different data sources in Fetchers
2. Choose Your Chunker: Find the right chunking strategy in Chunkers
3. Enhance with Refineries: Improve chunk quality in Refineries
4. Store Your Chunks: Ingest into vector databases with Handshakes