# Building Pipelines

> Build powerful text processing workflows with Chonkie's Pipeline API

Chonkie's Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the **CHOMP architecture**, automatically orchestrating components in the correct order.

## What is CHOMP?

CHOMP (CHOnkie's Multi-step Pipeline) is our standardized architecture for document processing:

```
Fetcher → Chef → Chunker → Refinery → Porter/Handshake
```

* **Fetcher**: Retrieve raw data from files, APIs, or databases
* **Chef**: Preprocess and transform raw data into Documents
* **Chunker**: Split documents into manageable chunks
* **Refinery**: Post-process and enhance chunks
* **Porter/Handshake**: Export or store chunks

Pipelines automatically reorder components to follow CHOMP, so you can add them in any order.

## Quick Start

### Single File Processing

```python
from chonkie import Pipeline

# Build and execute pipeline
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access chunks
print(f"Created {len(doc.chunks)} chunks")
for chunk in doc.chunks:
    print(f"Chunk: {chunk.text[:50]}...")
```

### Directory Processing

Process multiple files at once:

```python
# Process all Markdown and text files in a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")
```

### Direct Text Input

Skip the fetcher and provide text directly:

```python
# No fetcher needed
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Your text here"))

# Multiple texts
docs = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts=["Text 1", "Text 2", "Text 3"]))
```

### Asynchronous Execution

For high-throughput applications (e.g., web servers, batch processing), use `arun()`:

```python
import asyncio

from chonkie import Pipeline

async def process_docs():
    pipe = Pipeline().chunk_with("recursive")

    # Run pipeline asynchronously
    doc = await pipe.arun(texts="Async processing is fast!")

    # Process multiple concurrently
    docs = await pipe.arun(texts=["Doc 1", "Doc 2"])
    return docs

# Drive the coroutine from synchronous code
docs = asyncio.run(process_docs())
```

## Pipeline Methods

### fetch\_from()

Fetch data from a source:

```python
# Single file
.fetch_from("file", path="document.txt")

# Directory with extension filter
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])
```

### process\_with()

Process data with a chef:

```python
# Text processing
.process_with("text")

# Markdown processing
.process_with("markdown")

# Table processing
.process_with("table")
```

### chunk\_with()

Chunk documents (required):

```python
# Recursive chunking
.chunk_with("recursive", chunk_size=512)

# Semantic chunking
.chunk_with("semantic", threshold=0.8, chunk_size=1024)

# Code chunking
.chunk_with("code", chunk_size=512)
```

### refine\_with()

Refine chunks (optional, can chain multiple):

```python
# Add overlap context
.refine_with("overlap", context_size=100, method="prefix")

# Add embeddings
.refine_with("embedding", model="text-embedding-3-small")
```

### export\_with()

Export chunks to formats (optional):

```python
# Export to JSON
.export_with("json", file="chunks.json")

# Export to Hugging Face Datasets
.export_with("datasets", name="my-dataset")
```

### store\_in()

Store in vector databases (optional):

```python
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")
```

## Advanced Examples

### RAG Knowledge Base

Build a complete RAG ingestion pipeline:

```python
# Ingest documents into vector database
docs = (Pipeline()
    .fetch_from("file", dir="./knowledge_base", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="knowledge", url="http://localhost:6333")
    .run())

print(f"Ingested {len(docs)} documents")
```

### Semantic Search Pipeline

Process documents with embeddings for search:

```python
# Chunk with embeddings
doc = (Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024, similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run())

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")
```

### Code Documentation

Process code with specialized chunking:

```python
# Chunk Python files
docs = (Pipeline()
    .fetch_from("file", dir="./src", ext=[".py"])
    .chunk_with("code", chunk_size=512)
    .export_with("json", file="code_chunks.json")
    .run())

print(f"Processed {len(docs)} Python files")
```

### Markdown Processing

Handle markdown with table and code awareness:

```python
# Process markdown documentation
doc = (Pipeline()
    .fetch_from("file", path="README.md")
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=512)
    .run())

# Access markdown metadata
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Created {len(doc.chunks)} chunks")
```

## Recipe-Based Pipelines

Load pre-configured pipelines from the Chonkie Hub:

```python
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
```

Recipes are stored in the [chonkie-ai/recipes](https://huggingface.co/datasets/chonkie-ai/recipes) repository.

## Best Practices

Explicitly set `chunk_size` for predictable behavior:

```python
# Good - explicit size
.chunk_with("recursive", chunk_size=512)

# Avoid - uses defaults that may change
.chunk_with("recursive")
```

Choose chunkers appropriate for your content:

```python
# Code files → Code chunker
.chunk_with("code")

# Need semantic similarity → Semantic chunker
.chunk_with("semantic", threshold=0.8)

# General text → Recursive chunker
.chunk_with("recursive")
```

Add overlap refineries for better retrieval context:

```python
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=100)
```

Always specify file extensions to avoid unwanted files:

```python
# Good - filtered
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

# Bad - processes everything including binaries
.fetch_from("file", dir="./docs")
```

Multiple refineries can be chained:

```python
.chunk_with("recursive", chunk_size=512)
.refine_with("overlap", context_size=50)
.refine_with("embedding", model="text-embedding-3-small")
```

## Pipeline Validation

Pipelines validate configuration before execution:

* ✅ **Must have**: At least one chunker
* ✅ **Must have**: Fetcher OR text input via `run(texts=...)`
* ❌ **Cannot have**: Multiple chefs (only one allowed)

```python
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
(Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive"))

# ✅ Valid - has chunker and input source
(Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run())

# ✅ Valid - text input, no fetcher needed
(Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world"))
```

## Return Values

The return type depends on the input:

* **Single file/text**: Returns `Document`
* **Multiple files/texts**: Returns `list[Document]`

```python
# Assumes Pipeline and Document have already been imported

# Single file → Document
doc = Pipeline().fetch_from("file", path="doc.txt").chunk_with("recursive").run()
assert isinstance(doc, Document)

# Directory → list[Document]
docs = Pipeline().fetch_from("file", dir="./docs").chunk_with("recursive").run()
assert isinstance(docs, list)

# Multiple texts → list[Document]
docs = Pipeline().chunk_with("recursive").run(texts=["t1", "t2"])
assert isinstance(docs, list)
```

## Error Handling

Pipelines raise standard Python exceptions with clear messages, so each failure mode can be caught separately:

```python
from chonkie import Pipeline

try:
    # Parentheses let the chained calls span multiple lines
    doc = (Pipeline()
        .fetch_from("file", path="missing.txt")
        .chunk_with("recursive")
        .run())
except FileNotFoundError as e:
    print(f"File not found: {e}")
except ValueError as e:
    print(f"Configuration error: {e}")
except RuntimeError as e:
    print(f"Pipeline execution failed: {e}")
```

## Component Overview

### Available Components

Explore each component type:

* **Fetchers**: Connect to data sources (files, APIs, databases)
* **Chefs**: Preprocess text, markdown, tables, etc.
* **Chunkers**: Split text with various strategies
* **Refineries**: Add overlap, embeddings, and more
* **Porters**: Export to JSON, Datasets, etc.
* **Handshakes**: Store in Chroma, Qdrant, Pinecone, etc.

## What's Next?

* Learn how to connect different data sources in [Fetchers](/oss/fetchers/overview)
* Find the right chunking strategy in [Chunkers](/oss/chunkers/overview)
* Improve chunk quality in [Refineries](/oss/refinery/overview)
* Ingest into vector databases with [Handshakes](/oss/handshakes/overview)
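
As a closing illustration, the reordering and validation rules described under CHOMP and Pipeline Validation can be pictured with a small toy model. This sketch is not Chonkie's implementation: `ToyPipeline`, `STAGE_ORDER`, `add`, `ordered`, and `validate` are hypothetical names invented here, and the only things taken from this page are the stage order in the CHOMP diagram and the three validation rules.

```python
from dataclasses import dataclass, field

# Stage order from the CHOMP diagram:
# Fetcher → Chef → Chunker → Refinery → Porter/Handshake.
# Porters ("export") and Handshakes ("store") share the final slot.
STAGE_ORDER = {"fetch": 0, "process": 1, "chunk": 2, "refine": 3, "export": 4, "store": 4}


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for illustration only; not Chonkie's Pipeline."""

    steps: list = field(default_factory=list)

    def add(self, stage: str, name: str) -> "ToyPipeline":
        if stage not in STAGE_ORDER:
            raise ValueError(f"Unknown stage: {stage!r}")
        self.steps.append((stage, name))
        return self  # returning self is what makes the calls chainable

    def ordered(self) -> list:
        # sorted() is stable, so steps in the same stage keep insertion order.
        return sorted(self.steps, key=lambda step: STAGE_ORDER[step[0]])

    def validate(self, has_text_input: bool = False) -> None:
        stages = [stage for stage, _ in self.steps]
        if "chunk" not in stages:
            raise ValueError("Pipeline must have at least one chunker")
        if stages.count("process") > 1:
            raise ValueError("Pipeline cannot have multiple chefs")
        if "fetch" not in stages and not has_text_input:
            raise ValueError("Pipeline needs a fetcher or direct text input")


# Steps are deliberately added out of order.
pipe = (ToyPipeline()
    .add("refine", "overlap")
    .add("chunk", "recursive")
    .add("fetch", "file")
    .add("refine", "embedding")
    .add("process", "text"))

pipe.validate()
print([name for _, name in pipe.ordered()])
# → ['file', 'text', 'recursive', 'overlap', 'embedding']
```

Because the sort is stable, the two refineries keep the order in which they were added, which is one plausible way to preserve the declared order of chained refineries while still moving every step into its CHOMP slot.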