Build powerful text processing workflows with Chonkie’s Pipeline API
Chonkie’s Pipeline API provides a fluent, chainable interface for building text processing workflows. Pipelines follow the CHOMP architecture, automatically orchestrating components in the correct order.
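To see how a fluent, chainable interface can orchestrate steps in a fixed order regardless of how they were chained, here is a minimal standalone sketch. It is an illustration only, not Chonkie's internals; every name in it is hypothetical:

```python
# Minimal fluent-builder sketch: each call records a step and returns self,
# and run() sorts the recorded steps into a fixed stage order before executing.
# Illustration only - these names are hypothetical, not Chonkie's internals.
STAGE_ORDER = {"fetch": 0, "process": 1, "chunk": 2, "refine": 3, "store": 4}

class ToyPipeline:
    def __init__(self):
        self.steps = []

    def _add(self, stage, name):
        self.steps.append((stage, name))
        return self  # returning self is what makes the API chainable

    def fetch_from(self, name): return self._add("fetch", name)
    def process_with(self, name): return self._add("process", name)
    def chunk_with(self, name): return self._add("chunk", name)

    def run(self):
        # Reorder into the canonical stage order, whatever order was chained
        ordered = sorted(self.steps, key=lambda s: STAGE_ORDER[s[0]])
        return [stage for stage, _ in ordered]

# Steps chained "out of order" still execute fetch -> process -> chunk
plan = (ToyPipeline()
        .chunk_with("recursive")
        .fetch_from("file")
        .process_with("text")
        .run())
print(plan)  # ['fetch', 'process', 'chunk']
```

Returning `self` from every builder method is what enables the chained style used throughout the examples below.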
```python
from chonkie import Pipeline

# Process all markdown files in a directory
docs = (
    Pipeline()
    .fetch_from("file", dir="./documents", ext=[".md", ".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run()
)

# Process each document
for doc in docs:
    print(f"Document has {len(doc.chunks)} chunks")
```
```python
# Store in Chroma
.store_in("chroma", collection_name="documents")

# Store in Qdrant
.store_in("qdrant", collection_name="docs", url="http://localhost:6333")
```
```python
# Chunk with embeddings
doc = (
    Pipeline()
    .fetch_from("file", path="research_paper.txt")
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024, similarity_window=3)
    .refine_with("overlap", context_size=100)
    .refine_with("embedding", model="minishlab/potion-base-32M")
    .run()
)

# All chunks now have embeddings
for chunk in doc.chunks:
    if chunk.embedding is not None:
        print(f"Chunk: {chunk.text[:30]}... | Embedding shape: {chunk.embedding.shape}")
```
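The effect of overlap refinement can be approximated in plain Python: each chunk is prefixed with the tail of the previous chunk so that context carries across chunk boundaries. A rough character-based sketch (a hypothetical helper, not Chonkie's actual refinery, which may measure context differently):

```python
def add_overlap(chunks: list[str], context_size: int = 100) -> list[str]:
    """Prefix each chunk with the last `context_size` characters of the
    previous chunk, so neighboring chunks share boundary context.
    Hypothetical sketch - not Chonkie's real overlap refinery."""
    refined = [chunks[0]] if chunks else []
    for prev, cur in zip(chunks, chunks[1:]):
        refined.append(prev[-context_size:] + cur)
    return refined

chunks = ["The quick brown fox", "jumps over the lazy dog"]
print(add_overlap(chunks, context_size=4))
# → ['The quick brown fox', ' foxjumps over the lazy dog']
```

The shared tail gives a retriever or LLM enough surrounding text to interpret a chunk that starts mid-sentence.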
Load pre-configured pipelines from the Chonkie Hub:
```python
# Load markdown processing recipe
pipeline = Pipeline.from_recipe("markdown")

# Run with your content
doc = pipeline.run(texts="# My Markdown\n\nContent here")

# Load custom local recipe
pipeline = Pipeline.from_recipe("custom", path="./my_recipe.json")
```
Always specify file extensions to avoid unwanted files:
```python
# Good - filtered
.fetch_from("file", dir="./docs", ext=[".txt", ".md"])

# Bad - processes everything, including binaries
.fetch_from("file", dir="./docs")
```
Pipelines validate their configuration before execution:
✅ Must have: at least one chunker
✅ Must have: a fetcher OR text input via run(texts=...)
❌ Cannot have: multiple chefs (only one is allowed)
```python
# ❌ Invalid - no chunker
Pipeline().fetch_from("file", path="doc.txt").run()

# ❌ Invalid - multiple chefs
(
    Pipeline()
    .process_with("text")
    .process_with("markdown")  # Error!
    .chunk_with("recursive")
)

# ✅ Valid - has chunker and input source
(
    Pipeline()
    .fetch_from("file", path="doc.txt")
    .chunk_with("recursive", chunk_size=512)
    .run()
)

# ✅ Valid - text input, no fetcher needed
(
    Pipeline()
    .chunk_with("recursive")
    .run(texts="Hello world")
)
```
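The validation rules above could be enforced by a small pre-execution check; here is a sketch under the assumption that planned steps are tracked as (stage, component) pairs (hypothetical names throughout, not Chonkie's actual validator):

```python
def validate(steps: list[tuple[str, str]], has_text_input: bool) -> list[str]:
    """Return a list of configuration errors for a planned pipeline.

    `steps` is a list of (stage, component) pairs, e.g. ("chunk", "recursive").
    Hypothetical sketch - not Chonkie's real validation code.
    """
    stages = [stage for stage, _ in steps]
    errors = []
    if "chunk" not in stages:
        errors.append("pipeline must have at least one chunker")
    if "fetch" not in stages and not has_text_input:
        errors.append("pipeline needs a fetcher or text passed to run(texts=...)")
    if stages.count("process") > 1:
        errors.append("only one chef (process step) is allowed")
    return errors

# No chunker and no input source -> two errors reported at once
print(validate([("process", "text")], has_text_input=False))
```

Collecting all errors into a list, rather than raising on the first one, lets a pipeline report every configuration problem in a single pass.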