# Open Source

> The Open Source Library For RAG

*✨Look Inside! We're Open Source!✨*

**Chonkie's Open Source library** provides lightweight, high-performance features for building modern RAG applications. Install it locally, run anywhere, and keep full control over your chunking pipeline.

## Why Chonkie OSS?

* Released under the MIT license. Use however you like.
* All processing happens locally. Your data never leaves your infrastructure.
* Battle-tested algorithms used by thousands of developers. Optimized for speed and reliability.
* Optimized with caching, parallel processing, and fast tokenizers. Process millions of chunks efficiently.

## Core Capabilities

### Advanced Chunkers

Chonkie OSS includes a comprehensive suite of chunking algorithms, each designed for specific document types and use cases:

**TokenChunker**

**Best for**: General-purpose chunking, most use cases

Splits text into fixed-size token chunks with configurable overlap. The most straightforward and reliable chunking strategy.

Available in: Python, JavaScript

**SentenceChunker**

**Best for**: Q\&A systems, maintaining complete thoughts

Chunks at sentence boundaries while respecting token limits. Ensures sentences are never split mid-thought.

Available in: Python, JavaScript

**RecursiveChunker**

**Best for**: Markdown, structured documents, hierarchical content

Hierarchically chunks using multiple delimiters: paragraphs, then sentences, then words. Preserves document structure naturally.

Available in: Python, JavaScript

**FastChunker**

**Best for**: High-throughput pipelines, large-scale document processing

SIMD-accelerated chunking with 100+ GB/s throughput. Uses byte-size limits for extreme performance without tokenization overhead.

Available in: Python, JavaScript

**TableChunker**

**Best for**: Markdown tables, tabular data

Splits large tables into manageable chunks by rows while preserving headers. Perfect for data-heavy documents.

Available in: Python, JavaScript

**SemanticChunker**

**Best for**: Multi-topic documents, maintaining topical coherence

Uses embeddings to identify natural topic boundaries. Creates chunks based on semantic similarity, not just structure.
Includes Savitzky-Golay filtering and skip-window merging for advanced boundary detection.

Available in: Python, JavaScript

**LateChunker**

**Best for**: Retrieval optimization, higher recall RAG systems

Implements the Late Chunking algorithm from research. Generates document-level embeddings first, then derives chunk embeddings for richer contextual representation.

Available in: Python

**CodeChunker**

**Best for**: Source code, API documentation, technical content

Language-aware chunking using Abstract Syntax Trees (AST). Preserves function and class boundaries for better code understanding.

Available in: Python, JavaScript

**NeuralChunker**

**Best for**: Maximum quality, complex documents with subtle topic shifts

Uses a fine-tuned BERT model to detect semantic shifts in text. ML-powered boundary detection for topic-coherent chunks.

Available in: Python

**SlumberChunker**

**Best for**: Books, research papers, when quality matters most

Agentic chunking powered by LLMs via the Genie interface. Uses generative models (Gemini, OpenAI, etc.) to intelligently determine optimal chunk boundaries.

Available in: Python

### Embedding Providers

Flexible embedding support for semantic chunking and refineries:

* **AutoEmbeddings** - Automatically select the best embeddings for your use case
* **Model2VecEmbeddings** - Ultra-fast static embeddings (default for semantic chunking)
* **SentenceTransformerEmbeddings** - Hugging Face Sentence Transformers models
* **OpenAIEmbeddings** - OpenAI's text-embedding models
* **AzureOpenAIEmbeddings** - Azure-hosted OpenAI embeddings
* **CohereEmbeddings** - Cohere's embedding models
* **JinaEmbeddings** - Jina AI embeddings
* **GeminiEmbeddings** - Google Gemini embeddings
* **VoyageAIEmbeddings** - Voyage AI embeddings
* **Custom Embeddings** - Bring your own embedding model

All embeddings follow a consistent interface and can be swapped seamlessly.

### Refineries

Enhance your chunks with additional context and embeddings:

Adds contextual overlap between chunks to prevent information loss at boundaries.
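As a rough illustration of the contextual-overlap idea, the sketch below prefixes each chunk with the tail of the chunk before it. It is not Chonkie's implementation or API: `add_overlap` is a hypothetical helper, whitespace-separated words stand in for real tokens, and taking context from the previous chunk (rather than the next) is an assumption made for the example.

```python
from typing import List


def add_overlap(chunks: List[str], overlap: int = 3) -> List[str]:
    """Prefix each chunk with the last `overlap` words of the previous chunk.

    Hypothetical helper for illustration only; not part of Chonkie.
    Words are split on whitespace as a stand-in for real tokens.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    refined = []
    for i, chunk in enumerate(chunks):
        # The first chunk has no predecessor, and overlap=0 is a no-op.
        if i == 0 or overlap == 0:
            refined.append(chunk)
            continue
        # Context always comes from the ORIGINAL previous chunk,
        # so overlap does not accumulate across chunks.
        tail = chunks[i - 1].split()[-overlap:]
        refined.append(" ".join(tail + [chunk]))
    return refined


if __name__ == "__main__":
    parts = ["the cat sat on the mat", "it then fell asleep"]
    for part in add_overlap(parts, overlap=2):
        print(part)
```

With an overlap of 2, the second chunk starts with the last two words of the first, so text that straddles a boundary appears in both chunks.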
Configurable overlap sizes for optimal retrieval.

Generates and attaches vector embeddings to your chunks. Supports all major embedding providers with automatic dimension detection.

### Database Handshakes

Seamlessly connect Chonkie to your favorite database:

* Ephemeral or persistent ChromaDB instances
* High-performance vector search with Qdrant
* Knowledge graph + vector search with Weaviate
* Serverless vector database by Turbopuffer
* Managed vector database with Pinecone
* PostgreSQL with pgvector extension
* MongoDB Atlas Vector Search
* Elasticsearch vector search

Each handshake provides a simple interface to embed chunks and write them directly to your database.

### Chefs

Chefs automatically prepare raw data for chunking:

* **TableChef** - Extracts tables from markdown text
* **TextChef** - Processes plain text files into structured Documents
* **MarkdownChef** - Parses markdown with tables, code blocks, and images

### Porters

Export chunks to common formats:

* **JSONPorter** - Export chunks to JSON for storage or processing
* **DatasetsPorter** - Export to Hugging Face Datasets format

### Utils

* **Visualizer** - Rich text visualization of chunks with color-coded boundaries
* **Hubbie** - Hugging Face Hub integration for sharing and loading chunkers

## Language Support

### Python

**Full Feature Set**

All chunkers, embedding providers, refineries, handshakes, chefs, and porters are available. Choose from minimal to full installations based on your needs.

* Default install: Token, Sentence, Recursive, Table chunkers
* Semantic install: + SemanticChunker, LateChunker, NeuralChunker with Model2Vec
* All install: Every feature available

### JavaScript

**Core Chunking**

JavaScript support includes the most commonly used chunkers:

* TokenChunker
* SentenceChunker
* RecursiveChunker
* FastChunker
* TableChunker
* SemanticChunker
* CodeChunker

Available via the `@chonkiejs/core` package with full TypeScript support. Other chunkers are available through the Chonkie Cloud API via `@chonkiejs/cloud`.
To use custom tokenizers with the chunkers, install `@chonkiejs/token`.

## Performance Characteristics

Chonkie OSS is optimized for speed:

* **Pipelining** - Efficient multi-stage processing
* **Caching** - Smart caching to avoid recomputation
* **Fast Tokenizers** - TikToken and AutoTikTokenizer for speed
* **Parallel Processing** - Multi-threaded batch operations
* **Ultra-fast Embeddings** - Model2Vec static embeddings (default)
* **Token Estimate-Validate** - Efficient feedback loops for optimal chunk sizes

Process thousands of documents per second on commodity hardware.

## Next Steps

Ready to get started with Chonkie OSS?

* Install and create your first chunk in under 2 minutes
* Detailed installation options for all features
* Explore all chunking algorithms in detail
* Star the repo and contribute to the project

***

Need hosted chunking with zero setup? Check out our [Chunking API](/common/chunking-api) for a managed solution.