# Code Chunker

> Advanced AST-based code chunking with intelligent semantic preservation

The experimental CodeChunker parses code into an AST and splits along its structure, so chunks follow the code's semantics rather than arbitrary line boundaries.

**Experimental Feature**: This CodeChunker is experimental and may change significantly between versions. Use with caution in production environments.

## Key Features

* **AST-based parsing** using tree-sitter for accurate code understanding
* **Automatic language detection** using Magika for seamless multi-language handling
* **Language-specific rules** for optimal chunking based on programming language
* **Intelligent grouping** of related code elements (imports, comments, classes)
* **Semantic preservation** that prioritizes code coherence over strict size limits
* **Multi-language support** for popular programming languages
* **Recursive splitting** of large code constructs when a chunk size is specified

## Installation

To use the experimental CodeChunker, install the code dependencies (the quotes keep shells such as zsh from interpreting the brackets):

```bash
pip install "chonkie[code]"
```

## Supported Languages

The experimental CodeChunker supports the following programming languages:

* **Python** - Classes, functions, imports, docstrings
* **TypeScript** - Functions, classes, interfaces, modules
* **JavaScript** - Functions, classes, modules, JSX
* **Rust** - Functions, structs, modules, traits
* **Go** - Functions, structs, packages, interfaces
* **Java** - Classes, methods, packages, interfaces
* **C** - Functions, structs, headers
* **C++** - Functions, classes, namespaces, structs
* **C#** - Classes, methods, namespaces, properties
* **HTML** - Tags, elements, attributes
* **CSS** - Rules, selectors, properties
* **Markdown** - Headers, sections, code blocks

## Basic Usage

```python
from chonkie.experimental import CodeChunker

# Create a code chunker for Python
chunker = CodeChunker(language="python")

# Chunk some Python code
code = '''
import os
from typing import List

def process_files(directory: str) -> List[str]:
    """Process all files in a directory."""
    files = []
    for filename in os.listdir(directory):
        if filename.endswith('.py'):
            files.append(filename)
    return files

class FileProcessor:
    def __init__(self, base_dir: str):
        self.base_dir = base_dir
        self.processed_count = 0

    def process(self, filename: str) -> bool:
        """Process a single file."""
        # Processing logic here
        self.processed_count += 1
        return True
'''

chunks = chunker.chunk(code)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk.text)
    print("---")
```

## Advanced Configuration

### With Chunk Size Limit

```python
from chonkie.experimental import CodeChunker

# Set a chunk size limit (chunks may exceed this to preserve semantics)
chunker = CodeChunker(
    language="python",
    chunk_size=2048,  # Target chunk size in characters
    tokenizer="character"
)
```

### Language Auto-Detection

The experimental CodeChunker can detect the programming language automatically using Magika, Google's deep-learning-based file-type detection model:

```python
from chonkie.experimental import CodeChunker

# Let the chunker detect the language automatically
chunker = CodeChunker(language="auto")

# Chunk different types of code - language is detected automatically
python_code = '''
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
'''

javascript_code = '''
function fibonacci(n) {
    if (n <= 1) return n;
    return fibonacci(n-1) + fibonacci(n-2);
}
'''

rust_code = '''
fn fibonacci(n: u32) -> u32 {
    if n <= 1 { n } else { fibonacci(n-1) + fibonacci(n-2) }
}
'''

# All will be chunked with appropriate language-specific rules
python_chunks = chunker.chunk(python_code)   # Detected as Python
js_chunks = chunker.chunk(javascript_code)   # Detected as JavaScript
rust_chunks = chunker.chunk(rust_code)       # Detected as Rust
```

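When the language can already be inferred from a file name, it can be passed explicitly instead of relying on detection. The following is a minimal sketch of that idea, not part of Chonkie: `EXTENSION_TO_LANGUAGE` and `language_for` are hypothetical names, and the mapping lists only language identifiers that appear in examples on this page. Anything else falls back to `"auto"`.

```python
from pathlib import Path

# Hypothetical mapping; limited to identifiers used elsewhere on this page.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".html": "html",
    ".md": "markdown",
    ".c": "c",
}


def language_for(path: str) -> str:
    """Return a language identifier for `path`, or "auto" if the extension is unknown."""
    suffix = Path(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, "auto")


for name in ("src/app.py", "notes/README.MD", "main.zig"):
    print(name, "->", language_for(name))
```

The returned string can then be passed as the `language` argument, for example `CodeChunker(language=language_for(path))`.
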
**Performance Consideration**: With `language="auto"`, the chunker emits a warning that auto-detection may affect performance. In production, specify the language explicitly whenever it is known.

### Split Context Control

```python
from chonkie.experimental import CodeChunker

# Control whether to add split context information
chunker = CodeChunker(
    language="typescript",
    add_split_context=True  # Include context about split locations
)
```

## Understanding Chunk Behavior

### Semantic Preservation

The experimental CodeChunker prioritizes semantic coherence over strict size limits:

```python
from chonkie.experimental import CodeChunker

chunker = CodeChunker(language="python", chunk_size=100)

# This class will likely stay together even if it exceeds 100 characters
code = '''
class SmallButImportant:
    def __init__(self):
        self.value = "important"

    def get_value(self):
        return self.value
'''

chunks = chunker.chunk(code)
# The class will typically be kept as one chunk for semantic coherence
```

### Language-Specific Grouping

Different languages have different grouping behaviors:

```python
# Python code is grouped by logical units
python_code = '''
import numpy as np
import pandas as pd

def data_processor():
    """Process data using pandas."""
    return pd.DataFrame()

class DataAnalyzer:
    def analyze(self, data):
        return np.mean(data)
'''

# Likely chunks:
# 1. Import statements together
# 2. Function definition
# 3. Class definition
```

```javascript
// JavaScript/TypeScript grouping
const code = `
import { Component } from 'react';
import { useState } from 'react';

export const MyComponent = () => {
  const [state, setState] = useState(null);
  return <div>{state}</div>;
};

export class DataService {
  async fetchData() {
    return fetch('/api/data');
  }
}
`;

// Likely chunks:
// 1. Import statements
// 2. Component definition
// 3. Class definition
```

```rust
// Rust code grouping
let rust_code = r#"
use std::collections::HashMap;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct User {
    id: u32,
    name: String,
}

impl User {
    pub fn new(id: u32, name: String) -> Self {
        Self { id, name }
    }
}
"#;

// Likely chunks:
// 1. Use statements
// 2. Struct definition with derives
// 3. Implementation block
```

## Best Practices

### Choose Appropriate Chunk Sizes

```python
from chonkie.experimental import CodeChunker

# For code analysis tasks
chunker = CodeChunker(language="python", chunk_size=2048)

# For embedding generation (smaller chunks often work better)
chunker = CodeChunker(language="python", chunk_size=1024)

# No size limit (preserve all semantic units)
chunker = CodeChunker(language="python", chunk_size=None)
```

### Language-Specific Considerations

```python
from chonkie.experimental import CodeChunker

# For web development files with mixed content
html_chunker = CodeChunker(language="html", chunk_size=800)

# For documentation with code examples
md_chunker = CodeChunker(language="markdown", chunk_size=600)

# For system-level code that needs precise structure
c_chunker = CodeChunker(language="c", chunk_size=1200)
```

## Output Format

Each chunk carries its text along with its position in the source and its token count:

```python
# `chunker` and `code` are the objects defined under Basic Usage
chunks = chunker.chunk(code)
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Start: {chunk.start_index}")
    print(f"End: {chunk.end_index}")
    print(f"Token count: {chunk.token_count}")
```

## Limitations

**Current Limitations**:

* **Experimental status**: APIs may change between versions
* **Performance**: AST parsing may be slower than simple text splitting
* **Language support**: Not all programming languages are supported yet
* **Size flexibility**: Chunks may significantly exceed specified size limits
* **Dependencies**: Requires tree-sitter and language packs

## Migration from Stable CodeChunker

To migrate from the stable CodeChunker to the experimental version, change the import:

```python
# Old stable version
# from chonkie import CodeChunker

# New experimental version
from chonkie.experimental import CodeChunker

# The API is similar but with enhanced capabilities
chunker = CodeChunker(language="python", chunk_size=2048)
```

## Feedback and Support

Since this is an experimental feature, your feedback is valuable:

* **Report issues** on [GitHub](https://github.com/chonkie-inc/chonkie)
* **Share use cases** to help improve the chunker
* **Test with your code** and let us know what works well or needs improvement

The experimental CodeChunker will eventually replace or supplement the stable CodeChunker based on community feedback and testing results.

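When testing the chunker on your own code, it can help to report how far chunks overshoot the requested size, since semantic preservation allows them to exceed it. The sketch below is one way to summarize that, under two assumptions: sizes are measured in characters (as with `tokenizer="character"`), and each chunk exposes a `text` attribute as shown under Output Format. `FakeChunk` and `oversize_report` are hypothetical names used only so the example runs without Chonkie installed.

```python
from dataclasses import dataclass


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a chunk; only the `text` attribute is used."""
    text: str


def oversize_report(chunks, chunk_size):
    """Return (index, length, overshoot) for every chunk longer than `chunk_size` characters."""
    return [
        (i, len(c.text), len(c.text) - chunk_size)
        for i, c in enumerate(chunks)
        if len(c.text) > chunk_size
    ]


# Sample data standing in for the result of chunker.chunk(code)
sample = [FakeChunk("import os\n"), FakeChunk("x" * 150), FakeChunk("y" * 100)]

for index, length, overshoot in oversize_report(sample, chunk_size=100):
    print(f"chunk {index}: {length} chars, {overshoot} over the limit")
```

With real output, `sample` would be replaced by the list returned from `chunker.chunk(code)`.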