CodeChunker
Advanced AST-based code chunking with intelligent semantic preservation
The experimental CodeChunker parses code into an abstract syntax tree (AST) and splits it along structural boundaries instead of by line count, so each chunk preserves the structure and semantics of the code it contains.
Experimental Feature: This CodeChunker is experimental and may change significantly between versions. Use with caution in production environments.
Key Features
- AST-based parsing using tree-sitter for accurate code understanding
- Automatic language detection using Magika for seamless multi-language handling
- Language-specific rules for optimal chunking based on programming language
- Intelligent grouping of related code elements (imports, comments, classes)
- Semantic preservation prioritizes code coherence over strict size limits
- Multi-language support for popular programming languages
- Recursive splitting for large code constructs when chunk size is specified
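To make the AST-based and recursive-splitting ideas concrete, here is a minimal sketch. It is not the CodeChunker implementation: Python's standard-library `ast` module stands in for tree-sitter, it handles only Python source, and `split_source` is a hypothetical helper name. A top-level construct is kept whole when it fits within `chunk_size`; an oversized class is split into its header line and its members.

```python
import ast


def split_source(source: str, chunk_size: int) -> list[str]:
    """Split Python source along AST boundaries (illustrative sketch only).

    Assumption: the stdlib `ast` module replaces tree-sitter, and only
    classes are split recursively. Functions always stay whole.
    """
    chunks: list[str] = []

    def visit(node: ast.stmt) -> None:
        # padded=True keeps the original indentation of nested statements.
        text = ast.get_source_segment(source, node, padded=True)
        # Keep the node whole if it fits, or if it is not a class.
        if len(text) <= chunk_size or not isinstance(node, ast.ClassDef):
            chunks.append(text)
            return
        # Oversized class: emit its first line as a header chunk
        # (a simplification that assumes a single-line class statement),
        # then recurse into each statement of the class body.
        chunks.append(text.splitlines()[0])
        for child in node.body:
            visit(child)

    for top_level in ast.parse(source).body:
        visit(top_level)
    return chunks


if __name__ == "__main__":
    example = (
        "def tiny():\n"
        "    return 0\n"
        "\n"
        "class Big:\n"
        "    def a(self):\n"
        "        return 1\n"
        "\n"
        "    def b(self):\n"
        "        return 2\n"
    )
    for chunk in split_source(example, chunk_size=40):
        print(repr(chunk))
```

Recursing only into classes is a design choice for this sketch: it keeps every function body intact, at the cost of allowing a long function to exceed the requested size.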
Installation
To use the experimental CodeChunker, install the optional code dependencies (tree-sitter and its language packs):
Supported Languages
The experimental CodeChunker supports the following programming languages:
- Python - Classes, functions, imports, docstrings
- TypeScript - Functions, classes, interfaces, modules
- JavaScript - Functions, classes, modules, JSX
- Rust - Functions, structs, modules, traits
- Go - Functions, structs, packages, interfaces
- Java - Classes, methods, packages, interfaces
- C - Functions, structs, headers
- C++ - Functions, classes, namespaces, structs
- C# - Classes, methods, namespaces, properties
- HTML - Tags, elements, attributes
- CSS - Rules, selectors, properties
- Markdown - Headers, sections, code blocks
Basic Usage
Advanced Configuration
With Chunk Size Limit
Language Auto-Detection
The experimental CodeChunker can automatically detect the programming language using Magika, Google’s deep-learning-based file type detection model:
Performance Consideration: When using language="auto", the chunker will show a warning that auto-detection may affect performance. For better performance in production, specify the language explicitly when known.
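One way to follow that advice is to derive the language from the file extension and fall back to auto-detection only when the extension is unknown. The sketch below is illustrative: `EXTENSION_TO_LANGUAGE` and `resolve_language` are hypothetical names, and the language identifier strings are assumptions that should be checked against the values the chunker actually accepts. Only `"auto"` comes from the text above.

```python
from pathlib import Path

# Hypothetical mapping. The identifier strings on the right are assumptions,
# not confirmed CodeChunker values.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".js": "javascript",
    ".rs": "rust",
    ".go": "go",
    ".java": "java",
    ".c": "c",
    ".cpp": "cpp",
    ".html": "html",
    ".css": "css",
    ".md": "markdown",
}


def resolve_language(path: str) -> str:
    """Return an explicit language for known extensions, else "auto"."""
    suffix = Path(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, "auto")


if __name__ == "__main__":
    for name in ("src/main.rs", "App.PY", "Makefile"):
        print(name, "->", resolve_language(name))
```

The returned value would then be passed as the `language` argument, so auto-detection runs only for files whose type cannot be inferred cheaply.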
Split Context Control
Understanding Chunk Behavior
Semantic Preservation
The experimental CodeChunker prioritizes semantic coherence over strict size limits:
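The following sketch illustrates what treating the size limit as a soft target can look like. It is an assumption-laden illustration, not the CodeChunker algorithm: it uses Python's standard-library `ast` module rather than tree-sitter, and `pack_units` is a hypothetical helper. Small top-level statements are packed together up to `chunk_size`, but a single statement is never cut, so an oversized function becomes one chunk that exceeds the limit.

```python
import ast


def pack_units(source: str, chunk_size: int) -> list[str]:
    """Greedily pack top-level statements into chunks (sketch only).

    The size limit is soft: a statement longer than chunk_size is
    emitted whole rather than being cut in the middle.
    """
    units = [
        ast.get_source_segment(source, node)
        for node in ast.parse(source).body
    ]
    chunks: list[str] = []
    current = ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > chunk_size:
            # Adding this unit would overflow: close the current chunk
            # and start a new one with the unit, whatever its size.
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    example = (
        "x = 1\n"
        "y = 2\n"
        "\n"
        "def long_function():\n"
        "    total = 0\n"
        "    for i in range(10):\n"
        "        total += i\n"
        "    return total\n"
    )
    for chunk in pack_units(example, chunk_size=30):
        print(len(chunk), repr(chunk))
```

Because every chunk ends on a statement boundary, each one remains parseable on its own, which is the property that strict character-based splitting gives up.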
Language-Specific Grouping
Different languages have different grouping behaviors:
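As one illustration of a language-specific rule, the sketch below groups a run of consecutive Python import statements into a single chunk while giving every other top-level statement its own chunk. This is a hypothetical rule written against Python's standard-library `ast` module; the actual grouping rules used by the CodeChunker for each language are not shown in this document, and `group_imports` is an invented name.

```python
import ast
from itertools import groupby


def _is_import(node: ast.stmt) -> bool:
    return isinstance(node, (ast.Import, ast.ImportFrom))


def group_imports(source: str) -> list[str]:
    """Chunk Python source, merging consecutive imports (sketch only)."""
    lines = source.splitlines()
    chunks: list[str] = []
    # groupby yields runs of adjacent nodes that share the same key.
    for imports, run in groupby(ast.parse(source).body, key=_is_import):
        nodes = list(run)
        # A run of imports is one span; other nodes are one span each.
        spans = [nodes] if imports else [[node] for node in nodes]
        for span in spans:
            start, end = span[0].lineno, span[-1].end_lineno
            chunks.append("\n".join(lines[start - 1:end]))
    return chunks


if __name__ == "__main__":
    example = (
        "import os\n"
        "import sys\n"
        "from pathlib import Path\n"
        "\n"
        "def main():\n"
        "    return Path(os.getcwd())\n"
        "\n"
        'VERSION = "1.0"\n'
    )
    for chunk in group_imports(example):
        print(repr(chunk))
```

A comparable rule for another language would key on that language's own constructs, for example `use` declarations in Rust or `#include` directives in C.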
Best Practices
Choose Appropriate Chunk Sizes
Language-Specific Considerations
Output Format
Each chunk contains detailed metadata about the code structure:
Limitations
Current Limitations:
- Experimental status: APIs may change between versions
- Performance: AST parsing may be slower than simple text splitting
- Language support: Not all programming languages are supported yet
- Size flexibility: Chunks may significantly exceed specified size limits
- Dependencies: Requires tree-sitter and language packs
Migration from Stable CodeChunker
If migrating from the stable CodeChunker to the experimental version:
Feedback and Support
Since this is an experimental feature, your feedback is valuable:
- Report issues on GitHub
- Share use cases to help improve the chunker
- Test with your code and let us know what works well or needs improvement
The experimental CodeChunker will eventually replace or supplement the stable CodeChunker based on community feedback and testing results.