The CodeChunker splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.
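To make the idea of structure-aware splitting concrete, here is a minimal sketch that uses Python's standard-library `ast` module in place of tree-sitter. It is not CodeChunker's implementation: the function name `split_by_structure`, the character-based size budget, and the greedy packing strategy are all assumptions made for illustration.

```python
import ast


def split_by_structure(source: str, chunk_size: int = 200) -> list[str]:
    """Greedily pack top-level statements into chunks of at most
    chunk_size characters, never cutting through a statement."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks: list[str] = []
    current = ""
    prev_end = 0  # number of source lines already consumed
    for node in tree.body:
        # Take everything from the end of the previous node up to the end of
        # this one, so blank lines, comments, and decorators stay attached.
        segment = "".join(lines[prev_end:node.end_lineno])
        prev_end = node.end_lineno
        if current and len(current) + len(segment) > chunk_size:
            chunks.append(current)
            current = ""
        # A single node larger than chunk_size is kept whole in this sketch.
        current += segment
    current += "".join(lines[prev_end:])  # trailing lines after the last node
    if current:
        chunks.append(current)
    return chunks


SAMPLE = '''import os

def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
'''

for piece in split_by_structure(SAMPLE, chunk_size=50):
    print(repr(piece))
```

Because boundaries fall only between top-level statements, joining the chunks reproduces the original source, and no function or class is split in the middle. A real implementation would also need to recurse into oversized nodes, which this sketch leaves out.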
API Reference
To use the CodeChunker via the API, check out the API reference documentation.
Installation
CodeChunker requires additional dependencies for code parsing. Install them with:
pip install "chonkie[code]"
Initialization
from chonkie import CodeChunker
# Basic initialization with default parameters
chunker = CodeChunker(
    language="python",      # Specify the programming language
    tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
    chunk_size=2048,        # Maximum tokens per chunk
    include_nodes=False     # Optionally include AST nodes in output
)
# Using a custom tokenizer
from tokenizers import Tokenizer
custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
chunker = CodeChunker(
    language="javascript",
    tokenizer=custom_tokenizer,
    chunk_size=2048
)
Parameters
language
str
The programming language of the code. Accepts languages supported by tree-sitter-language-pack.
tokenizer
Union[str, Callable, Any]
default:"character"
Tokenizer or token counting function to use for measuring chunk size.
chunk_size
int
default:2048
Maximum number of tokens per chunk.
include_nodes
bool
default:False
Whether to include AST node information. Note: the base Chunk type does not store node information.
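Since the tokenizer parameter is described as accepting a token counting function, the following sketch shows what such a function might look like. The name `count_code_tokens` and its regex-based counting rule are illustrative assumptions, not part of Chonkie, and the commented-out wiring assumes the callable is invoked with a string and expected to return an integer.

```python
import re


def count_code_tokens(text: str) -> int:
    """Rough lexical token count: each identifier or number counts once,
    and every other non-whitespace character counts once."""
    return len(re.findall(r"[A-Za-z_]\w*|\d+|\S", text))


snippet = "def add(a, b): return a + b"
print("characters:", len(snippet))
print("rough tokens:", count_code_tokens(snippet))

# Hypothetical wiring, assuming the callable form is accepted as described:
# chunker = CodeChunker(language="python", tokenizer=count_code_tokens, chunk_size=512)
```

The choice of counter changes what chunk_size means in practice: the same budget covers far more source text when it is measured in lexical tokens than when it is measured in characters.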
Usage
Single Code Chunking
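code = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""
chunks = chunker.chunk(code)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    # `lang` and `nodes` belonged to the older CodeChunk type and may be
    # absent on the base Chunk, so read them defensively.
    lang = getattr(chunk, "lang", None)
    if lang:
        print(f"Language: {lang}")
    nodes = getattr(chunk, "nodes", None)
    if nodes:
        print(f"Node count: {len(nodes)}")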
Batch Chunking
codes = [
    "def func1():\n    pass",
    "const x = 10;\nfunction add(a, b) { return a + b; }"
]
batch_chunks = chunker.chunk_batch(codes)
for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")
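The batch result above is iterated as one list of chunks per input string. When the chunks feed a single index, it can help to flatten that nesting while remembering which input each chunk came from. The helper below is a sketch only: `flatten_batch` is a hypothetical name, and the `SimpleNamespace` objects stand in for real chunks so the example runs without Chonkie installed.

```python
from types import SimpleNamespace


def flatten_batch(batch_chunks):
    """Yield (doc_index, chunk) pairs from a list of per-document chunk lists."""
    for doc_index, doc_chunks in enumerate(batch_chunks):
        for chunk in doc_chunks:
            yield doc_index, chunk


# Stand-in data shaped like a batch result: one inner list per input string.
fake_batch = [
    [SimpleNamespace(text="def func1():\n    pass")],
    [
        SimpleNamespace(text="const x = 10;"),
        SimpleNamespace(text="function add(a, b) { return a + b; }"),
    ],
]

pairs = list(flatten_batch(fake_batch))
for doc_index, chunk in pairs:
    print(doc_index, repr(chunk.text))
```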
Using as a Callable
# Single code string
chunks = chunker("def greet(name):\n print(f'Hello, {name}')")
# Multiple code strings
batch_chunks = chunker(["int main() { return 0; }", "package main\nimport \"fmt\"\nfunc main() { fmt.Println(\"Hi\") }"])
Return Type
CodeChunker returns chunks as Chunk objects:
@dataclass
class Chunk:
    text: str  # The chunk text (code snippet)
    start_index: int  # Starting position in original code
    end_index: int  # Ending position in original code
    token_count: int  # Number of tokens in chunk
    context: Optional[Context] = None  # Optional context metadata
    embedding: Union[List[float], "np.ndarray", None] = None  # Optional embedding vector
As of version 1.3.0, CodeChunker returns the base Chunk type instead of the specialized CodeChunk type. This simplifies integration with other chunkers and refineries.
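The start_index and end_index fields make it possible to map a chunk back to its place in the original file. The sketch below assumes they are character offsets with an exclusive end, so that slicing the source with them yields the chunk text; that convention is an assumption worth verifying against real output. `SimpleChunk`, `offsets_match`, and `line_span` are hypothetical names used here so the example runs without Chonkie.

```python
from dataclasses import dataclass


@dataclass
class SimpleChunk:
    """Hypothetical stand-in carrying only the fields this sketch needs."""
    text: str
    start_index: int
    end_index: int


def offsets_match(source: str, chunks: list[SimpleChunk]) -> bool:
    """Check that each chunk's offsets slice back to its text (assumed convention)."""
    return all(source[c.start_index:c.end_index] == c.text for c in chunks)


def line_span(source: str, chunk: SimpleChunk) -> tuple[int, int]:
    """Return the 1-based (first_line, last_line) covered by a chunk."""
    first = source.count("\n", 0, chunk.start_index) + 1
    last_char = max(chunk.end_index - 1, chunk.start_index)
    last = source.count("\n", 0, last_char) + 1
    return first, last


source = "def f():\n    pass\nx = 1\n"
chunks = [
    SimpleChunk(source[0:18], 0, 18),
    SimpleChunk(source[18:], 18, len(source)),
]

print(offsets_match(source, chunks))
for c in chunks:
    print(line_span(source, c), repr(c.text))
```

Line spans like these are handy for showing a chunk alongside a file path and line range, for example in search results or code review tooling.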