The CodeChunker
splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.
API Reference
To use the CodeChunker
via the API, check out the API reference documentation.
Installation
CodeChunker requires additional dependencies for code parsing. You can install it with:
pip install "chonkie[code]"
Initialization
from chonkie import CodeChunker
# Basic initialization with default parameters
chunker = CodeChunker(
language="python", # Specify the programming language
tokenizer_or_token_counter="gpt2", # Tokenizer to use
chunk_size=512, # Maximum tokens per chunk
include_nodes=False # Optionally include AST nodes in output
)
# Using a custom tokenizer
from tokenizers import Tokenizer
custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
chunker = CodeChunker(
language="javascript",
tokenizer_or_token_counter=custom_tokenizer,
chunk_size=512
)
Parameters
The programming language of the code. Accepts languages supported by tree-sitter-language-pack
.
tokenizer_or_token_counter
Union[str, Callable, Any]
default:"gpt2"
Tokenizer or token counting function to use for measuring chunk size.
Maximum number of tokens per chunk.
Whether to include the list of corresponding AST Node
objects within each CodeChunk
.
return_type
Literal['chunks', 'texts']
default:"chunks"
Whether to return chunks as CodeChunk
objects or plain text strings.
Single Code Chunking
code = """
def hello_world():
print("Hello, Chonkie!")
class MyClass:
def __init__(self):
self.value = 42
"""
chunks = chunker.chunk(code)
for chunk in chunks:
print(f"Chunk text: {chunk.text}")
print(f"Token count: {chunk.token_count}")
print(f"Language: {chunk.lang}")
if chunk.nodes:
print(f"Node count: {len(chunk.nodes)}")
Batch Chunking
codes = [
"def func1():\n pass",
"const x = 10;\nfunction add(a, b) { return a + b; }"
]
batch_chunks = chunker.chunk_batch(codes)
for doc_chunks in batch_chunks:
for chunk in doc_chunks:
print(f"Chunk: {chunk.text}")
Using as a Callable
# Single code string
chunks = chunker("def greet(name):\n print(f'Hello, {name}')")
# Multiple code strings
batch_chunks = chunker(["int main() { return 0; }", "package main\nimport \"fmt\"\nfunc main() { fmt.Println(\"Hi\") }"])
Return Type
CodeChunker returns chunks as CodeChunk
objects:
@dataclass
class CodeChunk(Chunk):
text: str # The chunk text (code snippet)
start_index: int # Starting position in original code
end_index: int # Ending position in original code
token_count: int # Number of tokens in chunk
lang: Optional[str] = None # Language of the code chunk
nodes: Optional[List[Node]] = None # List of AST nodes if include_nodes=True