The CodeChunker splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.
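To illustrate the idea (not Chonkie's actual implementation, which uses tree-sitter and supports many languages), here is a toy sketch using Python's built-in ast module: split at top-level definition boundaries so a chunk never cuts a function or class in half, then pack consecutive definitions into size-limited chunks.

```python
# Toy illustration of AST-based chunking with Python's built-in ast module.
# This is NOT Chonkie's implementation; it only shows the core idea:
# chunk boundaries fall between top-level AST nodes, never inside one.
import ast


def chunk_by_top_level_nodes(source: str, max_chars: int = 80) -> list[str]:
    """Group consecutive top-level AST nodes into chunks of at most max_chars."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks, current = [], ""
    for node in tree.body:
        # lineno/end_lineno give the node's span (Python 3.8+).
        segment = "".join(lines[node.lineno - 1 : node.end_lineno])
        if current and len(current) + len(segment) > max_chars:
            chunks.append(current)
            current = segment
        else:
            current += segment
    if current:
        chunks.append(current)
    return chunks


src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
for c in chunk_by_top_level_nodes(src, max_chars=30):
    print(repr(c))
```

Because boundaries align with AST nodes, each chunk is a syntactically complete unit, which is what makes code chunks contextually relevant for retrieval.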

Overview

Supported Languages

Each language is identified by the key used with get_language(key) and get_parser(key).

General-Purpose Programming Languages

| Language | Key | License |
| --- | --- | --- |
| ActionScript | actionscript | MIT |
| Ada | ada | MIT |
| Agda | agda | MIT |
| C | c | MIT |
| C++ | cpp | MIT |
| C# | csharp | MIT |
| Dart | dart | MIT |
| Go | go | MIT |
| Java | java | MIT |
| JavaScript | javascript | MIT |
| Julia | julia | MIT |
| Kotlin | kotlin | MIT |
| Nim | nim | MPL-2.0 |
| OCaml | ocaml / ocaml_interface | MIT |
| Perl | perl | Artistic-2.0 |
| Python | python | MIT |
| R | r | MIT |
| Ruby | ruby | MIT |
| Rust | rust | MIT |
| Scala | scala | MIT |
| Swift | swift | MIT |
| TypeScript | typescript | MIT |
| Zig | zig | MIT |

Web, UI & Markup

| Language | Key | License |
| --- | --- | --- |
| HTML | html | MIT |
| CSS | css | MIT |
| SCSS | scss | MIT |
| Astro | astro | MIT |
| Vue | vue | MIT |
| Svelte | svelte | MIT |
| TSX | tsx | MIT |
| Markdown | markdown | MIT |
| Markdown Inline | markdown_inline | MIT |
| Mermaid | mermaid | MIT |
| XML | xml | MIT |
| YAML | yaml | MIT |

Config, Build & DevOps

| Language | Key | License |
| --- | --- | --- |
| Bash | bash | MIT |
| Dockerfile | dockerfile | MIT |
| Git Ignore | gitignore | MIT |
| Git Commit | gitcommit | WTFPL |
| Make | make | MIT |
| Ninja | ninja | MIT |
| Meson | meson | MIT |
| Prisma | prisma | MIT |
| Requirements | requirements | MIT |

Systems, GPU & Low-level

| Language | Key | License |
| --- | --- | --- |
| ASM | asm | MIT |
| CUDA | cuda | MIT |
| GLSL | glsl | MIT |
| HLSL | hlsl | MIT |
| LLVM | llvm | MIT |
| Verilog | verilog | MIT |
| VHDL | vhdl | MIT |
| WGSL | wgsl | MIT |
| WAST / WAT | wasm | Apache-2.0 + LLVM |

API Reference

To use the CodeChunker via the API, check out the API reference documentation.

Installation

CodeChunker requires additional dependencies for code parsing. You can install it with:
pip install "chonkie[code]"
For installation instructions, see the Installation Guide.

Initialization

from chonkie import CodeChunker

chunker = CodeChunker(
    language="python",      # Specify the programming language
    tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
    chunk_size=2048,        # Maximum tokens per chunk
    include_nodes=False     # Optionally include AST nodes in output
)
Automatic language detection using Magika can impact performance. For best results, specify the language directly via the language parameter.

Parameters

language (str, required)
The programming language of the code. Accepts any language supported by tree-sitter-language-pack.

tokenizer (Union[str, Callable, Any], default: "character")
Tokenizer or token-counting function used to measure chunk size.

chunk_size (int, default: 2048)
Maximum number of tokens per chunk.

include_nodes (bool, default: False)
Whether to include AST node information. (Note: the base Chunk type does not store node information.)
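Since the tokenizer parameter accepts a callable, any function that maps a string to a token count can be used. A minimal sketch, counting whitespace-separated words (whether that is a good proxy for your model's tokens is your call; it only needs to return an int):

```python
# A minimal custom token counter: whitespace-separated word count.
# Any callable mapping str -> int works as the tokenizer argument.
def word_count(text: str) -> int:
    return len(text.split())


# Usage with CodeChunker (assumes chonkie is installed; not run here):
# chunker = CodeChunker(language="python", tokenizer=word_count, chunk_size=512)

print(word_count("def add(a, b):\n    return a + b"))
```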

Usage

Single Code Chunking

code = """
def hello_world():
    print("Hello, Chonkie!")

class MyClass:
    def __init__(self):
        self.value = 42
"""
chunks = chunker.chunk(code)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")

Batch Chunking

codes = [
    "def func1():\n    pass",
    "const x = 10;\nfunction add(a, b) { return a + b; }"
]
batch_chunks = chunker.chunk_batch(codes)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk: {chunk.text}")

Using as a Callable

# Single code string
chunks = chunker("def greet(name):\n    print(f'Hello, {name}')")

# Multiple code strings
batch_chunks = chunker(["int main() { return 0; }", "package main\nimport \"fmt\"\nfunc main() { fmt.Println(\"Hi\") }"])

Return Type

CodeChunker returns chunks as Chunk objects:
@dataclass
class Chunk:
    text: str           # The chunk text (code snippet)
    start_index: int    # Starting position in original code
    end_index: int      # Ending position in original code
    token_count: int    # Number of tokens in chunk
    context: Optional[Context] = None    # Optional context metadata
    embedding: Union[list[float], "np.ndarray", None] = None  # Optional embedding vector
As of version 1.3.0, CodeChunker returns the base Chunk type instead of the specialized CodeChunk type. This simplifies integration with other chunkers and refineries.
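Because each Chunk records its span in the original string, slicing the source by start_index and end_index (or concatenating non-overlapping chunk texts) should reproduce the input exactly. A self-contained sketch using a stand-in dataclass that mirrors the fields above (not imported from chonkie):

```python
# Stand-in dataclass mirroring Chunk's positional fields; defined locally
# so this sketch runs without chonkie installed.
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int


source = "def a():\n    pass\n\ndef b():\n    pass\n"
# Two hand-made chunks covering the source; token_count here uses the
# "character" tokenizer convention (one token per character).
chunks = [
    Chunk(source[:18], 0, 18, 18),
    Chunk(source[18:], 18, len(source), len(source) - 18),
]

# Reconstruct the original code from the recorded spans and verify.
rebuilt = "".join(source[c.start_index : c.end_index] for c in chunks)
assert rebuilt == source
print("lossless:", rebuilt == source)
```

This round-trip check is a useful sanity test when passing chunks through downstream refineries that rely on the original offsets.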