The CodeChunker splits source code into chunks along its syntactic structure. It uses an Abstract Syntax Tree (AST) of the code to place chunk boundaries, so that related code stays together in each chunk.
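To make the idea of structure-based chunking concrete, here is a toy sketch. It is not Chonkie's implementation: the `Unit` type, the `groupUnits` function, and the token counts are all hypothetical, and each unit simply stands in for a top-level AST node. The sketch packs consecutive units into groups without exceeding a token budget.

```typescript
// Toy illustration only: this is NOT how the chonkie package is implemented.
// A "unit" stands in for a top-level AST node (import, function, class, ...).
interface Unit {
    text: string;
    tokenCount: number; // assumed to be precomputed by some tokenizer
}

// Greedily pack consecutive units into groups without exceeding chunkSize.
// A unit larger than chunkSize becomes its own group here; a real AST-based
// chunker would more likely descend into that node's children instead.
function groupUnits(units: Unit[], chunkSize: number): Unit[][] {
    const groups: Unit[][] = [];
    let current: Unit[] = [];
    let currentTokens = 0;

    for (const unit of units) {
        // Close the current group if adding this unit would exceed the budget.
        if (current.length > 0 && currentTokens + unit.tokenCount > chunkSize) {
            groups.push(current);
            current = [];
            currentTokens = 0;
        }
        current.push(unit);
        currentTokens += unit.tokenCount;
    }
    if (current.length > 0) {
        groups.push(current);
    }
    return groups;
}

// Hand-written sample units with made-up token counts.
const units: Unit[] = [
    { text: "import os", tokenCount: 2 },
    { text: "def add(x, y):\n    return x + y", tokenCount: 12 },
    { text: "def sub(x, y):\n    return x - y", tokenCount: 12 },
];
const groups = groupUnits(units, 16);
for (const group of groups) {
    console.log(group.map((u) => u.text).join("\n"));
    console.log("---");
}
```

With a budget of 16 tokens, the import and the first function share a group and the second function starts a new one, so no function is cut in half.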

API Reference

To use the CodeChunker via the API, check out the API reference documentation.

Initialization

import { CodeChunker } from "chonkie";

// Basic initialization
// NOTE: Language is required!
const chunker = await CodeChunker.create({
    lang: "typescript"
});

// Using a custom tokenizer
// (a different variable name avoids redeclaring `chunker` in the same scope)
import { Tokenizer } from "@huggingface/transformers";
const tokenizer = await Tokenizer.from_pretrained("Xenova/gpt2");
const chunkerWithTokenizer = await CodeChunker.create({
    lang: "typescript",
    tokenizer
});

Parameters

lang
string
required

Programming language of the code to chunk.

tokenizer
string | Tokenizer
default:"Xenova/gpt2"

Tokenizer to use. Accepts either a model name as a string or a Tokenizer instance.

chunkSize
number
default:"512"

Maximum number of tokens per chunk.

includeNodes
boolean
default:"false"

Whether to include the list of corresponding AST Node objects within each CodeChunk.
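As a quick way to see how the documented defaults combine with caller-supplied options, the sketch below fills in missing values. The `CodeChunkerOptionsSketch` interface and the `withDocumentedDefaults` helper are hypothetical names introduced for illustration; they are not part of the chonkie package, and the library's own option handling may differ.

```typescript
// Hypothetical types and helper for illustration; not part of chonkie.
interface CodeChunkerOptionsSketch {
    lang: string;           // required
    tokenizer?: string;     // a Tokenizer instance is also allowed; omitted here
    chunkSize?: number;
    includeNodes?: boolean;
}

// Fill in the defaults listed in the parameter table above.
function withDocumentedDefaults(
    options: CodeChunkerOptionsSketch
): Required<CodeChunkerOptionsSketch> {
    return {
        lang: options.lang,
        tokenizer: options.tokenizer ?? "Xenova/gpt2",
        chunkSize: options.chunkSize ?? 512,
        includeNodes: options.includeNodes ?? false,
    };
}

// Only lang and chunkSize are given; the rest fall back to the defaults.
const resolved = withDocumentedDefaults({ lang: "python", chunkSize: 256 });
console.log(resolved);
```

Only `lang` has no default, which is why it is the one required field in the sketch.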

Usage

Single Code Chunking

import { CodeChunker } from "chonkie";

const code = "add = lambda x, y: x + y";
const chunker = await CodeChunker.create({
    lang: "python"
});
const chunks = await chunker.chunk(code);

Batch Code Chunking

import { CodeChunker } from "chonkie";

const codes = [
    "add = lambda x, y: x + y",
    "subtract = lambda x, y: x - y",
    "multiply = lambda x, y: x * y",
    "divide = lambda x, y: x / y"
];
const chunker = await CodeChunker.create({
    lang: "python"
});

const chunks = await chunker.chunkBatch(codes);
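After a batch call it is often useful to know which input each chunk came from. The sketch below assumes the batch result is one array of chunks per input string, in input order; that shape is an assumption, so check it against the API reference. `ChunkLike` and `tagWithSource` are hypothetical names, and the sample data is hand-written rather than real chunker output.

```typescript
// Minimal stand-in for the fields used below; the field names follow the
// Return Type section of this page.
interface ChunkLike {
    text: string;
    token_count: number;
}

interface TaggedChunk<T> {
    sourceIndex: number; // position of the originating input string
    chunk: T;
}

// Assumption: `batch` holds one array of chunks per input, in input order.
function tagWithSource<T extends ChunkLike>(batch: T[][]): TaggedChunk<T>[] {
    const tagged: TaggedChunk<T>[] = [];
    batch.forEach((chunks, sourceIndex) => {
        for (const chunk of chunks) {
            tagged.push({ sourceIndex, chunk });
        }
    });
    return tagged;
}

// Hand-written sample data with made-up token counts.
const sampleBatch: ChunkLike[][] = [
    [{ text: "add = lambda x, y: x + y", token_count: 10 }],
    [{ text: "subtract = lambda x, y: x - y", token_count: 10 }],
];
const tagged = tagWithSource(sampleBatch);
for (const item of tagged) {
    console.log(`${item.sourceIndex}: ${item.chunk.text}`);
}
```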

Return Type

CodeChunker returns chunks as CodeChunk objects:

class CodeChunk {
    text: string;
    start_index: number;
    end_index: number;
    token_count: number;
    lang: string;
    nodes: Node[];
}
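The fields above can be read directly from each chunk. The following sketch formats a one-line summary per chunk; `CodeChunkLike` is a local mirror of the documented fields (with `nodes` left out) and `summarize` is a hypothetical helper. The sample chunk is hand-written, and its token count is made up.

```typescript
// Local mirror of the documented CodeChunk fields (nodes omitted for brevity).
interface CodeChunkLike {
    text: string;
    start_index: number;
    end_index: number;
    token_count: number;
    lang: string;
}

// Build one summary line per chunk from its metadata fields.
function summarize(chunks: CodeChunkLike[]): string[] {
    return chunks.map(
        (c, i) =>
            `#${i} ${c.lang} ${c.start_index}-${c.end_index}, ${c.token_count} tokens`
    );
}

// Hand-written sample; real values come from chunker.chunk(code).
const sampleChunks: CodeChunkLike[] = [
    {
        text: "add = lambda x, y: x + y",
        start_index: 0,
        end_index: 24,
        token_count: 10,
        lang: "python",
    },
];
const summaryLines = summarize(sampleChunks);
summaryLines.forEach((line) => console.log(line));
```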