Chonkie Documentation

The CodeChunker splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.

API Reference

To use the CodeChunker via the API, check out the API reference documentation.

Initialization

import { CodeChunker } from "chonkie";

// Basic initialization
// NOTE: Language is required!
const chunker = await CodeChunker.create({
    lang: "typescript"
});

// Using a custom tokenizer
import { Tokenizer } from "@huggingface/transformers";
const tokenizer = await Tokenizer.from_pretrained("Xenova/gpt2");
const chunker = await CodeChunker.create({
    lang: "typescript",
    tokenizer
});

Parameters

lang

string

required

Programming language of the code to chunk.

tokenizer

string | Tokenizer

default:"Xenova/gpt2"

Tokenizer to use. Can be a string identifier (model name) or a Tokenizer instance. Defaults to Xenova/gpt2.

chunkSize

number

default:"512"

Maximum number of tokens per chunk.

includeNodes

boolean

default:"false"

Whether to include the list of corresponding AST Node objects within each CodeChunk.

Usage

Single Code Chunking

import { CodeChunker } from "chonkie";

const code = "add = lambda x, y: x + y";
const chunker = await CodeChunker.create({
    lang: "python"
});
const chunks = await chunker.chunk(code);

Batch Code Chunking

import { CodeChunker } from "chonkie";

const codes = [
    "add = lambda x, y: x + y",
    "subtract = lambda x, y: x - y",
    "multiply = lambda x, y: x * y",
    "divide = lambda x, y: x / y"
];
const chunker = await CodeChunker.create({
    lang: "python"
});

const chunks = await chunker.chunkBatch(codes);

Return Type

CodeChunker returns chunks as CodeChunk objects:

class CodeChunk {
    text: string;
    start_index: number;
    end_index: number;
    token_count: number;
    lang: string;
    nodes: Node[];
}

Getting Started

Chunkers

Code Chunker

API Reference

Initialization

Parameters

Usage

Single Code Chunking

Batch Code Chunking

Return Type

Getting Started

Chunkers

​API Reference

​Initialization

​Parameters

​Usage

​Single Code Chunking

​Batch Code Chunking

​Return Type

API Reference

Initialization

Parameters

Usage

Single Code Chunking

Batch Code Chunking

Return Type