# Code Chunker

> Split code into chunks based on code structure

The `CodeChunker` splits code into chunks based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments.
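To make the idea of structure-aware splitting concrete, here is a minimal sketch that uses Python's built-in `ast` module as a stand-in for tree-sitter. It is not Chonkie's implementation: the function name `chunk_python_source` and the greedy packing strategy are assumptions made for illustration, and sizes are measured in characters.

```python
import ast


def chunk_python_source(source: str, chunk_size: int = 80) -> list[str]:
    """Greedily pack top-level AST nodes into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. A single node larger than
    chunk_size is kept whole rather than split further.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks: list[str] = []
    current = ""
    prev_end = 0  # index of the first line not yet assigned to a segment

    for node in tree.body:
        # Take everything from the end of the previous node through the end
        # of this node, so blank lines and comments between nodes are kept.
        segment = "".join(lines[prev_end:node.end_lineno])
        prev_end = node.end_lineno
        # Start a new chunk when adding this segment would exceed the budget.
        if current and len(current) + len(segment) > chunk_size:
            chunks.append(current)
            current = ""
        current += segment

    # Keep any trailing lines (for example, comments after the last node).
    current += "".join(lines[prev_end:])
    if current:
        chunks.append(current)
    return chunks


source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    x = 2\n"
for piece in chunk_python_source(source, chunk_size=30):
    print(repr(piece))
```

Because chunk boundaries fall only between top-level nodes, a function or class is never cut in half, and joining the chunks reproduces the original source.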

## Overview

* Supports 165+ languages
* Powered by [tree-sitter-language-pack](https://github.com/Goldziher/tree-sitter-language-pack)
* Automatic language detection via [Magika](https://github.com/google/magika), Google's file-type detection library

## Supported Languages

<Accordion title="Show all supported languages">
  Each language is identified by the **key** used with `get_language(key)` and `get_parser(key)`.

  ### General-Purpose Programming Languages

  | Language     | Key                     | License      |
  | ------------ | ----------------------- | ------------ |
  | ActionScript | `actionscript`          | MIT          |
  | Ada          | `ada`                   | MIT          |
  | Agda         | `agda`                  | MIT          |
  | C            | `c`                     | MIT          |
  | C++          | `cpp`                   | MIT          |
  | C#           | `csharp`                | MIT          |
  | Dart         | `dart`                  | MIT          |
  | Go           | `go`                    | MIT          |
  | Java         | `java`                  | MIT          |
  | JavaScript   | `javascript`            | MIT          |
  | Julia        | `julia`                 | MIT          |
  | Kotlin       | `kotlin`                | MIT          |
  | Nim          | `nim`                   | MPL-2.0      |
  | OCaml        | `ocaml`, `ocaml_interface` | MIT       |
  | Perl         | `perl`                  | Artistic-2.0 |
  | Python       | `python`                | MIT          |
  | R            | `r`                     | MIT          |
  | Ruby         | `ruby`                  | MIT          |
  | Rust         | `rust`                  | MIT          |
  | Scala        | `scala`                 | MIT          |
  | Swift        | `swift`                 | MIT          |
  | TypeScript   | `typescript`            | MIT          |
  | Zig          | `zig`                   | MIT          |

  ### Web, UI & Markup

  | Language        | Key               | License |
  | --------------- | ----------------- | ------- |
  | HTML            | `html`            | MIT     |
  | CSS             | `css`             | MIT     |
  | SCSS            | `scss`            | MIT     |
  | Astro           | `astro`           | MIT     |
  | Vue             | `vue`             | MIT     |
  | Svelte          | `svelte`          | MIT     |
  | TSX             | `tsx`             | MIT     |
  | Markdown        | `markdown`        | MIT     |
  | Markdown Inline | `markdown_inline` | MIT     |
  | Mermaid         | `mermaid`         | MIT     |
  | XML             | `xml`             | MIT     |
  | YAML            | `yaml`            | MIT     |

  ### Config, Build & DevOps

  | Language     | Key            | License |
  | ------------ | -------------- | ------- |
  | Bash         | `bash`         | MIT     |
  | Dockerfile   | `dockerfile`   | MIT     |
  | Git Ignore   | `gitignore`    | MIT     |
  | Git Commit   | `gitcommit`    | WTFPL   |
  | Make         | `make`         | MIT     |
  | Ninja        | `ninja`        | MIT     |
  | Meson        | `meson`        | MIT     |
  | Prisma       | `prisma`       | MIT     |
  | Requirements | `requirements` | MIT     |

  ### Systems, GPU & Low-level

  | Language   | Key       | License           |
  | ---------- | --------- | ----------------- |
  | ASM        | `asm`     | MIT               |
  | CUDA       | `cuda`    | MIT               |
  | GLSL       | `glsl`    | MIT               |
  | HLSL       | `hlsl`    | MIT               |
  | LLVM       | `llvm`    | MIT               |
  | Verilog    | `verilog` | MIT               |
  | VHDL       | `vhdl`    | MIT               |
  | WGSL       | `wgsl`    | MIT               |
  | WAST / WAT | `wasm`    | Apache-2.0 + LLVM |
</Accordion>

## API Reference

To use the `CodeChunker` via the API, check out the [API reference documentation](../../api/chunkers/code-chunker).

## Installation

`CodeChunker` requires additional dependencies for code parsing. Install them with the `code` extra:

```bash theme={"system"}
pip install "chonkie[code]"
```

<Info>
  For general installation instructions, see the [Installation Guide](/oss/installation).
</Info>

## Initialization

<CodeGroup>
  ```python Basic initialization theme={"system"}
  from chonkie import CodeChunker

  chunker = CodeChunker(
      language="python",      # Specify the programming language
      tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
      chunk_size=2048,        # Maximum tokens per chunk
      include_nodes=False     # Optionally include AST nodes in output
  )
  ```

  ```python Auto language detection theme={"system"}
  from chonkie import CodeChunker

  chunker = CodeChunker(
      language="auto",        # Auto detects programming language via Magika
      tokenizer="character",  # Default tokenizer (or use "gpt2", etc.)
      chunk_size=2048,        # Maximum tokens per chunk
      include_nodes=False     # Optionally include AST nodes in output
  )
  ```

  ```python Custom tokenizer theme={"system"}
  from chonkie import CodeChunker
  from tokenizers import Tokenizer

  custom_tokenizer = Tokenizer.from_pretrained("your-tokenizer")
  chunker = CodeChunker(
      language="javascript",
      tokenizer=custom_tokenizer,
      chunk_size=2048
  )
  ```
</CodeGroup>

<Note>
  Automatic language detection using Magika can impact performance. For best results, specify the language directly via the `language` parameter.
</Note>
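When the source file path is known, one way to avoid automatic detection is to derive the language key from the file extension. The sketch below is an illustration, not part of Chonkie: `EXTENSION_TO_LANGUAGE` and `language_for_path` are hypothetical names, and the mapping covers only a few of the keys from the supported-languages list.

```python
from pathlib import Path

# Hypothetical extension-to-key table; the values are keys from the
# supported-languages list, the extensions are common conventions.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".tsx": "tsx",
    ".rs": "rust",
    ".go": "go",
    ".java": "java",
    ".c": "c",
    ".cpp": "cpp",
    ".rb": "ruby",
}


def language_for_path(path: str, default: str = "auto") -> str:
    """Return a language key for the file, or `default` if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower(), default)


print(language_for_path("src/server.ts"))
print(language_for_path("Makefile"))
```

The returned value can be passed as the `language` argument, so `"auto"` is used only as a fallback for unrecognized extensions.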

## Parameters

<ParamField path="language" type="str" required>
  The programming language of the code. Accepts any language key supported by
  `tree-sitter-language-pack`, or `"auto"` to detect the language with Magika.
</ParamField>

<ParamField path="tokenizer" type="Union[str, Callable, Any]" default="character">
  Tokenizer or token counting function to use for measuring chunk size.
</ParamField>

<ParamField path="chunk_size" type="int" default="2048">
  Maximum number of tokens per chunk.
</ParamField>

<ParamField path="include_nodes" type="bool" default="False">
  Whether to include AST node information in the output. Note that the base
  `Chunk` type does not store node information.
</ParamField>
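Since `tokenizer` accepts a token counting function, a plain callable that maps a string to an integer is one option. The sketch below shows such a function. The name `count_code_tokens` and its regex-based counting rule are assumptions for illustration, not a tokenizer that ships with Chonkie.

```python
import re

# Identifiers/keywords, runs of digits, or single punctuation characters.
_TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def count_code_tokens(text: str) -> int:
    """Count rough lexical tokens in a code string (hypothetical counting rule)."""
    return len(_TOKEN_PATTERN.findall(text))


print(count_code_tokens("def add(a, b): return a + b"))  # 12
```

Assuming `chonkie[code]` is installed, the function could then be supplied as `CodeChunker(language="python", tokenizer=count_code_tokens, chunk_size=512)`, in which case `chunk_size` would be measured in these tokens rather than characters.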

## Usage

### Single Code Chunking

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    code = """
    def hello_world():
        print("Hello, Chonkie!")

    class MyClass:
        def __init__(self):
            self.value = 42
    """
    chunks = chunker.chunk(code)

    for chunk in chunks:
        print(f"Chunk text: {chunk.text}")
        print(f"Token count: {chunk.token_count}")
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={"system"}
    const code = `
    def hello_world():
        print("Hello, Chonkie!")

    class MyClass:
        def __init__(self):
            self.value = 42
    `;
    const chunks = await chunker.chunk(code);

    for (const chunk of chunks) {
      console.log(`Chunk text: ${chunk.text}`);
      console.log(`Token count: ${chunk.tokenCount}`);
    }
    ```
  </Tab>
</Tabs>

### Batch Chunking

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    codes = [
        "def func1():\n    pass",
        "const x = 10;\nfunction add(a, b) { return a + b; }"
    ]
    batch_chunks = chunker.chunk_batch(codes)

    for doc_chunks in batch_chunks:
        for chunk in doc_chunks:
            print(f"Chunk: {chunk.text}")
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={"system"}
    const codes = [
      "def func1():\n    pass",
      "const x = 10;\nfunction add(a, b) { return a + b; }"
    ];
    const batchChunks = await chunker.chunkBatch(codes);

    for (const docChunks of batchChunks) {
      for (const chunk of docChunks) {
        console.log(`Chunk: ${chunk.text}`);
      }
    }
    ```
  </Tab>
</Tabs>

### Using as a Callable

```python theme={"system"}
# Single code string
chunks = chunker("def greet(name):\n    print(f'Hello, {name}')")

# Multiple code strings
batch_chunks = chunker([
    "int main() { return 0; }",
    "package main\nimport \"fmt\"\nfunc main() { fmt.Println(\"Hi\") }",
])
```

## Return Type

`CodeChunker` returns a list of `Chunk` objects:

```python theme={"system"}
@dataclass
class Chunk:
    text: str           # The chunk text (code snippet)
    start_index: int    # Starting position in original code
    end_index: int      # Ending position in original code
    token_count: int    # Number of tokens in chunk
    context: Optional[Context] = None    # Optional context metadata
    embedding: Union[list[float], "np.ndarray", None] = None  # Optional embedding vector
```

<Note>
  As of version 1.3.0, `CodeChunker` returns the base `Chunk` type instead of the
  specialized `CodeChunk` type. This simplifies integration with other chunkers
  and refineries.
</Note>
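The `start_index` and `end_index` fields can be used to map a chunk back to its location in the source. The sketch below converts a pair of offsets into 1-based line numbers. It assumes the offsets are character positions with an exclusive end, which should be verified against your Chonkie version, and `line_range` is a hypothetical helper rather than a library function.

```python
def line_range(source: str, start_index: int, end_index: int) -> tuple[int, int]:
    """Return 1-based (first_line, last_line) for the span [start_index, end_index).

    Assumes character offsets with an exclusive end index.
    """
    first = source.count("\n", 0, start_index) + 1
    # Look at the last character inside the span so a trailing newline
    # does not push the result onto the following line.
    last_char = max(start_index, end_index - 1)
    last = source.count("\n", 0, last_char) + 1
    return first, last


source = "a = 1\n\ndef f():\n    return a\n"
start = source.index("def f")
print(line_range(source, start, len(source)))  # (3, 4)
```

With a real chunk, the call would look like `line_range(code, chunk.start_index, chunk.end_index)`.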
