RecursiveChunker
Recursively chunk documents into smaller, semantically meaningful pieces using customizable rules.
The RecursiveChunker recursively splits documents into smaller, semantically meaningful chunks using customizable hierarchical rules. It is ideal for long or structured documents (e.g., books, research papers, technical docs) where you want to preserve logical structure while respecting token limits.
Installation
RecursiveChunker is included in the base installation of Chonkie TS. No additional dependencies are required.
Initialization
Parameters
tokenizer: Tokenizer to use. Can be a string identifier (model name) or a Tokenizer instance. Defaults to the Xenova/gpt2 tokenizer.
chunkSize: Maximum number of tokens per chunk.
rules: Rules that define how text is recursively chunked. Supports hierarchical, multi-level splitting (e.g., paragraphs, then sentences, then tokens). See RecursiveRules below.
minCharactersPerChunk: Minimum number of characters per chunk. Chunks shorter than this may be merged with adjacent chunks.
returnType: Whether to return chunks as RecursiveChunk objects with metadata ('chunks') or as plain text strings ('texts').
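To make the minCharactersPerChunk rule concrete, here is a minimal sketch of merging short pieces into a neighbour. It is an illustration under assumptions, not Chonkie's implementation: mergeShortPieces is a hypothetical helper, and it ignores chunkSize, which a real chunker would also have to respect when merging.

```typescript
// Hypothetical helper (not part of Chonkie's API): merge any piece shorter
// than `minChars` characters into its neighbour so no tiny fragment survives.
function mergeShortPieces(pieces: string[], minChars: number): string[] {
  const result: string[] = [];
  for (const piece of pieces) {
    const prev = result.pop();
    if (prev !== undefined && (piece.length < minChars || prev.length < minChars)) {
      // Either the previous piece or the new one is too short: join them.
      result.push(prev + piece);
    } else {
      if (prev !== undefined) result.push(prev);
      result.push(piece);
    }
  }
  return result;
}

const pieces = [
  "Hi. ",
  "This is a longer sentence. ",
  "Ok. ",
  "Another reasonably long sentence.",
];
const merged = mergeShortPieces(pieces, 10);
console.log(merged);
```

Merging never drops or reorders characters, so concatenating the merged pieces reproduces the input text.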
Usage
Single Text Chunking
Batch Chunking
Using as a Callable
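The callable form accepts either a single text or an array of texts. The sketch below only illustrates that call shape with a toy stand-in: createToyChunker and its word-based splitting are hypothetical and are not how RecursiveChunker is built. Only the chunker(text) / chunker([text1, text2]) calls and the chunkBatch method name mirror this page.

```typescript
// Toy stand-in, NOT Chonkie's implementation: a function object that accepts
// one text or an array of texts and also exposes a chunkBatch method.
interface ToyChunker {
  (input: string): Promise<string[]>;
  (input: string[]): Promise<string[][]>;
  chunkBatch(texts: string[]): Promise<string[][]>;
}

function createToyChunker(maxWords: number): ToyChunker {
  // Split one text into groups of at most `maxWords` words.
  const chunkOne = (text: string): string[] => {
    const words = text.split(/\s+/).filter((w) => w.length > 0);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxWords) {
      out.push(words.slice(i, i + maxWords).join(" "));
    }
    return out;
  };

  const chunkBatch = async (texts: string[]): Promise<string[][]> =>
    texts.map((t) => chunkOne(t));

  // Overloads: a string yields one list of chunks, an array yields one list per text.
  function call(input: string): Promise<string[]>;
  function call(input: string[]): Promise<string[][]>;
  async function call(input: string | string[]): Promise<string[] | string[][]> {
    return typeof input === "string" ? chunkOne(input) : chunkBatch(input);
  }

  return Object.assign(call, { chunkBatch });
}

async function demo(): Promise<{ single: string[]; batch: string[][] }> {
  const toy = createToyChunker(3);
  const single = await toy("one two three four five");
  const batch = await toy(["a b c d", "e"]);
  return { single, batch };
}

demo().then(({ single, batch }) => console.log(single, batch));
```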
Return Type
RecursiveChunker returns chunks as RecursiveChunk objects by default. Each chunk includes metadata: text, startIndex, endIndex, tokenCount, and level (recursion depth).
If returnType is set to 'texts', only the chunked text strings are returned.
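As a rough picture of the two return types, the sketch below models the metadata fields this page names (text, startIndex, endIndex, tokenCount, level) and the effect of returnType. Several things here are assumptions made for illustration: ChunkLike, toChunks, and applyReturnType are hypothetical names, endIndex is treated as exclusive, and tokens are counted as whitespace-separated words rather than with a real tokenizer.

```typescript
// Hypothetical shape mirroring the metadata fields named on this page.
interface ChunkLike {
  text: string;
  startIndex: number; // offset of the chunk's first character in the source
  endIndex: number;   // ASSUMED exclusive: source.slice(startIndex, endIndex) === text
  tokenCount: number;
  level: number;      // recursion depth at which the chunk was produced
}

// Stand-in tokenizer (assumption): counts whitespace-separated words.
function countTokens(text: string): number {
  return text.split(/\s+/).filter((t) => t.length > 0).length;
}

// Attach offsets and counts to consecutive pieces of one source text.
function toChunks(pieces: string[], level: number): ChunkLike[] {
  let offset = 0;
  return pieces.map((text) => {
    const chunk: ChunkLike = {
      text,
      startIndex: offset,
      endIndex: offset + text.length,
      tokenCount: countTokens(text),
      level,
    };
    offset += text.length;
    return chunk;
  });
}

// 'chunks' keeps the metadata; 'texts' keeps only the strings.
function applyReturnType(
  chunks: ChunkLike[],
  returnType: "chunks" | "texts",
): ChunkLike[] | string[] {
  return returnType === "texts" ? chunks.map((c) => c.text) : chunks;
}

const source = "First paragraph.\n\nSecond paragraph here.";
const chunks = toChunks(["First paragraph.\n\n", "Second paragraph here."], 0);
const texts = applyReturnType(chunks, "texts");
console.log(chunks, texts);
```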
RecursiveRules & RecursiveLevel
The rules parameter allows for flexible, hierarchical chunking strategies. You can specify a list of levels, each with its own delimiters or whitespace splitting.
Each level is a RecursiveLevel:
- delimiters: Custom string(s) to split on (cannot be used with whitespace).
- whitespace: If true, splits on whitespace (cannot be used with delimiters).
- includeDelim: Whether to include the delimiter with the previous chunk ("prev", default) or the next chunk ("next").
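The level options above can be sketched as a small recursive splitter. This is a simplified illustration, not the library's algorithm: LevelLike, splitAtLevel, and recursiveSplit are hypothetical names, tokens are approximated as whitespace-separated words, and the sketch does not merge small neighbouring pieces back together as a full chunker typically would.

```typescript
// Hypothetical mirror of the level options described above.
interface LevelLike {
  delimiters?: string | string[];
  whitespace?: boolean;
  includeDelim?: "prev" | "next";
}

// Stand-in tokenizer (assumption): counts whitespace-separated words.
function countTokens(text: string): number {
  return text.split(/\s+/).filter((t) => t.length > 0).length;
}

// Split text once according to a single level, keeping every character.
function splitAtLevel(text: string, level: LevelLike): string[] {
  if (level.whitespace && level.delimiters !== undefined) {
    throw new Error("delimiters and whitespace cannot be combined");
  }
  if (level.whitespace) return text.match(/\S+\s*|\s+/g) ?? [];
  const raw = level.delimiters ?? [];
  const delims = (typeof raw === "string" ? [raw] : raw).filter((d) => d.length > 0);
  if (delims.length === 0) return [text];

  const pieces: string[] = [];
  let current = "";
  let i = 0;
  while (i < text.length) {
    let hit = "";
    for (const d of delims) {
      if (text.slice(i, i + d.length) === d) { hit = d; break; }
    }
    if (hit === "") { current += text.charAt(i); i += 1; continue; }
    if (level.includeDelim === "next") {
      // The delimiter starts the following piece.
      if (current.length > 0) pieces.push(current);
      current = hit;
    } else {
      // Default "prev": the delimiter stays with the piece it ends.
      pieces.push(current + hit);
      current = "";
    }
    i += hit.length;
  }
  if (current.length > 0) pieces.push(current);
  return pieces;
}

// Descend one level at a time, only for pieces that are still too large.
function recursiveSplit(text: string, levels: LevelLike[], maxTokens: number, depth = 0): string[] {
  const level = levels[depth];
  if (level === undefined || countTokens(text) <= maxTokens) return [text];
  const out: string[] = [];
  for (const piece of splitAtLevel(text, level)) {
    for (const sub of recursiveSplit(piece, levels, maxTokens, depth + 1)) out.push(sub);
  }
  return out;
}

const levels: LevelLike[] = [
  { delimiters: ["\n\n"] },           // paragraphs
  { delimiters: [". ", "! ", "? "] }, // sentences
  { whitespace: true },               // words
];
const text = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda.";
const out = recursiveSplit(text, levels, 4);
console.log(out);
```

Because each level keeps its delimiters attached to a piece, joining the output reproduces the original text, which is what makes character offsets such as startIndex and endIndex meaningful.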
See the DOCS.md for more on advanced rule customization.
Notes:
- The chunker is directly callable as a function after creation: const chunks = await chunker(text) or await chunker([text1, text2]).
- If returnType is set to 'chunks', each chunk includes metadata: text, startIndex, endIndex, tokenCount, and level (recursion depth).
- The rules parameter allows for hierarchical chunking (e.g., paragraphs → sentences → tokens). See RecursiveRules and RecursiveLevel for customization.
- Chunks shorter than minCharactersPerChunk may be merged with adjacent chunks.
- The chunker ensures that no chunk exceeds the specified chunkSize in tokens.
- The chunkBatch method (or calling with an array) allows efficient batch processing.
- For more details, see the TypeScript API Reference.