The RecursiveChunker recursively splits documents into smaller chunks using a hierarchy of rules. It is a good choice for documents that are long but well structured, for example a book or a research paper.
API Reference
To use the RecursiveChunker via the API, check out the API reference documentation.
Installation
The RecursiveChunker is included in the base installation of Chonkie. No additional dependencies are required.
If you would like to use custom tokenizers in JavaScript, please install the @chonkiejs/token library.
Initialization
The RecursiveChunker uses RecursiveRules to determine how to chunk the text. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
Find more information about the rules in the Additional Information section.
from chonkie import RecursiveChunker, RecursiveRules

chunker = RecursiveChunker(
    tokenizer="character",          # string identifier or tokenizer instance
    chunk_size=2048,                # maximum tokens per chunk
    rules=RecursiveRules(),         # recursive splitting rules
    min_characters_per_chunk=24,    # minimum characters per chunk
)
You can also initialize the RecursiveChunker using a recipe. Recipes are pre-defined rules for common chunking tasks.
Find all available recipes on our Hugging Face Hub here.
Recipes are supported in Python only.
from chonkie import RecursiveChunker
# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")
# Initialize the recursive chunker to chunk Hindi texts
chunker = RecursiveChunker.from_recipe(lang="hi")
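A chunker built from a recipe is used exactly like one constructed directly. A minimal sketch; the Markdown string below is only an illustration:

from chonkie import RecursiveChunker

chunker = RecursiveChunker.from_recipe("markdown", lang="en")

# Any Markdown text can be passed here; this snippet is illustrative.
doc = "# Title\n\nFirst paragraph.\n\n## Section\n\nMore text under the section."

for chunk in chunker.chunk(doc):
    print(chunk.text)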
Parameters
tokenizer (Union[str, Callable, Any], default: "character")
Tokenizer to use. Can be a string identifier or a tokenizer instance.

chunk_size (int, default: 2048)
Maximum number of tokens per chunk.

rules (RecursiveRules, default: RecursiveRules())
Rules to use for chunking.

min_characters_per_chunk / minCharactersPerChunk (int, default: 24)
Minimum number of characters per chunk.
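For example, to count tokens with a real tokenizer rather than characters, you can pass a tokenizer identifier and a smaller chunk budget. This is a sketch; "gpt2" is assumed here to be a tokenizer identifier supported in your environment:

from chonkie import RecursiveChunker

# Sketch: token-based counting with a tighter budget.
# "gpt2" is an assumed tokenizer identifier; a tokenizer instance works too.
chunker = RecursiveChunker(
    tokenizer="gpt2",
    chunk_size=512,
    min_characters_per_chunk=24,
)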
Usage
Single Text Chunking
text = """This is the first sentence. This is the second sentence.
And here's a third one with some additional context."""
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
Batch Chunking
texts = [
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
    "This is the first sentence. This is the second sentence. And here's a third one with some additional context.",
]

# chunk_batch returns one list of chunks per input text
batch_chunks = chunker.chunk_batch(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk text: {chunk.text}")
        print(f"Token count: {chunk.token_count}")
Using as a Callable
# Single text
chunks = chunker("This is the first sentence. This is the second sentence.")
# Multiple texts
batch_chunks = chunker(["Text 1. More text.", "Text 2. More."])
Return Type
The RecursiveChunker returns chunks as Chunk objects:
@dataclass
class Chunk:
    text: str          # The chunk text
    start_index: int   # Starting position in original text
    end_index: int     # Ending position in original text
    token_count: int   # Number of tokens in the chunk
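The start and end indices let you map a chunk back to its position in the source string. A minimal sketch, reusing the chunker and text from the usage examples above:

chunks = chunker.chunk(text)

for chunk in chunks:
    # The offsets point back into the original string, so slicing the
    # source text should reproduce the chunk's text.
    assert text[chunk.start_index:chunk.end_index] == chunk.text
    print(chunk.start_index, chunk.end_index, chunk.token_count)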
Additional Information
The RecursiveChunker uses the RecursiveRules class to determine the chunking rules. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
@dataclass
class RecursiveRules:
    rules: List[RecursiveLevel]

@dataclass
class RecursiveLevel:
    delimiters: Optional[Union[str, List[str]]] = None         # Delimiters to split on at this level
    whitespace: bool = False                                    # Split on whitespace instead of custom delimiters
    include_delim: Optional[Literal["prev", "next"]] = "prev"   # Include the delimiter in the previous chunk, the next chunk, or drop it (None)
You can pass custom rules to the RecursiveChunker, or use the default rules. The default rules are designed to be a good starting point for most documents, but you can customize them to your needs; a sketch of custom rules follows the note below.
RecursiveLevel expects the list of custom delimiters to not include whitespace. If whitespace as a delimiter is required, set the whitespace parameter of the RecursiveLevel class to True. Note that if whitespace = True, you cannot pass a list of custom delimiters.
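As an illustration, here is a minimal sketch of custom rules. The delimiter choices are examples rather than the library defaults, and it assumes RecursiveLevel is importable from the top-level chonkie package alongside RecursiveRules:

from chonkie import RecursiveChunker, RecursiveRules, RecursiveLevel

# Illustrative hierarchy: paragraphs first, then sentences, then plain
# whitespace splitting as a final fallback.
rules = RecursiveRules(
    rules=[
        RecursiveLevel(delimiters=["\n\n"], include_delim="prev"),
        RecursiveLevel(delimiters=[". ", "! ", "? "], include_delim="prev"),
        RecursiveLevel(whitespace=True),
    ]
)

chunker = RecursiveChunker(rules=rules, chunk_size=512)
chunks = chunker.chunk("First paragraph.\n\nSecond paragraph with two sentences. Here is the second one.")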