Chonkie Documentation

The OverlapRefinery enhances chunks by incorporating context from neighboring chunks. This is useful for tasks where maintaining contextual continuity between chunks is important, such as question answering or summarization over long documents. It can add context as a prefix (from the preceding chunk) or a suffix (from the next chunk).

API Reference

To use the OverlapRefinery via the API, check out the API reference documentation.

Initialization

To use the OverlapRefinery, initialize it with the desired parameters. You can specify a tokenizer, context size, overlap mode, method, and other options.

from chonkie import OverlapRefinery

# Initialize with default character-level overlap (25% context size)
overlap_refinery = OverlapRefinery()

# Initialize with a specific tokenizer and context size
overlap_refinery_token = OverlapRefinery(
    tokenizer="character",             # Default tokenizer (or use "gpt2", etc.)
    context_size=0.25,                 # The size of the context to add to the chunks.
    method="prefix",                   # Add context from the previous chunk
    merge=True                         # Merge context directly into chunk text
)

# Initialize for recursive overlap based on rules
from chonkie import RecursiveRules, RecursiveLevel
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"], include_delim="prev"),
        RecursiveLevel(delimiters=["."], include_delim="prev"),
        RecursiveLevel(whitespace=True)
    ]
)
overlap_refinery_recursive = OverlapRefinery(
    tokenizer="character",
    context_size=0.25,
    mode="recursive",
    rules=rules,
    method="suffix"
)

Usage

Use the OverlapRefinery object as a callable or use the refine method to add overlapping context to your chunks.

from chonkie import TokenChunker, OverlapRefinery

test_string = "This is the first sentence. This is the second sentence, providing context. This is the third sentence, which needs context from the second."
chunker = TokenChunker()
chunks = chunker(test_string)

# Initialize refinery to add suffix overlap
overlap_refinery = OverlapRefinery(
    tokenizer="character",
    context_size=0.5,
    method="suffix",
    merge=True
)

refined_chunks = overlap_refinery(chunks)

Parameters

tokenizer

Union[str, Callable, Any]

default: "character"

The tokenizer to use for calculating overlap size. Can be a string identifier (e.g., “character”, “word”, “gpt2”), a callable, or a chonkie.Tokenizer instance. Defaults to “character”.

context_size

Union[int, float]

default: 0.25

The size of the overlap context. If an int, it’s the absolute number of tokens. If a float (between 0 and 1), it’s the fraction of the maximum chunk token count.
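The int-versus-float rule can be sketched in a few lines. This is a minimal illustration, not chonkie's code: `resolve_context_size` is a hypothetical helper, and both the rounding (truncation) and the accepted float range (0 exclusive to 1 inclusive) are assumptions.

```python
from typing import List, Union


def resolve_context_size(
    context_size: Union[int, float], chunk_token_counts: List[int]
) -> int:
    """Hypothetical helper: turn context_size into an absolute token count."""
    if isinstance(context_size, bool):
        raise TypeError("context_size must be an int or a float")
    if isinstance(context_size, int):
        if context_size <= 0:
            raise ValueError("an int context_size must be positive")
        # An int is used as-is: the absolute number of tokens.
        return context_size
    if not 0 < context_size <= 1:
        raise ValueError("a float context_size must lie in (0, 1]")
    # A float is a fraction of the largest chunk's token count.
    # Assumption: the result is truncated to a whole number of tokens.
    return int(context_size * max(chunk_token_counts))


print(resolve_context_size(5, [10, 40]))
print(resolve_context_size(0.25, [10, 40]))
```

With chunks of 10 and 40 tokens, an int of 5 stays 5, while 0.25 resolves against the larger chunk.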

mode

Literal["token", "recursive"]

default: "token"

The mode for calculating overlap. "token" uses the tokenizer directly. "recursive" uses hierarchical splitting based on rules.
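One way to picture the difference between the two modes is sketched below, using characters as tokens. This is an assumption-heavy illustration rather than chonkie's algorithm: `token_context`, `recursive_context`, and `split_keep_delim` are hypothetical helpers, and the plain delimiter list stands in for a RecursiveRules object.

```python
from typing import List


def token_context(text: str, size: int) -> str:
    """Token-style context: cut exactly `size` characters, wherever that lands."""
    return text[:size]


def split_keep_delim(text: str, delim: str) -> List[str]:
    """Split on `delim`, keeping the delimiter at the end of each piece."""
    parts = text.split(delim)
    pieces = [part + delim for part in parts[:-1]]
    if parts[-1]:
        pieces.append(parts[-1])
    return pieces


def recursive_context(text: str, size: int, delimiters: List[str]) -> str:
    """Recursive-style context: prefer whole delimiter-bounded pieces.

    Tries each delimiter level in order and keeps leading pieces while they
    fit within `size`. Falls back to a plain cut if no level yields a piece.
    """
    for delim in delimiters:
        taken = ""
        for piece in split_keep_delim(text, delim):
            if len(taken) + len(piece) > size:
                break
            taken += piece
        if taken:
            return taken
    return text[:size]


next_chunk = "First point. Second point follows here. Third."
print(repr(token_context(next_chunk, 20)))
print(repr(recursive_context(next_chunk, 20, ["\n\n", ".", " "])))
```

In this sketch the token-style cut stops mid-sentence, while the recursive-style cut returns only the sentence that fits completely within the budget.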

method

Literal["suffix", "prefix"]

default: "suffix"

The method for adding context. "suffix" adds context from the next chunk to the end of the current chunk. "prefix" adds context from the previous chunk to the beginning of the current chunk.
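The two methods can be illustrated on plain strings with character-level overlap. This is a sketch, not the library's implementation: `add_overlap` is a hypothetical function, and it assumes that prefix context comes from the end of the previous chunk while suffix context comes from the start of the next one.

```python
from typing import List


def add_overlap(texts: List[str], size: int, method: str = "suffix") -> List[str]:
    """Hypothetical character-level overlap over a list of chunk texts."""
    if method not in ("prefix", "suffix"):
        raise ValueError("method must be 'prefix' or 'suffix'")
    if size <= 0:
        raise ValueError("size must be positive")
    refined = []
    for i, text in enumerate(texts):
        if method == "prefix" and i > 0:
            # Prepend the last `size` characters of the previous chunk.
            text = texts[i - 1][-size:] + text
        elif method == "suffix" and i < len(texts) - 1:
            # Append the first `size` characters of the next chunk.
            text = text + texts[i + 1][:size]
        refined.append(text)
    return refined


texts = ["abcdef", "ghijkl", "mnopqr"]
print(add_overlap(texts, 2, method="prefix"))  # ['abcdef', 'efghijkl', 'klmnopqr']
print(add_overlap(texts, 2, method="suffix"))  # ['abcdefgh', 'ghijklmn', 'mnopqr']
```

Note that the first chunk has no predecessor and the last chunk has no successor, so each method leaves one end of the list unchanged.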

rules

RecursiveRules

default: RecursiveRules()

The rules used for splitting text when mode is "recursive". Defines delimiters and behavior at different hierarchical levels. See chonkie.types.RecursiveRules.

merge

bool

default: True

If True, the calculated context is prepended (for prefix) or appended (for suffix) directly to chunk.text. If False, the context is stored in the chunk.context attribute and chunk.text is left unmodified.
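The effect of merge can be shown with a stand-in chunk type. `SimpleChunk` and `attach_context` are hypothetical names for illustration only. Whether the real library also records the context on the chunk when merging is not stated here, so this sketch leaves it unset in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk, with only the two fields discussed."""
    text: str
    context: Optional[str] = None


def attach_context(
    chunk: SimpleChunk, context: str, method: str = "suffix", merge: bool = True
) -> SimpleChunk:
    """Either fold the context into chunk.text or keep it in chunk.context."""
    if method not in ("prefix", "suffix"):
        raise ValueError("method must be 'prefix' or 'suffix'")
    if merge:
        # merge=True: the text itself grows by the context.
        if method == "prefix":
            chunk.text = context + chunk.text
        else:
            chunk.text = chunk.text + context
    else:
        # merge=False: the text is untouched and the context is stored alongside.
        chunk.context = context
    return chunk


merged = attach_context(SimpleChunk("world"), "hello ", method="prefix", merge=True)
separate = attach_context(SimpleChunk("hello"), " world", method="suffix", merge=False)
print(merged.text)                      # hello world
print(separate.text, separate.context)  # hello  world
```

Keeping the context separate is useful when the original chunk text must stay intact, for example when character offsets into the source document matter.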

inplace

bool

default: True

If True, modifies the input list of chunks directly. If False, returns a new list of modified chunks.
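The practical consequence of inplace is whether the caller's original list changes. The sketch below uses plain dicts as stand-in chunks and a hypothetical `refine_texts` function. Using a deep copy for the non-inplace path is an assumption about how a fresh list could be produced.

```python
import copy
from typing import Dict, List


def refine_texts(
    chunks: List[Dict[str, str]], suffix: str, inplace: bool = True
) -> List[Dict[str, str]]:
    """Hypothetical refinement step that appends `suffix` to every chunk text."""
    # inplace=True works on the caller's list; inplace=False works on a copy.
    target = chunks if inplace else copy.deepcopy(chunks)
    for chunk in target:
        chunk["text"] += suffix
    return target


original = [{"text": "a"}, {"text": "b"}]

copied = refine_texts(original, "!", inplace=False)
print(original[0]["text"], copied[0]["text"])  # a a!

same = refine_texts(original, "!", inplace=True)
print(same is original, original[0]["text"])   # True a!
```

With inplace=True, any other reference to the input list sees the refined chunks, so pass inplace=False when the unrefined chunks are still needed.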
