The OverlapRefinery enhances chunks by incorporating context from neighboring chunks. This is useful for tasks where maintaining contextual continuity between chunks is important, such as question answering or summarization over long documents. It can add context as a prefix (from the preceding chunk) or a suffix (from the next chunk).

API Reference

To use the OverlapRefinery via the API, check out the API reference documentation.

Initialization

To use the OverlapRefinery, initialize it with the desired parameters. You can specify a tokenizer, context size, overlap mode, method, and other options.

from chonkie import OverlapRefinery

# Initialize with default character-level overlap (25% context size)
overlap_refinery = OverlapRefinery()

# Initialize with a specific tokenizer and context size
overlap_refinery_token = OverlapRefinery(
    tokenizer_or_token_counter="gpt2", # Can be string identifier, callable, or Tokenizer instance
    context_size=0.25,                 # Context size as a fraction of the max chunk token count
    method="prefix",                   # Add context from the previous chunk
    merge=True                         # Merge context directly into chunk text
)

# Initialize for recursive overlap based on rules
from chonkie import RecursiveRules, RecursiveLevel
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"], include_delim="prev"),
        RecursiveLevel(delimiters=["."], include_delim="prev"),
        RecursiveLevel(whitespace=True)
    ]
)
overlap_refinery_recursive = OverlapRefinery(
    tokenizer_or_token_counter="gpt2",
    context_size=0.25,
    mode="recursive",
    rules=rules,
    method="suffix"
)

Usage

Call the OverlapRefinery object directly, or use its refine method, to add overlapping context to your chunks.

from chonkie import TokenChunker, OverlapRefinery

test_string = "This is the first sentence. This is the second sentence, providing context. This is the third sentence, which needs context from the second."
chunker = TokenChunker()
chunks = chunker(test_string)

# Initialize refinery to add suffix overlap
overlap_refinery = OverlapRefinery(
    tokenizer_or_token_counter="gpt2",
    context_size=0.5,
    method="suffix",
    merge=True
)

refined_chunks = overlap_refinery(chunks)
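
With merge=True, the suffix context is appended directly to each chunk's text. A quick way to inspect the result (a minimal sketch, assuming the standard Chunk fields text and token_count):

# Print each refined chunk to see the appended suffix context
for chunk in refined_chunks:
    print(f"[{chunk.token_count} tokens] {chunk.text}")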

Parameters

tokenizer_or_token_counter
Union[str, Callable, Any]
default: "character"

The tokenizer or token counter to use for calculating overlap size. Can be a string identifier (e.g., “gpt2”), a callable, or a chonkie.Tokenizer instance. Defaults to character counting.
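
Besides a string identifier like "gpt2", any callable that maps text to a token count works. A minimal sketch (the word-counting function below is purely illustrative):

from chonkie import OverlapRefinery

# A simple callable token counter: counts whitespace-separated words
def word_counter(text: str) -> int:
    return len(text.split())

refinery = OverlapRefinery(
    tokenizer_or_token_counter=word_counter,  # callable instead of a tokenizer name
    context_size=16,                          # absolute "token" (word) budget for the context
)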

context_size
Union[int, float]
default: 0.25

The size of the overlap context. If an int, it’s the absolute number of tokens. If a float (between 0 and 1), it’s the fraction of the maximum chunk token count.
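
For example, the refinery can be configured with either an absolute or a relative context size (a short sketch):

from chonkie import OverlapRefinery

# Absolute: add up to 32 tokens of context per chunk
refinery_absolute = OverlapRefinery(tokenizer_or_token_counter="gpt2", context_size=32)

# Relative: context sized at 10% of the largest chunk's token count
refinery_relative = OverlapRefinery(tokenizer_or_token_counter="gpt2", context_size=0.1)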

mode
Literal["token", "recursive"]
default: "token"

The mode for calculating overlap. "token" uses the tokenizer directly. "recursive" uses hierarchical splitting based on rules.

method
Literal["suffix", "prefix"]
default: "suffix"

The method for adding context. "suffix" adds context from the next chunk to the end of the current chunk. "prefix" adds context from the previous chunk to the beginning of the current chunk.

rules
RecursiveRules
default: RecursiveRules()

The rules used for splitting text when mode is "recursive". Defines delimiters and behavior at different hierarchical levels. See chonkie.types.RecursiveRules.

merge
bool
default: True

If True, the calculated context is prepended (for "prefix") or appended (for "suffix") directly to chunk.text. If False, the context is stored in the chunk.context attribute and chunk.text is left unchanged.
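
With merge=False, chunk.text stays unchanged and the context can be read separately. A minimal sketch based on the description above, continuing from the chunks produced in the Usage section:

from chonkie import OverlapRefinery

refinery = OverlapRefinery(
    tokenizer_or_token_counter="gpt2",
    context_size=0.25,
    method="suffix",
    merge=False,  # keep chunk.text unchanged; store context separately
)
refined = refinery(chunks)
for chunk in refined:
    print(chunk.text)     # original chunk text, unmodified
    print(chunk.context)  # suffix context taken from the next chunk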

inplace
bool
default: True

If True, modifies the input list of chunks directly. If False, returns a new list of modified chunks.
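
To leave the original chunk list untouched, pass inplace=False. A short sketch based on the behavior described above, again reusing the chunks from the Usage section:

from chonkie import OverlapRefinery

refinery = OverlapRefinery(context_size=0.25, inplace=False)
refined = refinery(chunks)  # chunks is left as-is; refined holds the modified copies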