OverlapRefinery
Refine chunks by adding overlapping context from adjacent chunks.
The OverlapRefinery enhances chunks by incorporating context from neighboring chunks. This is useful for tasks where contextual continuity between chunks matters, such as question answering or summarization over long documents. It can add context as a prefix (from the preceding chunk) or as a suffix (from the next chunk).
API Reference
To use the OverlapRefinery via the API, check out the API reference documentation.
Initialization
To use the OverlapRefinery, initialize it with the desired parameters. You can specify a tokenizer, context size, overlap mode, method, and other options.
Usage
Use the OverlapRefinery object as a callable, or call its refine method, to add overlapping context to your chunks.
Parameters
The tokenizer or token counter to use for calculating overlap size. Can be a string identifier (e.g., "gpt2"), a callable, or a chonkie.Tokenizer instance. Defaults to character counting.
The size of the overlap context. If an int, it is the absolute number of tokens. If a float (between 0 and 1), it is the fraction of the maximum chunk token count.
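The int/float rule above can be sketched as a small helper. This is only an illustration, not the library's code: the name resolve_context_size is hypothetical, and rounding the fractional result down and accepting exactly 1.0 are assumptions.

```python
from typing import Sequence, Union


def resolve_context_size(
    context_size: Union[int, float], token_counts: Sequence[int]
) -> int:
    """Hypothetical helper: turn a context size into an absolute token count."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(context_size, bool):
        raise TypeError("context_size must be an int or a float")
    if isinstance(context_size, int):
        # An int is already an absolute number of tokens.
        if context_size <= 0:
            raise ValueError("an int context_size must be positive")
        return context_size
    # A float is a fraction of the largest chunk's token count.
    if not 0 < context_size <= 1:
        raise ValueError("a float context_size must lie between 0 and 1")
    largest = max(token_counts, default=0)
    return int(context_size * largest)  # assumption: round down
```

For example, with chunks of 100, 80, and 120 tokens, a context size of 0.25 resolves to 30 tokens (a quarter of the largest chunk), while a context size of 16 stays 16.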
The mode for calculating overlap. "token" uses the tokenizer directly. "recursive" uses hierarchical splitting based on rules.
The method for adding context. "suffix" adds context from the next chunk to the end of the current chunk. "prefix" adds context from the previous chunk to the beginning of the current chunk.
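To make the two methods concrete, here is a minimal sketch that computes the context each chunk would receive. It is not the library's implementation: overlap_context is a hypothetical function, and whitespace-separated words stand in for real tokens.

```python
from typing import List


def overlap_context(texts: List[str], size: int, method: str = "suffix") -> List[str]:
    """Hypothetical sketch: context per chunk, using words as stand-in tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    contexts = []
    for i in range(len(texts)):
        if method == "prefix":
            # Take the last `size` words of the previous chunk, if there is one.
            words = texts[i - 1].split()[-size:] if i > 0 else []
        elif method == "suffix":
            # Take the first `size` words of the next chunk, if there is one.
            words = texts[i + 1].split()[:size] if i + 1 < len(texts) else []
        else:
            raise ValueError("method must be 'prefix' or 'suffix'")
        contexts.append(" ".join(words))
    return contexts
```

In this sketch the first chunk has no prefix context and the last chunk has no suffix context, since neither has a neighbor on that side.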
The rules used for splitting text when mode is "recursive". Defines delimiters and behavior at different hierarchical levels. See chonkie.types.RecursiveRules.
If True, the calculated context is directly prepended (for prefix) or appended (for suffix) to chunk.text. If False, the context is stored in the chunk.context attribute without modifying chunk.text.
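The difference between the two settings can be sketched with a stand-in chunk type. SketchChunk and attach_context are hypothetical names used only for illustration; whether the library also records the context on the chunk when merging is not stated here, so the sketch leaves it unset in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a chunk with text and context attributes."""
    text: str
    context: Optional[str] = None


def attach_context(
    chunk: SketchChunk, context: str, method: str = "suffix", merge: bool = True
) -> SketchChunk:
    """Attach context either by merging into text or by storing it separately."""
    if merge:
        if method == "prefix":
            chunk.text = context + chunk.text  # prepend the context
        else:
            chunk.text = chunk.text + context  # append the context
    else:
        chunk.context = context  # keep the text untouched
    return chunk
```

Storing the context separately is the safer choice when downstream code depends on the original chunk text, for example for offsets or exact-match lookups.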
If True, modifies the input list of chunks directly. If False, returns a new list of modified chunks.
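The in-place versus copy behavior can be illustrated with plain dictionaries standing in for chunks. This is a sketch under assumptions: refine_sketch is a hypothetical function, the copy is made with copy.deepcopy, and returning the input list in the in-place case is assumed rather than documented.

```python
import copy
from typing import List


def refine_sketch(chunks: List[dict], inplace: bool = True) -> List[dict]:
    """Hypothetical sketch: mark chunks as refined, in place or on a copy."""
    # Work on the caller's list, or on an independent deep copy of it.
    target = chunks if inplace else copy.deepcopy(chunks)
    for chunk in target:
        chunk["refined"] = True  # placeholder for the real refinement step
    return target
```

With the copy, the caller's original chunks stay unchanged, at the cost of duplicating them in memory.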