RecursiveChunker and uses document-level embeddings to create more semantically rich chunk representations.
Instead of generating embeddings for each chunk independently, the LateChunker first encodes the entire text into a single embedding.
It then splits the text using recursive rules and derives each chunk’s embedding by averaging relevant parts of the
full document embedding. This allows each chunk to carry broader contextual information,
which can improve retrieval performance in RAG systems.
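The pooling step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the LateChunker implementation: it assumes the full-document pass yields one vector per token (the toy `token_embeddings` values are invented) and that each chunk is identified by a hypothetical `(start, end)` token span. The names `mean_pool` and `late_chunk_embeddings` are illustrative and are not part of the library's API.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def late_chunk_embeddings(
    token_embeddings: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Derive one embedding per chunk from document-level token embeddings.

    Each span is a half-open (start, end) range of token positions.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Toy stand-in for the output of a single full-document encoding pass:
# four tokens, two dimensions each (values invented for illustration).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]

# Hypothetical chunk boundaries, as a recursive splitter might produce them.
spans = [(0, 2), (2, 4)]

chunk_embeddings = late_chunk_embeddings(token_embeddings, spans)
print(chunk_embeddings)
```

Because every token vector is computed with the whole document in view, each pooled chunk vector reflects context from outside its own span, which is the point of chunking "late".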
API Reference
To use the LateChunker via the API, check out the API reference documentation.
Installation
LateChunker requires the sentence-transformers library to be installed, and currently supports only SentenceTransformer models.
You can install it with:
The LateChunker uses RecursiveRules to determine how to chunk the text.
The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
Find more information about the rules in the Additional Information section.
For installation instructions, see the Installation Guide.
Initialization
Parameters
SentenceTransformer model to use for embedding
Maximum number of tokens per chunk
Rules to use for chunking
Minimum number of characters per sentence
Usage
Single Text Chunking
Batch Chunking
Return Type
LateChunker returns chunks as Chunk objects:
As of version 1.3.0, LateChunker returns the base Chunk type instead of the
specialized LateChunk type. The embedding is automatically populated by the
LateChunker during the chunking process.
Additional Information
LateChunker uses the RecursiveRules class to determine the chunking rules.
The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
RecursiveLevel expects the list of custom delimiters to not include
whitespace. If whitespace as a delimiter is required, you can set the
whitespace parameter in the RecursiveLevel class to True. Note that if
whitespace = True, you cannot pass a list of custom delimiters.
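The way levels of delimiters drive a recursive split can be sketched as follows. This is a simplified illustration, not the library's code: `Level`, `split_at_level`, and `recursive_split` are hypothetical names standing in for RecursiveLevel and the chunker's internals, the size budget is counted in characters rather than tokens, and the "no whitespace in custom delimiters" rule is interpreted here as rejecting a bare space.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Level:
    """Hypothetical stand-in for one level of the recursive rules."""

    delimiters: Optional[List[str]] = None
    whitespace: bool = False

    def __post_init__(self) -> None:
        # Whitespace splitting and custom delimiters are mutually exclusive.
        if self.whitespace and self.delimiters:
            raise ValueError("whitespace=True cannot be combined with custom delimiters")
        # Assumption: "whitespace" as a custom delimiter means a bare space.
        if self.delimiters and " " in self.delimiters:
            raise ValueError("use whitespace=True instead of a ' ' delimiter")


def split_at_level(text: str, level: Level) -> List[str]:
    """Split text at one level, keeping each delimiter with the piece before it."""
    if level.whitespace:
        return text.split()
    pieces = [text]
    for delim in level.delimiters or []:
        next_pieces: List[str] = []
        for piece in pieces:
            parts = piece.split(delim)
            # Re-attach the delimiter to every part except the last.
            next_pieces.extend(p + delim for p in parts[:-1])
            next_pieces.append(parts[-1])
        pieces = next_pieces
    # Drop empty pieces and trim surrounding whitespace.
    return [p.strip() for p in pieces if p.strip()]


def recursive_split(text: str, levels: Sequence[Level], max_chars: int) -> List[str]:
    """Split with the first level, recursing into later levels for oversized pieces."""
    if len(text) <= max_chars or not levels:
        return [text]
    chunks: List[str] = []
    for piece in split_at_level(text, levels[0]):
        chunks.extend(recursive_split(piece, levels[1:], max_chars))
    return chunks


# Paragraphs first, then sentences, then whitespace as the last resort.
levels = [Level(delimiters=["\n\n"]), Level(delimiters=[". "]), Level(whitespace=True)]
text = "First paragraph. It has two sentences.\n\nSecond paragraph."
chunks = recursive_split(text, levels, max_chars=25)
print(chunks)
```

In this sketch the second paragraph already fits the budget and is kept whole, while the first is passed down to the sentence level. A real chunker would typically also merge small neighbouring pieces back together up to the size limit, which is omitted here for brevity.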