LateChunker
Split text into chunks based on a late-bound token count
LateChunker is based on the paper Late Chunking, which uses a long-context embedding model to first embed the entire document, so that the whole text sits within the model's context window. It then splits the resulting token embeddings apart into chunks of a specified size, using either token chunking or sentence chunking.
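The idea can be illustrated with a minimal sketch. It assumes the token embeddings for the whole document are already available as plain lists of floats, uses fixed-size token spans, and pools each span by averaging. The helper names (mean_pool, token_spans, late_chunk) are hypothetical and are not part of the library.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def token_spans(num_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Cover [0, num_tokens) with consecutive spans of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


def late_chunk(token_embeddings: List[Vector], chunk_size: int) -> List[Vector]:
    """Pool document-level token embeddings into one vector per chunk.

    The embeddings are assumed to come from a single pass over the full
    document, so each token vector already reflects document-wide context.
    """
    spans = token_spans(len(token_embeddings), chunk_size)
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]


# Toy token embeddings standing in for the output of a long-context model.
toy_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
chunk_vectors = late_chunk(toy_embeddings, chunk_size=2)
print(chunk_vectors)
```

The point of the ordering is that the split happens after embedding, so every chunk vector is built from token vectors that have seen the rest of the document.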
Installation
LateChunker requires the sentence-transformers library to be installed, and currently only supports SentenceTransformer models.
You can install it with pip, for example: pip install sentence-transformers
Initialization
Parameters
SentenceTransformer model to use for embedding
Maximum number of tokens per chunk
Rules to use for chunking
Minimum number of characters per sentence
Usage
Single Text Chunking
Batch Chunking
Return Type
LateChunker returns LateChunk objects, which use slots for optimized storage.
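To show what slot-based storage means in practice, here is a small sketch of a slotted record. The class name LateChunkSketch and its field names are assumptions made for illustration. They are not taken from the library's actual LateChunk definition.

```python
from typing import List, Optional


class LateChunkSketch:
    """Hypothetical chunk record; the field names are illustrative assumptions."""

    # __slots__ replaces the per-instance __dict__ with fixed attribute storage,
    # which lowers memory use when many chunk objects are created.
    __slots__ = ("text", "start_index", "end_index", "token_count", "embedding")

    def __init__(
        self,
        text: str,
        start_index: int,
        end_index: int,
        token_count: int,
        embedding: Optional[List[float]] = None,
    ) -> None:
        self.text = text
        self.start_index = start_index
        self.end_index = end_index
        self.token_count = token_count
        self.embedding = embedding

    def __repr__(self) -> str:
        return (
            f"LateChunkSketch(text={self.text!r}, start_index={self.start_index}, "
            f"end_index={self.end_index}, token_count={self.token_count})"
        )


chunk = LateChunkSketch("Hello world", 0, 11, 2, embedding=[0.1, 0.2])
print(chunk)
```

A side effect of slots is that attributes outside the declared set cannot be assigned, so a misspelled attribute name fails immediately with an AttributeError instead of silently creating a new field.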
Additional Information
LateChunker uses the RecursiveRules class to determine the chunking rules. The rules are a list of RecursiveLevel objects, which define the delimiters and whitespace rules for each level of the recursive tree.
You can pass custom rules to the LateChunker or use the default ones. The default rules are designed as a good starting point for most documents, and you can customize them to suit your needs.
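The level-by-level behaviour can be sketched in a few lines. This is an illustration under simplifying assumptions: levels are plain lists of delimiter strings, size is measured in characters instead of tokens, and small pieces are not merged back together. The function names are hypothetical and do not reflect the library's implementation.

```python
from typing import List, Sequence


def split_keep_delimiter(text: str, delimiters: Sequence[str]) -> List[str]:
    """Split text after each delimiter occurrence, keeping the delimiter attached."""
    usable = [d for d in delimiters if d]  # ignore empty strings to avoid looping forever
    pieces: List[str] = []
    current = ""
    i = 0
    while i < len(text):
        for d in usable:
            if text.startswith(d, i):
                current += d
                i += len(d)
                pieces.append(current)
                current = ""
                break
        else:
            # No delimiter starts here, so consume one character.
            current += text[i]
            i += 1
    if current:
        pieces.append(current)
    return pieces


def recursive_split(
    text: str, levels: Sequence[Sequence[str]], max_chars: int, depth: int = 0
) -> List[str]:
    """Descend one level at a time until a piece fits or the levels run out."""
    if len(text) <= max_chars or depth >= len(levels):
        return [text]
    result: List[str] = []
    for piece in split_keep_delimiter(text, levels[depth]):
        result.extend(recursive_split(piece, levels, max_chars, depth + 1))
    return result


# Level 0 splits on paragraph breaks, level 1 on sentence endings.
levels = [["\n\n"], [". ", "! ", "? "]]
text = "One. Two is longer.\n\nThree."
pieces = recursive_split(text, levels, max_chars=12)
print(pieces)
```

Because delimiters stay attached to the piece that precedes them, concatenating the pieces reproduces the original text exactly. A piece that is still too long after the last level is returned unchanged in this sketch.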
RecursiveLevel expects the list of custom delimiters to not include whitespace. If whitespace is required as a delimiter, you can set the whitespace parameter of the RecursiveLevel class to True.
Note that if whitespace = True, you cannot pass a list of custom delimiters.
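The two constraints above can be expressed as a small validation sketch. LevelSketch is a hypothetical stand-in for RecursiveLevel. The exception type is an assumption, as is the reading of "whitespace" as a delimiter equal to a single space, and the library's own checks may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LevelSketch:
    """Hypothetical stand-in for one level of the chunking rules."""

    delimiters: Optional[List[str]] = None
    whitespace: bool = False

    def __post_init__(self) -> None:
        # Constraint 1: whitespace splitting and custom delimiters are exclusive.
        if self.whitespace and self.delimiters:
            raise ValueError(
                "whitespace=True cannot be combined with custom delimiters"
            )
        # Constraint 2 (assumed interpretation): a bare space is not a valid
        # custom delimiter; the whitespace flag should be used instead.
        if self.delimiters and any(d == " " for d in self.delimiters):
            raise ValueError(
                "custom delimiters must not be whitespace; set whitespace=True"
            )


sentence_level = LevelSketch(delimiters=[". ", "! ", "? "])
word_level = LevelSketch(whitespace=True)
print(sentence_level, word_level)
```

Validating at construction time means an invalid combination is reported where the rules are defined, not later when a document is being chunked.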