LateChunker
Split text into chunks based on a late-bound token count
LateChunker is based on the Late Chunking paper. It first uses a long-context embedding model to embed the entire document, which must fit within the model's context window. It then splits the resulting embeddings into chunks of a specified size, using either token chunking or sentence chunking.
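The pooling step described above can be illustrated with a minimal sketch. It assumes the per-token embeddings of the whole document are already available and that each chunk is represented by the mean of its token embeddings; the helper name `late_pool` and the toy vectors are hypothetical and are not part of the library.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_pool(
    token_embeddings: Sequence[Vector],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings over each half-open (start, end) token span.

    Hypothetical helper for illustration only, not the library's API.
    """
    pooled: List[Vector] = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        # Average each embedding dimension across the tokens in the span.
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Toy document: 4 tokens with 2-dimensional embeddings, split into two chunks.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_embeddings = late_pool(token_embeddings, [(0, 2), (2, 4)])
print(chunk_embeddings)  # [[2.0, 1.0], [1.0, 2.0]]
```

Because every token embedding was computed with the full document in view, each pooled chunk vector carries context from outside its own span.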
Installation
LateChunker requires the sentence-transformers library to be installed, and currently only supports SentenceTransformer models.
You can install it with:
Initialization
Parameters
SentenceTransformer model to use for embedding
Mode to use for chunking. Can be “sentence” or “token”
Maximum number of tokens per chunk
Minimum number of sentences per chunk
Minimum number of characters per sentence
Whether to use approximate chunking
Delimiters to use for chunking
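To make the difference between the two chunking modes concrete, here is a rough sketch of how token spans might be computed in each mode. The function names and the `chunk_size` and `min_sentences` arguments are hypothetical stand-ins for the parameters described above; the library's actual logic may differ.

```python
from typing import List, Tuple


def token_spans(num_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """ "token" mode sketch: fixed-size windows over token indices."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


def sentence_spans(
    sentence_token_counts: List[int],
    chunk_size: int,
    min_sentences: int = 1,
) -> List[Tuple[int, int]]:
    """ "sentence" mode sketch: greedily group whole sentences per chunk.

    A chunk may exceed chunk_size when a single sentence is longer than the
    budget or when min_sentences has not been reached yet.
    """
    spans: List[Tuple[int, int]] = []
    start = 0  # token index where the current chunk begins
    size = 0   # tokens accumulated in the current chunk
    count = 0  # sentences accumulated in the current chunk
    for n in sentence_token_counts:
        # Close the chunk if this sentence would overflow the budget,
        # but only once the chunk holds the minimum number of sentences.
        if count >= min_sentences and size + n > chunk_size:
            spans.append((start, start + size))
            start += size
            size = 0
            count = 0
        size += n
        count += 1
    if count:
        spans.append((start, start + size))
    return spans


print(token_spans(10, 4))                # [(0, 4), (4, 8), (8, 10)]
print(sentence_spans([3, 4, 2, 5], 8))   # [(0, 7), (7, 14)]
```

Token mode gives uniform chunk sizes, while sentence mode keeps sentence boundaries intact at the cost of uneven chunk lengths.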
Usage
Single Text Chunking
Batch Chunking
Return Type
LateChunker returns LateChunk objects, which use slots for optimized storage.
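As a rough illustration of what slot-based storage means in Python, the sketch below defines a stand-in class with `__slots__`. The class name `LateChunkSketch` and all of its field names are assumptions made for this example; they are not taken from the library's actual `LateChunk` definition.

```python
from typing import List


class LateChunkSketch:
    """Hypothetical stand-in for a chunk record; field names are assumptions."""

    # Declaring __slots__ replaces the per-instance __dict__ with fixed storage.
    __slots__ = ("text", "start_index", "end_index", "token_count", "embedding")

    def __init__(
        self,
        text: str,
        start_index: int,
        end_index: int,
        token_count: int,
        embedding: List[float],
    ) -> None:
        self.text = text
        self.start_index = start_index
        self.end_index = end_index
        self.token_count = token_count
        self.embedding = embedding


chunk = LateChunkSketch("Hello world.", 0, 12, 3, [0.1, 0.2])
print(hasattr(chunk, "__dict__"))  # False

try:
    chunk.extra = 1  # not listed in __slots__, so assignment fails
except AttributeError:
    print("cannot add attributes that are not declared in __slots__")
```

Slots reduce per-object memory, which matters when a document produces many chunks, at the cost of not allowing ad hoc attributes.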