LateChunker is based on the paper Late Chunking, which uses a long-context embedding model to first embed the entire document in a single pass, so that every token embedding carries document-wide context. It then splits those embeddings apart into chunks of a specified size, using either token chunking or sentence chunking.
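The core idea can be sketched in a few lines of NumPy. The embedding matrix, dimensions, and chunk spans below are synthetic stand-ins, not Chonkie's internals: embed all tokens once, then pool contiguous token spans into chunk vectors.

```python
import numpy as np

# Stand-in for a long-context model's per-token embeddings:
# one 8-dim vector for each of 10 tokens. In real late chunking
# these come from a single forward pass over the full document,
# so each token vector already reflects document-wide context.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(10, 8))

# Chunk boundaries expressed as token spans (illustrative).
spans = [(0, 4), (4, 7), (7, 10)]

# Each chunk embedding is the mean of its tokens' embeddings.
chunk_embeddings = np.stack(
    [token_embeddings[start:end].mean(axis=0) for start, end in spans]
)

print(chunk_embeddings.shape)  # (3, 8): one vector per chunk
```

Because pooling happens after the full-document pass, a chunk's vector can differ from what embedding that chunk's text in isolation would produce.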

Installation

LateChunker requires the sentence-transformers library to be installed, and currently only supports SentenceTransformer models. You can install it with:

pip install "chonkie[st]"
For installation instructions, see the Installation Guide.

Initialization

from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="all-MiniLM-L6-v2",
    mode="sentence",
    chunk_size=512,
    min_sentences_per_chunk=1,
    min_characters_per_sentence=12,
)

Parameters

embedding_model (str, default: "all-MiniLM-L6-v2")
SentenceTransformer model to use for embedding.

mode (str, default: "sentence")
Mode to use for chunking. Can be "sentence" or "token".

chunk_size (int, default: 512)
Maximum number of tokens per chunk.

min_sentences_per_chunk (int, default: 1)
Minimum number of sentences per chunk.

min_characters_per_sentence (int, default: 12)
Minimum number of characters per sentence.

approximate (bool, default: True)
Whether to use approximate chunking.

delim (list[str], default: ['.', '!', '?', '\n'])
Delimiters to use for chunking.
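To illustrate how delim and min_characters_per_sentence interact, here is a hypothetical splitter — a simplified sketch, not Chonkie's actual implementation: split on the delimiters, then merge fragments shorter than the minimum into the preceding sentence.

```python
import re

# Illustrative only: split text on delimiters, then merge fragments
# shorter than min_chars into their preceding sentence.
delim = ['.', '!', '?', '\n']
min_chars = 12

text = "Hi. This is a longer sentence! Another one follows? Yes."

# Split while keeping each delimiter attached to its sentence.
pattern = "(" + "|".join(re.escape(d) for d in delim) + ")"
parts = re.split(pattern, text)
sentences = []
for i in range(0, len(parts) - 1, 2):
    sentences.append((parts[i] + parts[i + 1]).strip())
if parts[-1].strip():
    sentences.append(parts[-1].strip())

# Merge too-short fragments into the previous sentence.
merged = []
for s in sentences:
    if merged and len(s) < min_chars:
        merged[-1] += " " + s
    else:
        merged.append(s)

print(merged)
# ['Hi.', 'This is a longer sentence!', 'Another one follows? Yes.']
```

Here the trailing "Yes." falls below the 12-character minimum and is merged into the sentence before it.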

Usage

Single Text Chunking

text = """First paragraph about a specific topic.
Second paragraph continuing the same topic.
Third paragraph switching to a different topic.
Fourth paragraph expanding on the new topic."""

chunks = chunker(text)

for chunk in chunks:
    print(f"Chunk text: {chunk.text}")
    print(f"Token count: {chunk.token_count}")
    print(f"Number of sentences: {len(chunk.sentences)}")

Batch Chunking

texts = [
    "First document about topic A...",
    "Second document about topic B..."
]

batch_chunks = chunker(texts)

for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(f"Chunk text: {chunk.text}")
        print(f"Token count: {chunk.token_count}")
        print(f"Number of sentences: {len(chunk.sentences)}")

Return Type

LateChunker returns LateChunk objects with optimized storage using slots:

@dataclass
class LateChunk(SentenceChunk):
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[Sentence]
    embedding: Optional[np.ndarray]  # Embedding vector for the whole chunk
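The embedding attached to each chunk is what you would typically feed into a vector index. A minimal cosine-similarity ranking, with synthetic 2-dim vectors standing in for LateChunk.embedding values, looks like:

```python
import numpy as np

# Synthetic stand-ins for LateChunk.embedding vectors (2-dim for clarity).
chunk_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([0.9, 1.1])  # points in nearly the same direction as chunk 2

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, v) for v in chunk_vecs]
best = int(np.argmax(scores))
print(best)  # 2 -- the chunk most aligned with the query
```

In practice you would compute query_vec with the same embedding model used by the chunker so that the vectors live in the same space.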