# Semantic Chunker

> Split text into chunks based on semantic similarity with advanced features

The `SemanticChunker` splits text into chunks based on semantic similarity, so related content stays together in the same chunk. It also supports Savitzky-Golay filtering for smoother boundary detection and skip-window merging for connecting related content that is not consecutive. The chunker is inspired by the work of [Greg Kamradt](https://github.com/gkamradt).

## API Reference

To use the `SemanticChunker` via the API, check out the [API reference documentation](../../api/chunkers/semantic-chunker).

## Installation

`SemanticChunker` requires additional dependencies for its semantic capabilities. Install them with:

<CodeGroup>
  ```bash Python theme={"system"}
  pip install "chonkie[semantic]"
  ```

  ```bash JavaScript theme={"system"}
  npm install @chonkiejs/core
  ```
</CodeGroup>

<Info>For installation instructions, see the [Installation Guide](/oss/installation).</Info>

## Initialization

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Basic initialization with default parameters
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",  # Default model
        threshold=0.8,                               # Similarity threshold (0-1)
        chunk_size=2048,                             # Maximum tokens per chunk
        similarity_window=3,                         # Window for similarity calculation
        skip_window=0                                # Skip-and-merge window (0=disabled)
    )

    # With skip-and-merge enabled (similar to legacy SDPM behavior)
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,
        chunk_size=2048,
        skip_window=1  # Enable merging of similar non-consecutive groups
    )
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={"system"}
    import { SemanticChunker } from "@chonkiejs/core";

    // Basic initialization with custom embedding function
    const embedFn = async (texts) => {
      // Your embedding logic here
      // Return array of embeddings for each text
    };

    const chunker = await SemanticChunker.create({
      embedFunction: embedFn,      // Custom embedding function
      threshold: 0.8,              // Similarity threshold (0-1)
      chunkSize: 2048,             // Maximum tokens per chunk
      similarityWindow: 3,         // Window for similarity calculation
      skipWindow: 0                // Skip-and-merge window (0=disabled)
    });
    ```
  </Tab>
</Tabs>

## Parameters

<ParamField path="embedding_model" type="Union[str, BaseEmbeddings]" default="minishlab/potion-base-32M">
  Model identifier or embedding model instance
</ParamField>

<ParamField path="threshold" type="float" default="0.8">
  Similarity threshold for grouping sentences (0-1). Lower values group more sentences together and produce larger chunks; higher values produce smaller, more focused chunks.
</ParamField>

<ParamField path="chunk_size" type="int" default="2048">
  Maximum tokens per chunk
</ParamField>

<ParamField path="similarity_window" type="int" default="3">
  Number of sentences to consider for similarity calculation
</ParamField>

<ParamField path="min_sentences_per_chunk" type="int" default="1">
  Minimum number of sentences per chunk
</ParamField>

<ParamField path="min_characters_per_sentence" type="int" default="24">
  Minimum number of characters per sentence
</ParamField>

<ParamField path="skip_window" type="int" default="0">
  Number of groups to skip when looking for similar content to merge.

  * `0` (default): No skip-and-merge, uses standard semantic grouping
  * `1` or higher: Enables merging of semantically similar groups within the skip window

  This feature allows the chunker to connect related content that may not be consecutive in the text.
</ParamField>

<ParamField path="filter_window" type="int" default="5">
  Window length for the Savitzky-Golay filter used in boundary detection
</ParamField>

<ParamField path="filter_polyorder" type="int" default="3">
  Polynomial order for the Savitzky-Golay filter
</ParamField>

<ParamField path="filter_tolerance" type="float" default="0.2">
  Tolerance for the Savitzky-Golay filter boundary detection
</ParamField>

<ParamField path="delim" type="Union[str, list[str]]" default="[&#x22;. &#x22;, &#x22;! &#x22;, &#x22;? &#x22;, &#x22;\n&#x22;]">
  Delimiters to split sentences on
</ParamField>

<ParamField path="include_delim" type="Optional[Literal[&#x22;prev&#x22;, &#x22;next&#x22;]]" default="prev">
  Include delimiters in the chunk text. Specify whether to include with the previous or next sentence.
</ParamField>
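
To make the `delim` and `include_delim` options more concrete, here is a minimal sketch of how a delimiter can be attached to the previous or the next sentence. The `split_sentences` helper is hypothetical and written for illustration only; it is not Chonkie's own splitter and ignores options such as `min_characters_per_sentence`.

```python
def split_sentences(text, delims, include_delim="prev"):
    """Split text on any delimiter, attaching it to the previous or next piece."""
    pieces, current, i = [], "", 0
    while i < len(text):
        # Find a delimiter that starts at position i, if any
        match = next((d for d in delims if text.startswith(d, i)), None)
        if match is None:
            current += text[i]
            i += 1
            continue
        if include_delim == "prev":
            pieces.append(current + match)  # delimiter closes the current piece
            current = ""
        elif include_delim == "next":
            pieces.append(current)
            current = match                 # delimiter opens the next piece
        else:
            pieces.append(current)          # None: drop the delimiter
            current = ""
        i += len(match)
    if current:
        pieces.append(current)
    return [p for p in pieces if p]


text = "Hello there. How are you? Fine."
delims = [". ", "! ", "? ", "\n"]

prev_pieces = split_sentences(text, delims, include_delim="prev")
next_pieces = split_sentences(text, delims, include_delim="next")
print(prev_pieces)
print(next_pieces)
```

With `"prev"` or `"next"`, joining the pieces reproduces the original text, which is what keeps character offsets meaningful after splitting.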

## Basic Usage

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Initialize with semantic similarity grouping
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,  # Similarity threshold
        chunk_size=512
    )

    text = """Your document text with multiple topics and themes..."""
    chunks = chunker.chunk(text)

    # Process chunks
    for chunk in chunks:
        print(f"Chunk: {chunk.text[:50]}...")
        print(f"Tokens: {chunk.token_count}")
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={"system"}
    import { SemanticChunker } from "@chonkiejs/core";

    // Define custom embedding function
    const embedFn = async (texts) => {
      // Your embedding logic here (e.g., call to an API)
      // Return array of embeddings for each text
    };

    // Initialize with semantic similarity grouping
    const chunker = await SemanticChunker.create({
      embedFunction: embedFn,
      threshold: 0.7,  // Similarity threshold
      chunkSize: 512
    });

    const text = "Your document text with multiple topics and themes...";
    const chunks = await chunker.chunk(text);

    // Process chunks
    for (const chunk of chunks) {
      console.log(`Chunk: ${chunk.text.slice(0, 50)}...`);
      console.log(`Tokens: ${chunk.tokenCount}`);
    }
    ```
  </Tab>
</Tabs>

## Examples

<AccordionGroup>
  <Accordion title="Basic Semantic Chunking">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    text = """Artificial intelligence is transforming industries worldwide. 
    Machine learning algorithms can now process vast amounts of data efficiently.
    Deep learning models have achieved remarkable accuracy in complex tasks.

    Climate change poses significant challenges to our planet.
    Rising temperatures affect ecosystems and biodiversity globally.
    Sustainable practices are essential for environmental preservation.

    Quantum computing represents a paradigm shift in computation.
    These systems leverage quantum mechanical phenomena for processing.
    Potential applications include cryptography and drug discovery."""

    # Create semantic chunker
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.75,  # Higher threshold = stricter grouping, smaller and more focused chunks
        chunk_size=1024
    )

    chunks = chunker.chunk(text)

    # Analyze semantic groupings
    for i, chunk in enumerate(chunks):
        print(f"\n--- Semantic Group {i+1} ---")
        print(f"Content: {chunk.text[:100]}...")
        print(f"Token count: {chunk.token_count}")
        print(f"Theme: {chunk.text.split('.')[0]}")  # First sentence as theme indicator
    ```
  </Accordion>

  <Accordion title="Skip-Window Merging">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Text with alternating topics
    text = """Neural networks process information through interconnected nodes.
    The stock market experienced significant volatility this quarter.
    Deep learning models require substantial training data for optimization.
    Economic indicators point to potential recession risks ahead.
    GPU acceleration has revolutionized machine learning computations.
    Federal reserve policies impact global financial markets.
    Transformer architectures dominate modern NLP applications.
    Cryptocurrency markets show correlation with traditional assets."""

    # Enable skip-window to merge non-consecutive similar content
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.65,
        chunk_size=512,
        skip_window=2  # Look ahead 2 groups for similar content
    )

    chunks = chunker.chunk(text)

    # AI-related content will be grouped together
    # Financial content will be grouped separately
    for i, chunk in enumerate(chunks):
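        # Rough sentence count: split on periods and ignore empty fragments
        sentence_count = len([s for s in chunk.text.split('.') if s.strip()])
        print(f"\nGroup {i+1}: {sentence_count} sentences")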
        print(f"Preview: {chunk.text[:80]}...")
    ```
  </Accordion>

  <Accordion title="Fine-tuned Similarity Control">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    text = """Your comprehensive document with various topics..."""

    # Experiment with different thresholds
    thresholds = [0.5, 0.7, 0.9]

    for threshold in thresholds:
        chunker = SemanticChunker(
            embedding_model="minishlab/potion-base-32M",
            threshold=threshold,
            chunk_size=512,
            similarity_window=3  # Consider 3 sentences for similarity
        )
        
        chunks = chunker.chunk(text)
        print(f"\nThreshold {threshold}: {len(chunks)} chunks created")
        
        # Lower threshold = larger, more diverse chunks
        # Higher threshold = smaller, more focused chunks
        avg_size = sum(c.token_count for c in chunks) / len(chunks) if chunks else 0.0
        print(f"Average chunk size: {avg_size:.1f} tokens")
    ```
  </Accordion>

  <Accordion title="Batch Document Processing">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Initialize chunker once
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,
        chunk_size=1024,
        min_sentences_per_chunk=2  # Ensure meaningful chunks
    )

    # Multiple documents with different topics
    documents = [
        """Document about artificial intelligence and machine learning...""",
        """Document about climate change and environmental science...""",
        """Document about quantum computing and physics..."""
    ]

    # Process all documents
    batch_results = chunker.chunk_batch(documents)

    # Analyze results
    for doc_idx, chunks in enumerate(batch_results):
        print(f"\nDocument {doc_idx + 1}:")
        print(f"  Total chunks: {len(chunks)}")
        print(f"  Total tokens: {sum(c.token_count for c in chunks)}")
        
        # Show semantic boundaries
        for i, chunk in enumerate(chunks):
            first_sentence = chunk.text.split('.')[0]
            print(f"  Chunk {i+1}: {first_sentence[:50]}...")
    ```
  </Accordion>

  <Accordion title="Custom Embeddings Integration">
    ```python theme={"system"}
    from chonkie import SemanticChunker
    from chonkie.embeddings import AutoEmbeddings

    # Use AutoEmbeddings for automatic model selection
    embeddings = AutoEmbeddings.get_embeddings(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )

    chunker = SemanticChunker(
        embedding_model=embeddings,
        threshold=0.8,
        chunk_size=512
    )

    # Or use specific embedding providers
    from chonkie.embeddings import OpenAIEmbeddings

    openai_embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002"
    )

    chunker = SemanticChunker(
        embedding_model=openai_embeddings,
        threshold=0.75,
        chunk_size=1024
    )

    text = "Your text to chunk with custom embeddings..."
    chunks = chunker.chunk(text)
    ```
  </Accordion>

  <Accordion title="Advanced Filtering Options">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Configure Savitzky-Golay filter for smoother boundaries
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,
        chunk_size=512,
        filter_window=7,      # Larger window for smoother filtering
        filter_polyorder=4,   # Higher order polynomial
        filter_tolerance=0.15 # Stricter boundary detection
    )

    text = """Complex document with subtle topic transitions..."""
    chunks = chunker.chunk(text)

    # The filtering helps identify more natural semantic boundaries
    # especially in documents with gradual topic shifts
    for chunk in chunks:
        print(f"Smooth boundary chunk: {chunk.text[:60]}...")
    ```
  </Accordion>

  <Accordion title="Sentence Configuration">
    ```python theme={"system"}
    from chonkie import SemanticChunker

    # Customize sentence detection
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,
        chunk_size=1024,
        min_sentences_per_chunk=3,   # At least 3 sentences per chunk
        min_characters_per_sentence=30,  # Filter out short fragments
        delim=[". ", "! ", "? ", "\n\n"],  # Custom sentence delimiters
        include_delim="prev"  # Include delimiter with previous sentence
    )

    # Text with various sentence structures
    text = """Short sentence. This is a much longer sentence with more detail.
    Question here? Exclamation point! New paragraph starts here.

    Another paragraph with different content..."""

    chunks = chunker.chunk(text)

    for chunk in chunks:
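        # Approximate count: splits on ". " only and ignores empty fragments
        sentences = [s for s in chunk.text.split('. ') if s.strip()]
        print(f"Chunk with about {len(sentences)} sentences")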
    ```
  </Accordion>

  <Accordion title="RAG Pipeline Integration">
    ```python theme={"system"}
    from chonkie import SemanticChunker
    from chonkie.refinery import OverlapRefinery, EmbeddingsRefinery

    # Create semantic chunker
    chunker = SemanticChunker(
        embedding_model="minishlab/potion-base-32M",
        threshold=0.7,
        chunk_size=512
    )

    # Add refineries for RAG optimization
    overlap_refinery = OverlapRefinery(context_size=50)
    embeddings_refinery = EmbeddingsRefinery(
        embedding_model="minishlab/potion-base-32M"
    )

    # Process document
    text = """Your document for RAG system..."""
    chunks = chunker.chunk(text)

    # Apply refinements
    chunks = overlap_refinery.refine(chunks)
    chunks = embeddings_refinery.refine(chunks)  # Add embeddings

    # Ready for vector database
    for chunk in chunks:
        print(f"Chunk ready for indexing: {chunk.text[:50]}...")
        if chunk.embedding is not None:
            print(f"  Embedding shape: {chunk.embedding.shape}")
    ```
  </Accordion>
</AccordionGroup>

## Advanced Features

### Savitzky-Golay Filtering

The `SemanticChunker` applies a Savitzky-Golay filter to the sentence similarity curve before detecting boundaries. Smoothing reduces noise in the similarity signal and yields more stable chunk boundaries.
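
As a rough illustration of the idea (not Chonkie's internal implementation), the sketch below smooths a made-up similarity curve with the standard 5-point Savitzky-Golay kernel for a quadratic or cubic fit, which corresponds to `filter_window=5` and `filter_polyorder=3`. It then treats local minima below a cutoff as candidate boundaries. The `similarities` values, the helper names, and the `cutoff` rule are all assumptions made for this example.

```python
# Standard Savitzky-Golay smoothing coefficients for a 5-point window
# and a quadratic/cubic polynomial fit.
SAVGOL_5_3 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def smooth(values):
    """Smooth a sequence with the 5-point kernel; edge values are left unchanged."""
    half = len(SAVGOL_5_3) // 2
    out = list(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        out[i] = sum(c * v for c, v in zip(SAVGOL_5_3, window))
    return out


def local_minima(values):
    """Indices whose value is strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


# Hypothetical similarity scores between consecutive sentence windows
similarities = [0.82, 0.80, 0.85, 0.79, 0.35, 0.78, 0.83, 0.81, 0.84, 0.80]
cutoff = 0.7  # illustrative cutoff, not a Chonkie parameter

smoothed = smooth(similarities)
boundaries = [i for i in local_minima(smoothed) if smoothed[i] < cutoff]
print(boundaries)
```

The small dips at positions 1 and 7 remain local minima after smoothing but stay well above the cutoff, so only the sharp drop at position 4 is kept as a candidate boundary.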

### Skip-Window Merging

When `skip_window > 0`, the chunker can merge semantically similar groups that are not consecutive. This is useful for:

* Documents with alternating topics
* Content with recurring themes
* Technical documents with distributed related sections
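
The merging idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's algorithm: `skip_merge` and `cosine` are hypothetical helpers, the 2-D vectors stand in for real embeddings, and the sketch ignores how the library tracks text order and chunk size limits.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def skip_merge(groups, vectors, threshold, skip_window):
    """Greedily merge each group with similar groups up to skip_window past its neighbour."""
    if skip_window <= 0:
        return [list(g) for g in groups]  # 0 disables skip-and-merge
    merged, used = [], [False] * len(groups)
    for i in range(len(groups)):
        if used[i]:
            continue
        used[i] = True
        current = list(groups[i])
        reach = min(len(groups), i + skip_window + 2)
        for j in range(i + 1, reach):
            if not used[j] and cosine(vectors[i], vectors[j]) >= threshold:
                current.extend(groups[j])
                used[j] = True
        merged.append(current)
    return merged


# Alternating topics: AI, finance, AI, finance
groups = [
    ["Neural networks process information through interconnected nodes."],
    ["The stock market experienced significant volatility this quarter."],
    ["Deep learning models require substantial training data for optimization."],
    ["Economic indicators point to potential recession risks ahead."],
]
# Toy 2-D "embeddings" (assumed values): first axis ~ AI, second axis ~ finance
vectors = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

merged = skip_merge(groups, vectors, threshold=0.65, skip_window=1)
unmerged = skip_merge(groups, vectors, threshold=0.65, skip_window=0)
print(len(merged), len(unmerged))
```

With `skip_window=1`, each group can reach one position past its immediate neighbour, so the two AI groups and the two finance groups pair up; with `skip_window=0`, the four groups are left as they are.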

## Supported Embeddings

`SemanticChunker` supports multiple embedding providers through Chonkie's embedding system. See the [Embeddings Overview](/python-sdk/embeddings/overview) for more information.

## Return Type

`SemanticChunker` returns a list of `Chunk` objects:

```python theme={"system"}
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
```
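
The index fields make it possible to map a chunk back to its position in the source text. The sketch below uses a local stand-in for `Chunk` with the fields listed above, so it runs without Chonkie installed. It assumes that `end_index` is exclusive, so that slicing the source with the two indices reproduces `chunk.text`. The token counts are placeholder values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:  # local stand-in mirroring the documented fields
    text: str
    start_index: int
    end_index: int
    token_count: int


def locate(source: str, chunk: Chunk) -> str:
    """Return the slice of the source text that the chunk's indices point at."""
    return source[chunk.start_index : chunk.end_index]


source = "Neural networks learn patterns. Markets moved sharply."
first = "Neural networks learn patterns. "

# Hand-built chunks for illustration; token counts are placeholders
chunks = [
    Chunk(text=first, start_index=0, end_index=len(first), token_count=4),
    Chunk(
        text=source[len(first):],
        start_index=len(first),
        end_index=len(source),
        token_count=3,
    ),
]

for chunk in chunks:
    print(repr(locate(source, chunk)), chunk.token_count)
```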
