v1.0.6

v1.0.6 Release Highlights ✨

  • New SlumberChunker: Welcome Chonkie’s very own agentic chunker! Requires the genie optional install and a GEMINI_API_KEY. It leverages Genie, Chonkie’s interface for generative models.
pip install "chonkie[genie]"
# Import
from chonkie import SlumberChunker

# Initialize
chunker  = SlumberChunker(verbose=True) # set verbose to True, since it takes a while~

# CHONK!
chunker(text)
  • New NeuralChunker: Introducing a fully neural approach to chunking! Requires the neural optional install. This uses a fine-tuned BERT-like model for fast, high-quality chunking.
pip install "chonkie[neural]"
# import
from chonkie import NeuralChunker

# initialize
chunker = NeuralChunker()

# CHONK!
chunks = chunker(text)
  • auto Language Detection for CodeChunker: CodeChunker can now automatically detect the programming language. Specify the language manually if performance is critical.
# Import
from chonkie import CodeChunker

# Initialize the "auto" CodeChunker
chunker = CodeChunker() # No need to specify, "auto" by default

# CHONK!
chunks = chunker(code)
  • Introducing Genies: Added Genie to power SlumberChunker and future generative features. Genies are Chonkie’s way to handle multiple generative APIs and model interfaces. The first is GeminiGenie, requiring the genie optional install.
pip install "chonkie[genie]"
# Import
from chonkie import GeminiGenie

# Init
genie = GeminiGenie(api_key=YOUR_API_KEY)

# generate
genie.generate("Hi!")

# generate JSON
genie.generate_json("Hi", JSON_SCHEMA)

Full Changelog: https://github.com/chonkie-inc/chonkie/compare/v1.0.5…v1.0.6

v1.0.5

v1.0.5 Release Highlights ✨

This is a quick patch release to include CodeChunker in the __init__.py for chonkie so it can be properly accessed via from chonkie import CodeChunker.

Full Changelog: https://github.com/chonkie-inc/chonkie/compare/v1.0.4…v1.0.5

v1.0.4

v1.0.4 Release Highlights ✨

  • New CodeChunker: Introducing the CodeChunker, specialized for handling code files across 100+ programming languages. It understands code structure to provide more meaningful chunks.
pip install "chonkie[code]"
# Initialize the code chunker
chunker = CodeChunker(language="python")

# Chunk the code
code = ... # Your code string

# CHONK!
chunks = chunker(code)
  • JinaAI Embeddings Support: Added JinaEmbeddings, enabling their use with SemanticChunker and SDPMChunker. Just install the jina optional install to use it!
pip install "chonkie[jina]"
# Initialize the Jina embeddings  
from chonkie import JinaEmbeddings, SemanticChunker

# Initialize the Jina embeddings
embeddings = JinaEmbeddings()

# Initialize the semantic chunker
chunker = SemanticChunker(embeddings)

# Chunk the text
text = ... # Your text string

# CHONK!
chunks = chunker(text)
  • OverlapRefinery: Enhance your chunks by adding overlapping context using the new OverlapRefinery. It’s included in the default install and works seamlessly with any chunker.
# Initialize the recursive chunker  
from chonkie import RecursiveChunker, OverlapRefinery
chunker = RecursiveChunker()

# Initialize the overlap refinery
refinery = OverlapRefinery() # Or OverlapRefinery("gpt2")

# Chunk the text
text = ... # Your text string

# CHONK!
chunks = chunker(text)

# Refine the chunks
chunks = refinery(chunks)
  • EmbeddingsRefinery: Compute and attach embeddings directly to your chunks using the EmbeddingsRefinery. Streamline the process of loading chunks into vector databases.
from chonkie import RecursiveChunker, EmbeddingsRefinery, JinaEmbeddings

# Initialize the recursive chunker
chunker = RecursiveChunker()

# Initialize the embeddings model
# Here we use Jina embeddings for this example, but you can use any other embeddings model
embeddings = JinaEmbeddings()

# Initialize the embeddings refinery
refinery = EmbeddingsRefinery(embeddings)

# Chunk the text
text = ... # Your text string
chunks = chunker(text)
chunks = refinery(chunks) # Each chunk now has a .embedding attribute

Full Changelog: https://github.com/chonkie-inc/chonkie/compare/v1.0.3…v1.0.4

v1.0.3

v1.0.3 Release Highlights ✨

  • Chonkie Visualizer: Visualize and debug chunks easily via terminal printouts or HTML saves. Understand chunk quality and debug your chunker with visual feedback~ Use the print method to print rich text on your terminal or use the save method to save a highlighted html on your device! It’s very simple to use, just pass in your chunks~

    from chonkie import Visualizer
    
    viz = Visualizer()
    
    # Print the chunks on the terminal with .print or directly call the Visualizer object too
    viz.print(chunks) 
    
    # Save the HTML file
    viz.save("chonkie.html", chunks)
    
  • Recipes: Chonkie now adds support for Recipes which allow you to use multilingual chunking out-of-the-box, as well as document specific chunking methods. Initial support starts with: en, hi, zh, jp and ko, while document type markdown is supported too. Use it via the from_recipe class method with any chunker that takes delimiters or RecursiveRules.

    from chonkie import RecursiveChunker
    
    # Initialize the recursive chunker to chunk Markdown
    chunker = RecursiveChunker.from_recipe("markdown", lang="en")
    
    # Initialize the recursive chunker to chunk Hindi texts
    chunker = RecursiveChunker.from_recipe(lang="hi")
    
  • Performance enhancements in RecursiveChunker, SentenceChunker, and WordTokenizer.

Full Changelog: https://github.com/chonkie-inc/chonkie/compare/v1.0.2…v1.0.3