The MistralOCR chef extracts text from images and PDF files using Mistral’s OCR API, returning structured MarkdownDocument objects for further processing.

Installation

pip install "chonkie[mistral]"

You need a Mistral API key. Set the MISTRAL_API_KEY environment variable or pass the key directly through the api_key parameter.
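
The fallback described above (explicit key first, then the environment variable) can be sketched as follows. This is a minimal illustration, not Chonkie's implementation: resolve_api_key is a hypothetical helper, and raising ValueError when no key is found is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else MISTRAL_API_KEY."""
    # An explicitly passed key wins; otherwise read the environment variable.
    key = api_key or os.environ.get("MISTRAL_API_KEY")
    if not key:
        # Assumption: fail early with a clear message instead of at request time.
        raise ValueError("No Mistral API key: pass api_key or set MISTRAL_API_KEY")
    return key
```

Checking for the key before constructing the chef surfaces a missing credential at startup rather than on the first OCR request.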

Initialization

from chonkie import MistralOCR

# Default initialization (uses MISTRAL_API_KEY env var)
ocr = MistralOCR()

# Custom model and explicit API key
ocr = MistralOCR(model="mistral-ocr-2505", api_key="sk-...")

Parameters

model (str, default: "mistral-ocr-latest")
The Mistral OCR model to use.

api_key (Optional[str], default: None)
Mistral API key. Falls back to the MISTRAL_API_KEY environment variable.

Methods

process()

Process an image or PDF file and return a MarkdownDocument.

Parameters

path (Union[str, Path], required)
Path to the image or PDF file.

Returns

MarkdownDocument containing the extracted text as markdown content.

process_batch()

Process multiple image or PDF files at once.

Parameters

paths (list[Union[str, Path]], required)
List of file paths to process.

Returns

list[MarkdownDocument] where each document contains extracted text from a file.
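
The source does not say how process_batch() behaves when one file in the list fails. If you need per-file error isolation, one option is to call process() in a loop yourself. The sketch below assumes only that process is a callable taking a path; process_each is a hypothetical helper, not part of Chonkie.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple, Union

PathLike = Union[str, Path]


def process_each(
    process: Callable[[PathLike], Any],
    paths: List[PathLike],
) -> Tuple[List[Any], Dict[str, Exception]]:
    """Hypothetical helper: run `process` per file and collect failures."""
    results: List[Any] = []
    failures: Dict[str, Exception] = {}
    for path in paths:
        try:
            results.append(process(path))
        except Exception as exc:
            # Record the error and keep going so one bad file
            # does not discard the rest of the batch.
            failures[str(path)] = exc
    return results, failures
```

With a MistralOCR instance this would be called as `docs, failed = process_each(ocr.process, ["page1.png", "page2.png"])`.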

parse()

Wrap raw text in a Document without modifying it. Because OCR operates on files, no extraction is performed on the text.

Parameters

text (str, required)
Raw text to wrap into a Document.

Returns

Document containing the provided text.

Supported File Types

Type       Extensions
Images     .png, .jpg, .jpeg, .gif, .bmp, .webp, .tiff, .tif
Documents  .pdf
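
When a directory mixes supported and unsupported files, it can help to filter paths before sending them to the chef. The sketch below uses the extensions from the table above; filter_supported is a hypothetical helper, and treating extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Extensions taken from the supported file types table.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff", ".tif", ".pdf",
}


def filter_supported(paths: Iterable[Union[str, Path]]) -> List[Path]:
    """Hypothetical helper: keep only paths with a supported extension."""
    # Assumption: compare lowercased suffixes so "SCAN.PDF" is accepted.
    return [Path(p) for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]
```

The filtered list can then be passed to process_batch().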

Usage

Standalone

from chonkie import MistralOCR

ocr = MistralOCR()

# Single file
doc = ocr.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = ocr.process_batch(["page1.png", "page2.png"])

# Async
import asyncio
doc = asyncio.run(ocr.aprocess("document.pdf"))

Pipeline

Use .process_with("mistral") to add OCR to a pipeline:

from chonkie import Pipeline

# Process a PDF with OCR and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")

OCR + RAG Pipeline

Build a complete pipeline from scanned documents to a vector database:

from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./scanned_docs", ext=[".pdf", ".png"])
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="scanned_documents")
    .run())

print(f"Ingested {len(docs)} documents")

OCR + Semantic Chunking

Use semantic chunking on OCR output so that chunk boundaries follow shifts in meaning rather than fixed sizes:

from chonkie import Pipeline

doc = (Pipeline()
    .fetch_from("file", path="textbook_chapter.pdf")
    .process_with("mistral")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("embedding", model="text-embedding-3-small")
    .export_with("json", file="textbook_chunks.json")
    .run())

Integration with Chunkers

MistralOCR returns a MarkdownDocument whose content is plain markdown text, so it can be passed to any chunker:

from chonkie import MistralOCR, RecursiveChunker

# Step 1: Extract text from PDF
ocr = MistralOCR()
doc = ocr.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")

Notes

  • OCR quality depends on image resolution and clarity
  • Large PDFs are processed page-by-page and concatenated with double newlines
  • The extracted text is returned as markdown, preserving structure from the source document
  • API calls are synchronous by default; use aprocess() for async execution
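
As an illustration of the last note, several files can be processed concurrently by awaiting aprocess() for each path. This sketch assumes only that aprocess is an awaitable callable taking a path, as in the usage example above; process_concurrently and max_concurrency are hypothetical names, and the limit of 4 is arbitrary.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def process_concurrently(
    aprocess: Callable[[str], Awaitable[Any]],
    paths: List[str],
    max_concurrency: int = 4,
) -> List[Any]:
    """Hypothetical helper: await `aprocess` for many paths with a concurrency cap."""
    # The semaphore bounds how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await aprocess(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))
```

With a MistralOCR instance this would be run as `docs = asyncio.run(process_concurrently(ocr.aprocess, ["page1.png", "page2.png"]))`. Capping concurrency is a design choice that helps avoid API rate limits.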