The MistralOCR chef extracts text from images and PDF files using Mistral’s OCR API, returning structured MarkdownDocument objects for further processing.
Installation
You need a Mistral API key. Set the MISTRAL_API_KEY environment variable or pass it directly.
Initialization
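The key lookup described above (explicit argument first, then the MISTRAL_API_KEY environment variable) can be sketched in plain Python. This is only an illustration of the fallback order, not the chef's actual implementation: `resolve_api_key` is a hypothetical helper, and raising `ValueError` when no key is found is an assumption.

```python
import os


def resolve_api_key(api_key=None, env_var="MISTRAL_API_KEY"):
    """Return the explicit key if given, otherwise fall back to the environment.

    Hypothetical helper: it mirrors the documented fallback order only.
    """
    key = api_key or os.environ.get(env_var)
    if not key:
        # Assumption: a missing key is treated as an error up front.
        raise ValueError(f"No API key provided and {env_var} is not set.")
    return key
```

Resolving the key once at startup surfaces a missing credential before any file is sent for OCR.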
Parameters
The Mistral OCR model to use.
Mistral API key. Falls back to the MISTRAL_API_KEY environment variable.
Methods
process()
Process an image or PDF file and return a MarkdownDocument.
Parameters
Path to the image or PDF file.
Returns
MarkdownDocument containing the extracted text as markdown content.
process_batch()
Process multiple image or PDF files at once.
Parameters
List of file paths to process.
Returns
list[MarkdownDocument] where each document contains extracted text from a file.
parse()
Parse raw text into a Document (wraps text as-is, since OCR operates on files).
Parameters
Raw text to wrap into a Document.
Returns
Document containing the provided text.
Supported File Types
| Type | Extensions |
|---|---|
| Images | .png, .jpg, .jpeg, .gif, .bmp, .webp, .tiff, .tif |
| Documents | .pdf |
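Before sending files for OCR, it can help to screen them against the table above. The sketch below is a standalone helper, not part of the library: `SUPPORTED_EXTENSIONS`, `is_supported`, and `filter_supported` are hypothetical names, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions copied from the Supported File Types table.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff", ".tif",  # images
    ".pdf",  # documents
}


def is_supported(path):
    """Return True if the file extension appears in the table above."""
    # Assumption: ".PDF" and ".pdf" are treated the same.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def filter_supported(paths):
    """Keep only the paths whose extension is supported, preserving order."""
    return [p for p in paths if is_supported(p)]
```

A list filtered this way could then be handed to process_batch(), so unsupported files are skipped before any API call is made.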
Usage
Standalone
Pipeline
Use .process_with("mistral") to add OCR to a pipeline:
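A rough sketch of such a pipeline follows. Only `.process_with("mistral")` comes from this page; the `Pipeline` import, the `fetch_from("file", path=...)` step, and the `chunk_with("recursive")` step are assumptions about Chonkie's pipeline API, and `build_ocr_pipeline` is a hypothetical wrapper. The `pipeline` argument exists so the chain can be exercised with a stand-in object.

```python
def build_ocr_pipeline(path, pipeline=None):
    """Chain fetch -> OCR -> chunk steps on a pipeline object.

    Hypothetical helper; step names other than process_with("mistral")
    are assumptions, not confirmed by this page.
    """
    if pipeline is None:
        # Imported lazily so the sketch can be read without chonkie installed.
        from chonkie import Pipeline  # assumed import path

        pipeline = Pipeline()
    return (
        pipeline
        .fetch_from("file", path=path)  # assumed fetcher name and keyword
        .process_with("mistral")        # documented on this page
        .chunk_with("recursive")        # assumed chunker alias
    )
```

The returned object is the configured pipeline; running it is left to the caller.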
OCR + RAG Pipeline
Build a complete pipeline from scanned documents to a vector database.
OCR + Semantic Chunking
Use semantic chunking on OCR output for intelligent retrieval boundaries.
Integration with Chunkers
MistralOCR returns a MarkdownDocument, making it compatible with any chunker:
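One way to wire the two together is sketched below. The function is written against any object with a `process(path)` method and any chunker with a `chunk(text)` method. Both the `.content` attribute on the returned document and the `chunk(text)` call are assumptions, and `ocr_and_chunk` is a hypothetical helper.

```python
def ocr_and_chunk(chef, chunker, path):
    """Run OCR on one file, then chunk the extracted markdown.

    Hypothetical helper: `chef` is expected to behave like MistralOCR,
    `chunker` like any chunker exposing chunk(text).
    """
    document = chef.process(path)  # documented: returns a MarkdownDocument
    # Assumption: the extracted markdown is exposed as `document.content`.
    return chunker.chunk(document.content)
```

Passing the chef and chunker in as arguments keeps the helper independent of any particular chunker class.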
Notes
- OCR quality depends on image resolution and clarity
- Large PDFs are processed page-by-page and concatenated with double newlines
- The extracted text is returned as markdown, preserving structure from the source document
- API calls are synchronous by default; use aprocess() for async execution
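Two of the notes above can be illustrated with a short sketch. `join_pages` shows what concatenating per-page text with double newlines looks like, and `ocr_concurrently` shows one way to run several aprocess() calls at once. Both are hypothetical helpers, and the assumption that aprocess() takes a single path, like process(), is not confirmed by this page.

```python
import asyncio


def join_pages(pages):
    """Concatenate per-page markdown with a blank line between pages."""
    return "\n\n".join(pages)


async def ocr_concurrently(chef, paths):
    """Await chef.aprocess(path) for every path and return results in order.

    Assumption: aprocess() accepts one file path, mirroring process().
    """
    return await asyncio.gather(*(chef.aprocess(p) for p in paths))
```

From synchronous code, the coroutine could be driven with `asyncio.run(ocr_concurrently(chef, paths))`; results come back in the same order as the input paths.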
