# MistralOCR

> Extract text from images and PDFs using Mistral's OCR API.

The `MistralOCR` chef extracts text from images and PDF files using Mistral's OCR API, returning structured `MarkdownDocument` objects for further processing.

## Installation

```bash theme={"system"}
pip install "chonkie[mistral]"
```

<Info>
  You need a Mistral API key. Set the `MISTRAL_API_KEY` environment variable or pass the key directly through the `api_key` parameter.
</Info>

## Initialization

```python theme={"system"}
from chonkie import MistralOCR

# Default initialization (uses MISTRAL_API_KEY env var)
ocr = MistralOCR()

# Custom model and explicit API key
ocr = MistralOCR(model="mistral-ocr-2505", api_key="sk-...")
```

### Parameters

<ParamField path="model" type="str" default="mistral-ocr-latest">
  The Mistral OCR model to use.
</ParamField>

<ParamField path="api_key" type="Optional[str]" default="None">
  Mistral API key. Falls back to the `MISTRAL_API_KEY` environment variable.
</ParamField>
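
The fallback described above (explicit `api_key` first, then the `MISTRAL_API_KEY` environment variable) can be sketched as follows. This is an illustration only, not Chonkie's actual implementation; `resolve_api_key` is a hypothetical helper, and raising `ValueError` when no key is found is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit key, else read MISTRAL_API_KEY."""
    key = api_key or os.environ.get("MISTRAL_API_KEY")
    if not key:
        # Assumption: failing early is friendlier than a late authentication error.
        raise ValueError("No Mistral API key: pass api_key=... or set MISTRAL_API_KEY")
    return key
```

Calling a check like this before constructing `MistralOCR` may help surface a missing key at startup rather than on the first OCR request.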

## Methods

### process()

Process an image or PDF file and return a `MarkdownDocument`.

#### Parameters

<ParamField path="path" type="Union[str, Path]" required>
  Path to the image or PDF file.
</ParamField>

#### Returns

`MarkdownDocument` containing the extracted text as markdown content.

### process\_batch()

Process multiple image or PDF files at once.

#### Parameters

<ParamField path="paths" type="list[Union[str, Path]]" required>
  List of file paths to process.
</ParamField>

#### Returns

`list[MarkdownDocument]` where each document contains extracted text from a file.
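
This page does not state how a single failing file affects a batch. If per-file isolation matters, one option is to call `process()` in a loop and collect failures separately. The sketch below assumes only that the object passed in has a `process(path)` method; `process_each` is a hypothetical helper, not part of Chonkie.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple, Union


def process_each(
    ocr: Any, paths: List[Union[str, Path]]
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Hypothetical helper: process files one by one, keeping successes and failures apart."""
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            results[str(path)] = ocr.process(path)
        except Exception as exc:
            # Broad catch on purpose: the exception types raised by the
            # underlying client are not documented here.
            errors[str(path)] = exc
    return results, errors
```

With a real instance this would be called as `results, errors = process_each(MistralOCR(), ["page1.png", "page2.png"])`.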

### parse()

Wrap raw text in a `Document` without modifying it. No OCR is performed here, since OCR operates on files rather than text.

#### Parameters

<ParamField path="text" type="str" required>
  Raw text to wrap into a Document.
</ParamField>

#### Returns

`Document` containing the provided text.

## Supported File Types

| Type      | Extensions                                                        |
| --------- | ----------------------------------------------------------------- |
| Images    | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp`, `.tiff`, `.tif` |
| Documents | `.pdf`                                                            |
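
To avoid sending unsupported files to the API, paths can be filtered against the extensions in the table above before calling `process_batch()`. The following is a minimal sketch; `split_supported` is a hypothetical helper, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union

# Extensions copied from the table above.
SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".tiff", ".tif", ".pdf",
}


def split_supported(
    paths: Iterable[Union[str, Path]],
) -> Tuple[List[Path], List[Path]]:
    """Hypothetical helper: separate paths into (supported, skipped) by file extension."""
    supported: List[Path] = []
    skipped: List[Path] = []
    for path in map(Path, paths):
        # Assumption: treat ".PDF" and ".pdf" the same.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```

The `supported` list can then be passed to `process_batch()`, while `skipped` can be logged or routed to a different chef.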

## Usage

### Standalone

```python theme={"system"}
import asyncio

from chonkie import MistralOCR

ocr = MistralOCR()

# Single file
doc = ocr.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = ocr.process_batch(["page1.png", "page2.png"])

# Async: aprocess() is awaited, so run it inside an event loop
doc = asyncio.run(ocr.aprocess("document.pdf"))
```

### Pipeline

Use `.process_with("mistral")` to add OCR to a pipeline:

```python theme={"system"}
from chonkie import Pipeline

# Process a PDF with OCR and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")
```

### OCR + RAG Pipeline

Build a complete pipeline from scanned documents to a vector database:

```python theme={"system"}
from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./scanned_docs", ext=[".pdf", ".png"])
    .process_with("mistral")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="scanned_documents")
    .run())

print(f"Ingested {len(docs)} documents")
```

### OCR + Semantic Chunking

Use semantic chunking on OCR output so that chunk boundaries follow shifts in meaning rather than fixed sizes:

```python theme={"system"}
from chonkie import Pipeline

doc = (Pipeline()
    .fetch_from("file", path="textbook_chapter.pdf")
    .process_with("mistral")
    .chunk_with("semantic", threshold=0.8, chunk_size=1024)
    .refine_with("embedding", model="text-embedding-3-small")
    .export_with("json", file="textbook_chunks.json")
    .run())
```

## Integration with Chunkers

`MistralOCR` returns a `MarkdownDocument`, whose `content` string can be passed to any chunker:

```python theme={"system"}
from chonkie import MistralOCR, RecursiveChunker

# Step 1: Extract text from PDF
ocr = MistralOCR()
doc = ocr.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")
```

## Notes

* OCR quality depends on image resolution and clarity.
* Large PDFs are processed page by page, and the page outputs are concatenated with double newlines.
* The extracted text is returned as markdown, preserving structure from the source document.
* API calls are synchronous by default; use `aprocess()` for async execution.
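
As an illustration of the last note, several files can be processed concurrently by awaiting `aprocess()` under `asyncio.gather`. The sketch below uses a stand-in `FakeOCR` class so it runs without an API key; `process_concurrently` and the `max_concurrency` limit are hypothetical, and the only assumption about the real object is that `aprocess(path)` is awaitable.

```python
import asyncio
from typing import Any, List


class FakeOCR:
    """Stand-in exposing an awaitable aprocess(); swap in MistralOCR() for real use."""

    async def aprocess(self, path: str) -> str:
        await asyncio.sleep(0)  # simulate yielding to the event loop during I/O
        return f"markdown for {path}"


async def process_concurrently(ocr: Any, paths: List[str], max_concurrency: int = 4) -> List[Any]:
    """Hypothetical helper: run aprocess() on all paths, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await ocr.aprocess(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))


results = asyncio.run(process_concurrently(FakeOCR(), ["a.pdf", "b.png"]))
```

The semaphore caps the number of in-flight requests, which may help stay within API rate limits; a suitable limit depends on your Mistral plan.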