Chonkie Documentation

The LiteParse chef extracts text from PDFs, office documents, and images using LlamaIndex's LiteParse. It runs entirely locally, with no cloud API dependencies.

Installation

pip install "chonkie[liteparse]"

OCR uses the bundled Tesseract by default. Office document conversion requires LibreOffice to be installed, and image conversion requires ImageMagick to be installed.
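
Because Office and image conversion depend on external tools, a quick pre-flight check can catch a missing install before a run fails. The sketch below is an assumption about how those tools usually appear on PATH: soffice/libreoffice and magick/convert are the common executable names, but the names LiteParse itself looks for are not stated here, and find_converters is a hypothetical helper.

```python
import shutil

# Common executable names for each external tool.
# Assumption: these are typical names, not taken from LiteParse itself.
# Note: on Windows, "convert" may resolve to an unrelated system utility.
CANDIDATES = {
    "LibreOffice": ("soffice", "libreoffice"),
    "ImageMagick": ("magick", "convert"),
}


def find_converters(which=shutil.which):
    """Map each tool to the first matching executable on PATH, or None."""
    found = {}
    for tool, names in CANDIDATES.items():
        found[tool] = next((p for p in map(which, names) if p), None)
    return found


if __name__ == "__main__":
    for tool, path in find_converters().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

The which argument exists only so the lookup can be swapped out in tests; in normal use, call find_converters() with no arguments.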

Initialization

from chonkie import LiteParse

# Default initialization
chef = LiteParse()

# Custom configuration
chef = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",
    dpi=300,
    max_pages=100,
    target_pages="1-10",
    num_workers=8,
)

Parameters

ocr_enabled

Optional[bool]

default:"None"

Whether to enable OCR for scanned or image-based text. When None, LiteParse’s own default applies.

ocr_language

Optional[str]

default:"None"

Language code for OCR (e.g., "eng", "fra", "deu").

Optional[str]

default:"None"

URL of an optional HTTP OCR server (e.g., an EasyOCR or PaddleOCR server).

max_pages

Optional[int]

default:"None"

Maximum number of pages to parse.

target_pages

Optional[str]

default:"None"

Specific pages to parse (e.g., "1-5,10").

dpi

Optional[float]

default:"None"

Rendering resolution for PDF pages, in dots per inch.

num_workers

Optional[int]

default:"None"

Number of pages to OCR in parallel. When None, defaults to CPU cores - 1.

Optional[str]

default:"None"

Password for protected PDFs.
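
The parallel-OCR default described above (CPU cores - 1) can be reproduced for logging or capacity planning. This is a sketch of the stated rule, not LiteParse's own code: default_num_workers is a hypothetical helper, and the floor of one worker on single-core machines is an assumption.

```python
import os


def default_num_workers(cpu_count=None):
    """Return CPU cores - 1, never less than 1 (the floor is an assumption)."""
    # os.cpu_count() can return None when the count is undetermined.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


if __name__ == "__main__":
    print(f"Default OCR workers on this machine: {default_num_workers()}")
```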

Methods

process()

Process a file and return a Document.

Parameters

Union[str, Path]

required

Path to the file to process.

Returns

Document containing the extracted text content.

process_batch()

Process multiple files at once.

Parameters

list[Union[str, Path]]

required

List of file paths to process.

Returns

list[Document] where each document contains extracted text from a file.
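
How process_batch() behaves when one file fails is not described here. When a single bad file should not abort the whole run, a per-file loop is one option. A minimal sketch, in which process_each is a hypothetical helper and process stands for any callable that takes a path, such as chef.process:

```python
from pathlib import Path
from typing import Callable, Union

PathLike = Union[str, Path]


def process_each(paths: list[PathLike], process: Callable[[PathLike], object]):
    """Call process(path) for each file; return (results, failures) keyed by path."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[str(path)] = process(path)
        except Exception as exc:  # broad on purpose: keep going past any bad file
            failures[str(path)] = exc
    return results, failures
```

Usage would look like docs, errors = process_each(["report.pdf", "slides.pptx"], chef.process), after which errors maps each failed path to the exception it raised.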

parse()

Parse raw text into a Document (wraps text as-is, since LiteParse operates on files).

Parameters

str

required

Raw text to wrap into a Document.

Returns

Document containing the provided text.

Supported File Types

Type	Extensions
PDF	`.pdf`
Word	`.doc`, `.docx`, `.docm`, `.odt`, `.rtf`
PowerPoint	`.ppt`, `.pptx`, `.pptm`, `.odp`
Spreadsheets	`.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`
Images	`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif`, `.webp`, `.svg`
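
The table above can be turned into a small lookup for filtering a file list before processing. This is an illustrative sketch: file_type and split_supported are hypothetical helpers, and lowercasing the extension assumes that matching is case-insensitive, which is not stated here.

```python
from pathlib import Path

# Extension groups copied from the supported file types table.
SUPPORTED = {
    "PDF": {".pdf"},
    "Word": {".doc", ".docx", ".docm", ".odt", ".rtf"},
    "PowerPoint": {".ppt", ".pptx", ".pptm", ".odp"},
    "Spreadsheets": {".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".tif", ".webp", ".svg"},
}


def file_type(path):
    """Return the table's type name for a path, or None if unsupported."""
    ext = Path(path).suffix.lower()  # assumption: case-insensitive match
    for name, extensions in SUPPORTED.items():
        if ext in extensions:
            return name
    return None


def split_supported(paths):
    """Split paths into (supported, skipped), preserving input order."""
    supported = [p for p in paths if file_type(p) is not None]
    skipped = [p for p in paths if file_type(p) is None]
    return supported, skipped
```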

Usage

Standalone

from chonkie import LiteParse

chef = LiteParse()

# Single file
doc = chef.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = chef.process_batch(["report.pdf", "slides.pptx", "data.xlsx"])

# Async
import asyncio
doc = asyncio.run(chef.aprocess("document.pdf"))
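
The async call above handles one file. For several files, the coroutines can be awaited concurrently with asyncio.gather. The sketch below assumes only that aprocess is an awaitable taking a path, as in the example; process_concurrently and its limit parameter are hypothetical, and whether LiteParse gains throughput from concurrent calls is not stated here.

```python
import asyncio


async def process_concurrently(paths, aprocess, limit=4):
    """Await aprocess(path) for every path, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path):
        async with semaphore:
            return await aprocess(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))
```

Usage would look like docs = asyncio.run(process_concurrently(["a.pdf", "b.pdf"], chef.aprocess)).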

Pipeline

Use .process_with("liteparse") to add local document parsing to a pipeline:

from chonkie import Pipeline

# Process a PDF locally and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("liteparse")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")

Local RAG Pipeline

Build a complete pipeline from documents to vector database without any cloud OCR:

from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".pdf", ".docx", ".pptx"])
    .process_with("liteparse")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="local_documents")
    .run())

print(f"Ingested {len(docs)} documents")

Targeted Page Extraction

Parse only specific pages from a large PDF:

from chonkie import LiteParse

chef = LiteParse(target_pages="1-5,10,15-20", dpi=300)
doc = chef.process("large_report.pdf")
print(f"Extracted {len(doc.content)} characters from selected pages")
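
To sanity-check a page specification before a long run, the string can be expanded into explicit page numbers. This is a sketch under the assumption that ranges are 1-based and inclusive; it is not LiteParse's own parser, and expand_pages is a hypothetical helper.

```python
def expand_pages(spec):
    """Expand a spec such as "1-5,10" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive on both ends
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    selected = expand_pages("1-5,10,15-20")
    print(f"{len(selected)} pages selected")  # 12 pages selected
```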

Integration with Chunkers

LiteParse returns a Document, making it compatible with any chunker:

from chonkie import LiteParse, RecursiveChunker

# Step 1: Extract text from PDF
chef = LiteParse()
doc = chef.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")

Notes

Runs entirely locally with no API keys or cloud dependencies
OCR quality depends on image resolution and the Tesseract language pack
Office documents (Word, PowerPoint, Excel) require LibreOffice to be installed
Image files require ImageMagick to be installed
Use num_workers to control parallelism for multi-page OCR
Use target_pages for efficient extraction from large PDFs
Calls to process() are synchronous; use aprocess() for async execution
