The LiteParse chef extracts text from PDFs, office documents, and images by wrapping the LiteParse library from LlamaIndex. It runs entirely locally, with no cloud API dependencies.

Installation

pip install "chonkie[liteparse]"
LiteParse runs locally. OCR uses bundled Tesseract by default. Office document conversion requires LibreOffice, and image conversion requires ImageMagick.
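
Before parsing office documents or images, it can help to confirm that the external tools are actually on PATH. The following is a minimal sketch using only the standard library. The executable names (soffice, libreoffice, magick, convert) are assumptions about common installs, not names taken from LiteParse itself, and on Windows convert may resolve to an unrelated system utility.

```python
import shutil

# Candidate executable names are assumptions; adjust them for your platform.
TOOL_CANDIDATES = {
    "LibreOffice": ("soffice", "libreoffice"),
    "ImageMagick": ("magick", "convert"),
}


def find_tools(candidates=TOOL_CANDIDATES):
    """Map each tool to the first matching executable on PATH, or None."""
    found = {}
    for tool, names in candidates.items():
        # shutil.which returns the full path to the executable, or None.
        found[tool] = next((p for p in map(shutil.which, names) if p), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A None entry only means the executable was not found under these names; the tool may still be installed elsewhere.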

Initialization

from chonkie import LiteParse

# Default initialization
chef = LiteParse()

# Custom configuration
chef = LiteParse(
    ocr_enabled=True,
    ocr_language="eng",
    dpi=300,
    max_pages=100,
    target_pages="1-10",
    num_workers=8,
)

Parameters

ocr_enabled
Optional[bool]
default:"None"
Whether to enable OCR for scanned/image text (defaults to LiteParse’s behavior when None).
ocr_language
Optional[str]
default:"None"
Language code for OCR (e.g., "eng", "fra", "deu").
ocr_server_url
Optional[str]
default:"None"
Optional HTTP OCR server URL (e.g., EasyOCR or PaddleOCR server).
max_pages
Optional[int]
default:"None"
Maximum number of pages to parse.
target_pages
Optional[str]
default:"None"
Specific pages to parse (e.g., "1-5,10").
dpi
Optional[float]
default:"None"
Rendering resolution for PDF pages.
num_workers
Optional[int]
default:"None"
Number of pages to OCR in parallel (defaults to CPU cores - 1).
password
Optional[str]
default:"None"
Password for protected PDFs.
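
If you prefer to set num_workers explicitly rather than rely on the default, the "CPU cores - 1" rule can be approximated as below. This is a sketch of that rule only; how LiteParse computes its own default internally is not shown here, and capped_workers is a hypothetical helper name.

```python
import os


def default_ocr_workers():
    """Approximate 'CPU cores - 1', never dropping below one worker."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cores - 1)


def capped_workers(limit):
    """Use the approximated default, but never more than `limit` workers."""
    return max(1, min(limit, default_ocr_workers()))


# Example (requires chonkie[liteparse]):
# chef = LiteParse(num_workers=capped_workers(4))
```

Capping the worker count is useful on shared machines where OCR should not take every core.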

Methods

process()

Process a file and return a Document.

Parameters

path
Union[str, Path]
required
Path to the file to process.

Returns

Document containing the extracted text content.

process_batch()

Process multiple files at once.

Parameters

paths
list[Union[str, Path]]
required
List of file paths to process.

Returns

list[Document] where each document contains extracted text from a file.
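
This page does not say how process_batch() behaves when one file in the list cannot be parsed. If you need per-file error reporting, one option is to loop over process() yourself. The sketch below assumes only that the chef exposes process(path) as documented above; process_each is a hypothetical helper, and the broad except is deliberate because the exception types are not documented here.

```python
from pathlib import Path


def process_each(chef, paths):
    """Call chef.process() per file and return (documents, failures)."""
    documents, failures = [], []
    for path in map(Path, paths):
        try:
            documents.append(chef.process(path))
        except Exception as exc:  # broad on purpose: exception types are not documented
            failures.append((path, exc))
    return documents, failures


# Example (requires chonkie[liteparse]):
# docs, failed = process_each(LiteParse(), ["report.pdf", "slides.pptx"])
# for path, error in failed:
#     print(f"Skipped {path}: {error}")
```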

parse()

Parse raw text into a Document (wraps text as-is, since LiteParse operates on files).

Parameters

text
str
required
Raw text to wrap into a Document.

Returns

Document containing the provided text.

Supported File Types

Type          Extensions
PDF           .pdf
Word          .doc, .docx, .docm, .odt, .rtf
PowerPoint    .ppt, .pptx, .pptm, .odp
Spreadsheets  .xls, .xlsx, .xlsm, .ods, .csv, .tsv
Images        .png, .jpg, .jpeg, .gif, .bmp, .tiff, .tif, .webp, .svg
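
When collecting files from a mixed directory, it may be convenient to filter by the extensions listed above before calling process_batch(). The following sketch hard-codes that list. split_supported is a hypothetical helper, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions copied from the supported file types listed above.
SUPPORTED_EXTENSIONS = frozenset({
    ".pdf",
    ".doc", ".docx", ".docm", ".odt", ".rtf",
    ".ppt", ".pptx", ".pptm", ".odp",
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".tif", ".webp", ".svg",
})


def split_supported(paths):
    """Partition paths into (supported, skipped) by file extension."""
    supported, skipped = [], []
    for path in map(Path, paths):
        # Compare lowercased suffixes so "SCAN.PDF" is treated like "scan.pdf".
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped


# Example (requires chonkie[liteparse]):
# supported, skipped = split_supported(Path("./documents").iterdir())
# docs = LiteParse().process_batch(supported)
```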

Usage

Standalone

from chonkie import LiteParse

chef = LiteParse()

# Single file
doc = chef.process("research_paper.pdf")
print(doc.content)
print(f"Source: {doc.metadata['filename']}")

# Multiple files
docs = chef.process_batch(["report.pdf", "slides.pptx", "data.xlsx"])

# Async
import asyncio
doc = asyncio.run(chef.aprocess("document.pdf"))
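
Because aprocess() is a coroutine, several files can be awaited together. The sketch below bounds concurrency with a semaphore. It assumes only that the chef exposes aprocess(path) as shown above; process_concurrently and max_concurrent are hypothetical names, and whether a single LiteParse instance is safe to share across concurrent calls is not stated on this page.

```python
import asyncio


async def process_concurrently(chef, paths, max_concurrent=4):
    """Await chef.aprocess() for each path, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path):
        async with semaphore:
            return await chef.aprocess(path)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in paths))


# Example (requires chonkie[liteparse]):
# docs = asyncio.run(process_concurrently(LiteParse(), ["a.pdf", "b.pdf"]))
```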

Pipeline

Use .process_with("liteparse") to add local document parsing to a pipeline:
from chonkie import Pipeline

# Process a PDF locally and chunk it
doc = (Pipeline()
    .fetch_from("file", path="document.pdf")
    .process_with("liteparse")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Extracted {len(doc.chunks)} chunks from PDF")

Local RAG Pipeline

Build a complete pipeline from documents to vector database without any cloud OCR:
from chonkie import Pipeline

docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".pdf", ".docx", ".pptx"])
    .process_with("liteparse")
    .chunk_with("recursive", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .store_in("qdrant", collection_name="local_documents")
    .run())

print(f"Ingested {len(docs)} documents")

Targeted Page Extraction

Parse only specific pages from a large PDF:
from chonkie import LiteParse

chef = LiteParse(target_pages="1-5,10,15-20", dpi=300)
doc = chef.process("large_report.pdf")
print(f"Extracted {len(doc.content)} characters from selected pages")
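
If the pages of interest come from code, for example from a search index, the target_pages string can be built from a list of page numbers. This is a minimal sketch that produces the "1-5,10,15-20" style shown above; pages_to_spec is a hypothetical helper, not part of chonkie or LiteParse.

```python
def pages_to_spec(pages):
    """Collapse page numbers into a range string such as "1-5,10,15-20"."""
    ordered = sorted(set(pages))  # drop duplicates and sort ascending
    if not ordered:
        return ""

    def fmt(start, end):
        return str(start) if start == end else f"{start}-{end}"

    parts = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            prev = page  # still inside a consecutive run
            continue
        parts.append(fmt(start, prev))  # close the finished run
        start = prev = page
    parts.append(fmt(start, prev))
    return ",".join(parts)


# Example (requires chonkie[liteparse]):
# chef = LiteParse(target_pages=pages_to_spec([1, 2, 3, 4, 5, 10]))
```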

Integration with Chunkers

LiteParse returns a Document, making it compatible with any chunker:
from chonkie import LiteParse, RecursiveChunker

# Step 1: Extract text from PDF
chef = LiteParse()
doc = chef.process("report.pdf")

# Step 2: Chunk the extracted content
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in the document
doc.chunks = chunks

print(f"Document: {doc.metadata['filename']}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")

Notes

  • Runs entirely locally with no API keys or cloud dependencies
  • OCR quality depends on image resolution and the Tesseract language pack
  • Office documents (Word, PowerPoint, Excel) require LibreOffice to be installed
  • Image files require ImageMagick to be installed
  • Use num_workers to control parallelism for multi-page OCR
  • Use target_pages for efficient extraction from large PDFs
  • process() and process_batch() are synchronous; use aprocess() for async execution