LiteParse, from LlamaIndex, extracts text from PDFs, office documents, and images. It runs entirely locally, with no cloud API dependencies.
Installation
LiteParse runs locally. OCR uses bundled Tesseract by default. Office document conversion requires LibreOffice, and image conversion requires ImageMagick.
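Before parsing office documents or images, it can help to confirm that the external tools are on your PATH. The check below is a minimal sketch using only the Python standard library. The executable names (`soffice`, `libreoffice`, `magick`, `convert`) are assumptions that vary by platform and install method, and `find_tools` is a hypothetical helper, not part of LiteParse.

```python
import shutil

# Candidate executable names are assumptions; adjust them for your platform.
# Note that on Windows, "convert" may resolve to an unrelated system utility.
OPTIONAL_TOOLS = {
    "LibreOffice (office documents)": ("soffice", "libreoffice"),
    "ImageMagick (images)": ("magick", "convert"),
}


def find_tools(tools=OPTIONAL_TOOLS):
    """Map each tool label to the first executable found on PATH, or None."""
    found = {}
    for label, candidates in tools.items():
        found[label] = None
        for name in candidates:
            path = shutil.which(name)
            if path is not None:
                found[label] = path
                break
    return found


if __name__ == "__main__":
    for label, path in find_tools().items():
        print(f"{label}: {path or 'not found on PATH'}")
```

A `None` entry means only PDF parsing (and OCR through the bundled Tesseract) should be expected to work for that category until the tool is installed.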
Initialization
Parameters
Whether to enable OCR for scanned/image text (defaults to LiteParse’s behavior when None).
Language code for OCR (e.g., "eng", "fra", "deu").
Optional HTTP OCR server URL (e.g., EasyOCR or PaddleOCR server).
Maximum number of pages to parse.
Specific pages to parse (e.g., "1-5,10").
Rendering resolution for PDF pages.
Number of pages to OCR in parallel (defaults to CPU cores - 1).
Password for protected PDFs.
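To make the page specification and the parallelism default concrete, here is a small sketch of how a string such as "1-5,10" can be read and how "CPU cores - 1" can be computed. Both `expand_pages` and `default_workers` are hypothetical helpers written for illustration. How LiteParse interprets the string internally, and whether it counts pages from 1, is not stated here, so treat this as one plausible reading.

```python
import os


def expand_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-5,10" into sorted, unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def default_workers() -> int:
    """CPU cores minus one, but never fewer than one worker."""
    return max(1, (os.cpu_count() or 1) - 1)


print(expand_pages("1-5,10"))  # [1, 2, 3, 4, 5, 10]
```

Validating a page string this way before a long parse can catch typos such as a reversed range early.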
Methods
process()
Process a file and return a Document.
Parameters
Path to the file to process.
Returns
Document containing the extracted text content.
process_batch()
Process multiple files at once.
Parameters
List of file paths to process.
Returns
list[Document] where each document contains extracted text from a file.
parse()
Parse raw text into a Document (wraps the text as-is, since LiteParse operates on files).
Parameters
Raw text to wrap into a Document.
Returns
Document containing the provided text.
Supported File Types
| Type | Extensions |
|---|---|
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .tif, .webp, .svg |
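When a directory mixes supported and unsupported files, filtering by extension before a batch call avoids avoidable failures. The sketch below encodes the table above as a lookup. `SUPPORTED`, `file_type`, and `split_supported` are hypothetical names for illustration, not LiteParse APIs, and matching on the extension alone is an assumption about how files are recognized.

```python
from pathlib import Path

# Extensions copied from the table above, grouped by type.
SUPPORTED = {
    "PDF": {".pdf"},
    "Word": {".doc", ".docx", ".docm", ".odt", ".rtf"},
    "PowerPoint": {".ppt", ".pptx", ".pptm", ".odp"},
    "Spreadsheets": {".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv"},
    "Images": {".png", ".jpg", ".jpeg", ".gif", ".bmp",
               ".tiff", ".tif", ".webp", ".svg"},
}


def file_type(path):
    """Return the type label for a path, or None if the extension is unlisted."""
    ext = Path(path).suffix.lower()  # case-insensitive match
    for label, extensions in SUPPORTED.items():
        if ext in extensions:
            return label
    return None


def split_supported(paths):
    """Partition paths into (supported, unsupported), preserving order."""
    supported = [p for p in paths if file_type(p) is not None]
    unsupported = [p for p in paths if file_type(p) is None]
    return supported, unsupported


print(file_type("report.DOCX"))  # Word
```

The supported half of `split_supported` can then be passed on as the list of file paths for batch processing.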
Usage
Standalone
Pipeline
Use .process_with("liteparse") to add local document parsing to a pipeline:
Local RAG Pipeline
Build a complete pipeline from documents to a vector database without any cloud OCR:
Targeted Page Extraction
Parse only specific pages from a large PDF:
Integration with Chunkers
LiteParse returns a Document, making it compatible with any chunker:
Notes
- Runs entirely locally with no API keys or cloud dependencies
- OCR quality depends on image resolution and the Tesseract language pack
- Office documents (Word, PowerPoint, Excel) require LibreOffice to be installed
- Image files require ImageMagick to be installed
- Use num_workers to control parallelism for multi-page OCR
- Use target_pages for efficient extraction from large PDFs
- API calls are synchronous by default; use aprocess() for async execution
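To illustrate why the synchronous/asynchronous distinction matters, the sketch below shows the general pattern of running a blocking call from async code without stalling the event loop. The `process` function here is a stand-in that only sleeps. It is not the LiteParse API, and this is not a description of how aprocess() is implemented.

```python
import asyncio
import time


def process(path: str) -> str:
    """Stand-in for a blocking parse call; NOT the LiteParse API."""
    time.sleep(0.01)  # simulate parsing work
    return f"text from {path}"


async def process_many(paths):
    # Each blocking call runs in a worker thread so the event loop stays free.
    # asyncio.gather returns results in the same order as the inputs.
    tasks = (asyncio.to_thread(process, p) for p in paths)
    return await asyncio.gather(*tasks)


results = asyncio.run(process_many(["a.pdf", "b.pdf"]))
print(results)  # ['text from a.pdf', 'text from b.pdf']
```

Where a native async method such as aprocess() is available, prefer it over wrapping the synchronous call yourself.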
