Chonkie Documentation

The FileFetcher retrieves files from your local filesystem. It supports two modes: fetching a single file or fetching multiple files from a directory with optional extension filtering.

Installation

FileFetcher is included with the base Chonkie installation:

pip install chonkie

Usage

Single File Mode

Fetch a single file by providing the path parameter:

from chonkie.pipeline import Pipeline

# Fetch and process a single file
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Chunked into {len(doc.chunks)} chunks")

Directory Mode

Fetch multiple files from a directory using the dir parameter:

# Fetch all files from a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Processed {len(docs)} documents")
for doc in docs:
    print(f"  - {len(doc.chunks)} chunks")

Extension Filtering

Filter files by extension when using directory mode:

# Fetch only .txt and .md files
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())
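To make the filtering semantics concrete, here is a minimal standard-library sketch of selecting files in a directory by extension. It is not Chonkie's implementation: the helper name `list_files_by_ext` is hypothetical, and the non-recursive listing and exact, case-sensitive suffix comparison are assumptions that this page does not specify.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def list_files_by_ext(directory: str, ext: Optional[List[str]] = None) -> List[Path]:
    """Hypothetical helper: list files directly inside `directory`, optionally by suffix.

    Assumptions (not confirmed by the FileFetcher docs): no recursion into
    subdirectories, and suffixes are compared exactly, so ".md" != ".MD".
    """
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    files = [p for p in root.iterdir() if p.is_file()]
    if ext is not None:
        wanted = set(ext)
        files = [p for p in files if p.suffix in wanted]
    return sorted(files)


# Demo on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.md", "c.pdf"):
        (Path(tmp) / name).write_text("sample")
    matches = list_files_by_ext(tmp, ext=[".txt", ".md"])
    print([p.name for p in matches])
```

Note that each extension includes the leading dot, matching the `ext=[".txt", ".md"]` form used above.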

Parameters

path

str

Path to a single file. Cannot be used with dir.

dir

str

Directory to fetch files from. Cannot be used with path.

ext

List[str]

List of file extensions to filter (e.g., [".txt", ".md"]). Only used with dir parameter.

Return Values

Single file mode (path provided): Returns a single Path object
Directory mode (dir provided): Returns List[Path] containing all matching files
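
Because the return type depends on the mode, code that accepts either result may want to normalize it first. The sketch below uses a hypothetical helper, `as_path_list`, which is not part of Chonkie; it relies only on the two return shapes described above.

```python
from pathlib import Path
from typing import List, Union


def as_path_list(result: Union[Path, List[Path]]) -> List[Path]:
    """Hypothetical helper: wrap a single Path in a list, copy a list as-is."""
    if isinstance(result, Path):
        return [result]
    return list(result)


# Stand-ins for the two return shapes described above
single = Path("document.txt")
many = [Path("docs/a.txt"), Path("docs/b.md")]

for path in as_path_list(single) + as_path_list(many):
    print(path.name)
```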

Standalone Usage

You can also use FileFetcher directly without the pipeline:

from chonkie import FileFetcher

fetcher = FileFetcher()

# Single file
file_path = fetcher.fetch(path="document.txt")
print(repr(file_path))  # e.g. PosixPath('document.txt') on Linux/macOS

# Directory with extension filter
files = fetcher.fetch(dir="./docs", ext=[".txt", ".md"])
for file in files:
    print(file)

Error Handling

FileFetcher validates inputs and provides clear error messages:

# FileNotFoundError if file doesn't exist
fetcher.fetch(path="nonexistent.txt")  # Raises FileNotFoundError

# ValueError if both path and dir are provided
fetcher.fetch(path="file.txt", dir="./docs")  # Raises ValueError

# ValueError if neither is provided
fetcher.fetch()  # Raises ValueError
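
In application code, these exceptions would typically be caught rather than left to propagate. The following sketch shows one way to do that. `try_fetch` is a hypothetical wrapper, and `fake_fetch` is a stand-in that only mimics the errors listed above so the example runs without Chonkie installed; with the real fetcher you would pass `fetcher.fetch` instead.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def try_fetch(fetch: Callable[..., Any], **kwargs: Any) -> Optional[Any]:
    """Hypothetical wrapper: call `fetch` and return None on the documented errors."""
    try:
        return fetch(**kwargs)
    except FileNotFoundError as err:
        print(f"Missing file or directory: {err}")
    except ValueError as err:
        print(f"Invalid arguments: {err}")
    return None


def fake_fetch(path: Optional[str] = None, dir: Optional[str] = None) -> Path:
    """Stand-in for fetcher.fetch that mimics only the errors described above."""
    # Exactly one of `path` and `dir` must be given
    if (path is None) == (dir is None):
        raise ValueError("Provide exactly one of 'path' or 'dir'")
    target = Path(path if path is not None else dir)
    if not target.exists():
        raise FileNotFoundError(target)
    return target


print(try_fetch(fake_fetch, path="nonexistent.txt"))
print(try_fetch(fake_fetch, path="file.txt", dir="./docs"))
print(try_fetch(fake_fetch))
```

Returning None keeps the sketch short; a batch job might instead log the error and continue with the next input.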

Best Practices

Use extension filtering for large directories

When working with directories containing many files, always specify ext to avoid processing unwanted files:

# Good - only processes markdown files
Pipeline().fetch_from("file", dir="./docs", ext=[".md"])

# Potentially slow - processes ALL files
Pipeline().fetch_from("file", dir="./docs")

Use absolute paths for clarity

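While relative paths work, they are resolved against the current working directory, so the same pipeline can pick up different files depending on where it is launched. Building an absolute path from the script's own location avoids this:

from pathlib import Path
from chonkie.pipeline import Pipeline

# Anchor the documents directory to this script, not the working directory
docs_dir = Path(__file__).resolve().parent / "documents"
pipeline = Pipeline().fetch_from("file", dir=str(docs_dir), ext=[".txt"])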

What’s Next?

After fetching files, you’ll typically want to:

Process them with a Chef to parse content
Chunk them with a Chunker to split into manageable pieces
Refine chunks with Refineries for better quality

See the Pipeline Guide for complete examples.
