The FileFetcher retrieves files from your local filesystem. It supports two modes: fetching a single file or fetching multiple files from a directory with optional extension filtering.

Installation

FileFetcher is included with the base Chonkie installation:
pip install chonkie

Usage

Single File Mode

Fetch a single file by providing the path parameter:
from chonkie.pipeline import Pipeline

# Fetch and process a single file
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Chunked into {len(doc.chunks)} chunks")

Directory Mode

Fetch multiple files from a directory using the dir parameter:
from chonkie.pipeline import Pipeline

# Fetch all files from a directory
docs = (Pipeline()
    .fetch_from("file", dir="./documents")
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

print(f"Processed {len(docs)} documents")
for doc in docs:
    print(f"  - {len(doc.chunks)} chunks")

Extension Filtering

Filter files by extension when using directory mode:
from chonkie.pipeline import Pipeline

# Fetch only .txt and .md files
docs = (Pipeline()
    .fetch_from("file", dir="./documents", ext=[".txt", ".md"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=512)
    .run())

Parameters

path (str): Path to a single file. Cannot be used with dir.
dir (str): Directory to fetch files from. Cannot be used with path.
ext (List[str]): List of file extensions to filter (e.g., [".txt", ".md"]). Only used with the dir parameter.
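
To make the ext semantics concrete, here is a standard-library sketch of directory listing with suffix filtering. It is not Chonkie's implementation: the helper name list_matching_files, the non-recursive listing, and the exact suffix comparison are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional


def list_matching_files(dir: str, ext: Optional[List[str]] = None) -> List[Path]:
    """Hypothetical stand-in: list files directly inside `dir`, optionally filtered by suffix."""
    directory = Path(dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    return sorted(
        entry
        for entry in directory.iterdir()  # assumption: no recursion into subfolders
        if entry.is_file() and (ext is None or entry.suffix in ext)
    )
```

In pathlib, a file's suffix includes the leading dot, so under this sketch ".md" matches notes.md while "md" matches nothing. That is one reason to write extensions the way the example above does, as [".txt", ".md"].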

Return Values

  • Single file mode (path provided): Returns a single Path object
  • Directory mode (dir provided): Returns List[Path] containing all matching files
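
Because the return type depends on the mode, code that accepts either result may want to normalize it first. A minimal sketch, assuming only the two return shapes listed above; as_path_list is a hypothetical helper, not part of Chonkie:

```python
from pathlib import Path
from typing import List, Union


def as_path_list(result: Union[Path, List[Path]]) -> List[Path]:
    """Wrap a single Path in a list; return a copy of a list of Paths."""
    if isinstance(result, Path):
        return [result]
    return list(result)
```

With this in place, a loop such as `for file in as_path_list(fetcher.fetch(path="document.txt"))` works the same way for both modes.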

Standalone Usage

You can also use FileFetcher directly without the pipeline:
from chonkie import FileFetcher

fetcher = FileFetcher()

# Single file
file_path = fetcher.fetch(path="document.txt")
print(file_path)  # document.txt (file_path is a pathlib.Path)

# Directory with extension filter
files = fetcher.fetch(dir="./docs", ext=[".txt", ".md"])
for file in files:
    print(file)

Error Handling

FileFetcher validates inputs and provides clear error messages:
# FileNotFoundError if file doesn't exist
fetcher.fetch(path="nonexistent.txt")  # Raises FileNotFoundError

# ValueError if both path and dir are provided
fetcher.fetch(path="file.txt", dir="./docs")  # Raises ValueError

# ValueError if neither is provided
fetcher.fetch()  # Raises ValueError
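
In practice you will usually want to catch the missing-file case while letting argument mistakes surface. The sketch below wraps any fetch-like callable. The name fetch_or_none is hypothetical, and the choice to swallow only FileNotFoundError is a design assumption, not library behavior.

```python
from pathlib import Path
from typing import Any, Callable, List, Optional, Union

FetchResult = Union[Path, List[Path]]


def fetch_or_none(fetch: Callable[..., FetchResult], **kwargs: Any) -> Optional[FetchResult]:
    """Call `fetch(**kwargs)`; return None when the target is missing.

    ValueError is deliberately not caught: per the rules above it signals a
    wrong combination of arguments, which is a programming error.
    """
    try:
        return fetch(**kwargs)
    except FileNotFoundError as exc:
        print(f"Skipping fetch: {exc}")
        return None
```

A call would look like `fetch_or_none(fetcher.fetch, path="nonexistent.txt")`, which returns None instead of raising.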

Best Practices

When working with directories containing many files, always specify ext to avoid processing unwanted files:
# Good - only processes markdown files
.fetch_from("file", dir="./docs", ext=[".md"])

# Potentially slow - processes ALL files
.fetch_from("file", dir="./docs")
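
Relative paths are resolved against the current working directory, so they can break when the pipeline is launched from a different location. Building an absolute path from the script's own location avoids this: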
from pathlib import Path

docs_dir = Path(__file__).parent / "documents"
.fetch_from("file", dir=str(docs_dir), ext=[".txt"])
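
The same idea can be wrapped in a small helper. This is a sketch using only pathlib; anchored_dir is a hypothetical name, and calling resolve() is an added precaution so the result is absolute even when the script path passed in is relative.

```python
from pathlib import Path


def anchored_dir(script_file: str, name: str) -> Path:
    """Return the absolute path of the `name` folder next to `script_file`."""
    return Path(script_file).resolve().parent / name
```

Inside a script, `docs_dir = anchored_dir(__file__, "documents")` gives a path that does not depend on the directory the script was started from.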

What’s Next?

After fetching files, you’ll typically want to:
  1. Process them with a Chef to parse content
  2. Chunk them with a Chunker to split into manageable pieces
  3. Refine chunks with Refineries for better quality
See the Pipeline Guide for complete examples.