MarkdownChef processes markdown files and strings, extracting tables, code blocks, and images into a structured MarkdownDocument. Each extracted component is stored separately and records its start and end position in the original text.

Installation

MarkdownChef is included in the base installation of Chonkie. No additional dependencies are required.
For installation instructions, see the Installation Guide.

Initialization

from chonkie import MarkdownChef

# Basic initialization with default tokenizer
chef = MarkdownChef()

# Initialize with a specific tokenizer
chef = MarkdownChef(tokenizer="gpt2")

# Or use a custom tokenizer instance
from transformers import AutoTokenizer
custom_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
chef = MarkdownChef(tokenizer=custom_tokenizer)

Parameters

tokenizer
Union[TokenizerProtocol, str]
default:"character"
Tokenizer to use for counting tokens in text chunks. Can be a string identifier ("character", "gpt2", etc.) or a tokenizer instance that follows the TokenizerProtocol.

Methods

process()

Process a markdown file.

Parameters

path
Union[str, Path]
required
Path to the markdown file (string or Path object)

Returns

MarkdownDocument containing parsed content with extracted tables, code, images, and text chunks

process_batch()

Process multiple markdown files at once.

Parameters

paths
List[Union[str, Path]]
required
List of file paths to process

Returns

List of MarkdownDocument objects
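
Because process_batch() takes a list of paths, a common first step is gathering the markdown files to process. The following is a minimal sketch using only the standard library; `collect_markdown_paths` is a hypothetical helper, not part of Chonkie, and the `docs/` directory in the commented usage is an assumption.

```python
from pathlib import Path
from typing import List, Union


def collect_markdown_paths(root: Union[str, Path]) -> List[Path]:
    """Return every .md file under root (recursively), sorted for a stable order."""
    return sorted(p for p in Path(root).rglob("*.md") if p.is_file())


# Hypothetical usage with the chef documented above:
# chef = MarkdownChef()
# docs = chef.process_batch(collect_markdown_paths("docs/"))
```

The helper returns Path objects, which match the documented List[Union[str, Path]] parameter type.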

Basic Usage

from chonkie import MarkdownChef

# Initialize the chef
chef = MarkdownChef()

# Process a markdown file
doc = chef.process("example.md")

# Access the extracted components
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Found {len(doc.images)} images")
print(f"Found {len(doc.chunks)} text chunks")
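
The counts above can be folded into a small summary helper. The sketch below relies only on the documented attributes (tables, code, images, chunks, and each code block's language field, described under Return Type). `summarize` is a hypothetical helper, and `stand_in_doc` is a stand-in object used so the example runs without a markdown file; in practice you would pass the result of chef.process().

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict


def summarize(doc: Any) -> Dict[str, Any]:
    """Count extracted components and tally code blocks by language."""
    # Code blocks without a language tag have language=None.
    languages = Counter(block.language or "unspecified" for block in doc.code)
    return {
        "tables": len(doc.tables),
        "code": len(doc.code),
        "images": len(doc.images),
        "chunks": len(doc.chunks),
        "languages": dict(languages),
    }


# Stand-in for a MarkdownDocument (assumption: only these attributes are read).
stand_in_doc = SimpleNamespace(
    tables=[],
    code=[SimpleNamespace(language="python"), SimpleNamespace(language=None)],
    images=[],
    chunks=[],
)
print(summarize(stand_in_doc))
```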

Return Type

MarkdownChef returns a MarkdownDocument object, which extends the base Document class with additional fields:

@dataclass
class MarkdownTable:
    content: str          # The table content
    start_index: int      # Starting position in original text
    end_index: int        # Ending position in original text

@dataclass
class MarkdownCode:
    content: str              # The code content
    language: Optional[str]   # Programming language (if specified)
    start_index: int          # Starting position in original text
    end_index: int            # Ending position in original text

@dataclass
class MarkdownImage:
    alias: str                # Alt text or filename
    content: str              # Image path or data URL
    start_index: int          # Starting position in original text
    end_index: int            # Ending position in original text
    link: Optional[str]       # Link URL (if image is clickable)

@dataclass
class MarkdownDocument(Document):
    id: str                         # Unique document ID
    content: str                    # Full markdown content
    tables: List[MarkdownTable]     # Extracted tables
    code: List[MarkdownCode]        # Extracted code blocks
    images: List[MarkdownImage]     # Extracted images
    chunks: List[Chunk]             # Remaining text chunks
    metadata: Dict[str, Any]        # Additional metadata
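
The start_index and end_index fields let each extracted component be mapped back to the original text. The sketch below illustrates the idea with a deliberately simple regular expression for inline images. It is not Chonkie's parser: `ImageSpan` and `extract_images` are hypothetical names, and the sketch assumes the indices are character offsets with an exclusive end.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ImageSpan:
    """Hypothetical stand-in mirroring some of MarkdownImage's fields."""
    alias: str        # Alt text
    content: str      # Image path or URL
    start_index: int  # Offset of the first character of the match
    end_index: int    # Offset one past the last character of the match


# Matches ![alt](target); titles and nested brackets are deliberately ignored.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def extract_images(text: str) -> List[ImageSpan]:
    """Find inline images and record where each one sits in the text."""
    return [
        ImageSpan(m.group(1), m.group(2), m.start(), m.end())
        for m in IMAGE_PATTERN.finditer(text)
    ]


text = "Intro\n\n![logo](img/logo.png)\n\nMore text."
for image in extract_images(text):
    # Slicing with the recorded offsets recovers the original markup.
    print(image.alias, repr(text[image.start_index:image.end_index]))
```

Under the same assumption, slicing a document's content with a component's start_index and end_index would recover that component's span in the source.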