The MarkdownChef processes markdown files and strings, extracting tables, code blocks, and images into a structured MarkdownDocument.
It parses markdown content into distinct components (tables, code blocks, images, and the remaining text) and records each extracted component's start and end position in the original text.
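To make "preserving positions" concrete, here is a minimal sketch of position-aware extraction for fenced code blocks. It is not Chonkie's implementation: the `CodeSpan` class, the `extract_code_spans` helper, and the regular expression are hypothetical stand-ins chosen for illustration, and the sketch assumes that positions are character offsets with an exclusive end.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

# Opening fence with an optional language tag, the body, then the closing fence.
FENCE_RE = re.compile(
    re.escape(FENCE) + r"([\w+-]*)\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


@dataclass
class CodeSpan:
    """Hypothetical stand-in for an extracted code block (not a Chonkie class)."""
    content: str
    language: Optional[str]
    start_index: int  # offset of the opening fence in the original text
    end_index: int    # offset just past the closing fence (exclusive)


def extract_code_spans(text: str) -> List[CodeSpan]:
    """Find fenced code blocks and record where each one sits in `text`."""
    spans = []
    for match in FENCE_RE.finditer(text):
        spans.append(
            CodeSpan(
                content=match.group(2),
                language=match.group(1) or None,
                start_index=match.start(),
                end_index=match.end(),
            )
        )
    return spans
```

Because each span keeps its offsets, slicing the original text with `text[span.start_index:span.end_index]` returns the full fenced block, which is the property the position fields are meant to provide.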
Installation
MarkdownChef is included in the base installation of Chonkie. No additional dependencies are required.
Initialization
from chonkie import MarkdownChef
# Basic initialization with default tokenizer
chef = MarkdownChef()
# Initialize with a specific tokenizer
chef = MarkdownChef(tokenizer="gpt2")
# Or use a custom tokenizer instance
from transformers import AutoTokenizer
custom_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
chef = MarkdownChef(tokenizer=custom_tokenizer)
Parameters
tokenizer
Union[TokenizerProtocol, str]
default:"character"
Tokenizer to use for counting tokens in text chunks. Can be a string
identifier (“character”, “gpt2”, etc.) or a tokenizer instance that follows
the TokenizerProtocol.
Methods
process()
Process a markdown file.
Parameters
path
Union[str, Path]
required
Path to the markdown file (string or Path object)
Returns
MarkdownDocument containing parsed content with extracted tables, code, images, and text chunks
process_batch()
Process multiple markdown files at once.
Parameters
paths
List[Union[str, Path]]
required
List of file paths to process
Returns
List of MarkdownDocument objects
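Since process_batch() takes a list of paths, a common first step is gathering the markdown files under a directory. The helper below is a small sketch using only the standard library; `collect_markdown_paths` is a hypothetical name and is not part of Chonkie.

```python
from pathlib import Path
from typing import List


def collect_markdown_paths(root: str) -> List[Path]:
    """Return every .md file under `root` (recursively), sorted for a stable order."""
    return sorted(p for p in Path(root).rglob("*.md") if p.is_file())
```

The resulting list can then be handed to the chef, for example `docs = chef.process_batch(collect_markdown_paths("docs/"))`, where `"docs/"` is a placeholder directory.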
Basic Usage
from chonkie import MarkdownChef
# Initialize the chef
chef = MarkdownChef()
# Process a markdown file
doc = chef.process("example.md")
# Access the extracted components
print(f"Found {len(doc.tables)} tables")
print(f"Found {len(doc.code)} code blocks")
print(f"Found {len(doc.images)} images")
print(f"Found {len(doc.chunks)} text chunks")
Return Type
MarkdownChef returns a MarkdownDocument object, which extends the base Document class with additional fields:
@dataclass
class MarkdownTable:
    content: str      # The table content
    start_index: int  # Starting position in original text
    end_index: int    # Ending position in original text

@dataclass
class MarkdownCode:
    content: str             # The code content
    language: Optional[str]  # Programming language (if specified)
    start_index: int         # Starting position in original text
    end_index: int           # Ending position in original text

@dataclass
class MarkdownImage:
    alias: str           # Alt text or filename
    content: str         # Image path or data URL
    start_index: int     # Starting position in original text
    end_index: int       # Ending position in original text
    link: Optional[str]  # Link URL (if image is clickable)

@dataclass
class MarkdownDocument(Document):
    id: str                      # Unique document ID
    content: str                 # Full markdown content
    tables: List[MarkdownTable]  # Extracted tables
    code: List[MarkdownCode]     # Extracted code blocks
    images: List[MarkdownImage]  # Extracted images
    chunks: List[Chunk]          # Remaining text chunks
    metadata: Dict[str, Any]     # Additional metadata
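Because every extracted component carries start_index and end_index, the separate lists can be merged back into document order. The sketch below relies only on the field names listed above; the helper names are hypothetical, and it assumes the indices are character offsets into content with an exclusive end, which the field descriptions do not state explicitly.

```python
from typing import Any, List, Tuple


def components_in_order(doc: Any) -> List[Tuple[str, Any]]:
    """Merge tables, code blocks, and images into one list sorted by position."""
    tagged = (
        [("table", t) for t in doc.tables]
        + [("code", c) for c in doc.code]
        + [("image", i) for i in doc.images]
    )
    return sorted(tagged, key=lambda pair: pair[1].start_index)


def original_text(doc: Any, item: Any) -> str:
    """Slice the source markdown covered by one extracted component."""
    # Assumption: indices are character offsets and end_index is exclusive.
    return doc.content[item.start_index:item.end_index]
```

With a document returned by `chef.process(...)`, iterating over `components_in_order(doc)` yields `(kind, item)` pairs in the order the components appear in the file.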