Document loaders
Reads .pdf, .docx, .xlsx, .csv, .tsv, .md, .txt, and
.html files into a normalized Document whose content is
markdown text. All loaders live under loomflow.loader.
One-liner: auto-detect by extension
from loomflow.loader import load
doc = load("research.pdf")
print(doc.content[:500]) # markdown
print(doc.metadata)
# {"source": "research.pdf", "format": "pdf", "page_count": 42,
# "title": "Research Report", "backend": "unstructured", "strategy": "fast"}
Per-format loaders
from loomflow.loader import (
    load_pdf, load_docx, load_excel,
    load_csv, load_tsv,
    load_markdown, load_text, load_html,
)
doc = load_pdf("research.pdf")
doc = load_docx("brief.docx")
doc = load_excel("metrics.xlsx") # → markdown table per sheet
doc = load_csv("rows.csv") # → markdown table
doc = load_tsv("rows.tsv")
doc = load_markdown("notes.md") # passthrough
doc = load_text("plain.txt") # passthrough
doc = load_html("page.html") # → markdown via BeautifulSoup
Document shape
from dataclasses import dataclass
from typing import Any

@dataclass
class Document:
    content: str              # full markdown
    metadata: dict[str, Any]  # source / format / format-specific keys
metadata always contains:
- source: the source file path (str).
- format: one of "pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html".
Format-specific keys are added when relevant: page_count /
title / backend / strategy for PDFs, sheet_names for Excel,
row_count for CSVs, etc.
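To make the shape concrete, here is a minimal sketch of a stand-in Document and the two keys that metadata always carries. The names EXTENSION_TO_FORMAT, detect_format, and base_metadata are hypothetical and are not part of loomflow. The case-insensitive extension match is also an assumption, not confirmed behavior of load.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# Extension -> format value. The values are the ones listed above;
# the mapping table itself is an assumption, not loomflow's actual code.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx", ".csv": "csv",
    ".tsv": "tsv", ".md": "md", ".txt": "txt", ".html": "html",
}


@dataclass
class Document:
    """Stand-in with the same two fields as the Document described above."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def detect_format(path: str) -> str:
    """Hypothetical helper: pick a format value from the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported extension: {suffix!r}") from None


def base_metadata(path: str) -> dict[str, Any]:
    """Build the two keys that metadata always contains."""
    return {"source": path, "format": detect_format(path)}


doc = Document(content="# Notes\n", metadata=base_metadata("notes.md"))
print(doc.metadata)
```

Downstream code can rely on source and format being present and should treat every other key as optional, for example with doc.metadata.get("page_count").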
Loader output
| Source | Markdown output |
|---|---|
| PDF | Element-aware: titles, paragraphs, lists, tables, per-page sections. Two interchangeable backends. See below. |
| DOCX | Headings + paragraphs preserved with proper hierarchy. |
| Excel | One markdown table per sheet, prefixed with ## SheetName. |
| CSV / TSV | One big markdown table. |
| HTML | Headings + paragraphs + lists preserved; scripts / styles stripped. |
| Markdown / text | Passed through unchanged. |
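As an illustration of the CSV / TSV and Excel rows in the table above, the following sketch turns delimited text into one markdown table and prefixes each sheet with a ## SheetName heading. It uses only the standard library. The function names are hypothetical, and details such as pipe escaping and padding of short rows are assumptions, not a description of what loomflow actually emits.

```python
import csv
import io


def rows_to_markdown_table(rows):
    """Render rows as a pipe table; the first row is treated as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(row):
        # Pad short rows and escape pipes so a cell cannot break the table.
        cells = [c.replace("|", "\\|") for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "|" + "---|" * width]
    lines.extend(fmt(r) for r in body)
    return "\n".join(lines)


def csv_to_markdown(text, delimiter=","):
    """Parse delimited text (use delimiter='\\t' for TSV) into one table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows_to_markdown_table(rows)


def sheets_to_markdown(sheets):
    """One table per sheet, each prefixed with '## SheetName'."""
    return "\n\n".join(
        f"## {name}\n\n{rows_to_markdown_table(rows)}"
        for name, rows in sheets.items()
    )


print(csv_to_markdown("name,score\nada,9\nbob,7\n"))
```

Wide or very long files still become a single table under this approach, so splitting is left to the chunking stage.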
PDF: two interchangeable backends
load_pdf() ships two backends, picked at load time via backend=.
Both produce the same Document shape, so downstream code (chunkers,
vector stores) doesn’t care which was used.
from loomflow.loader import load_pdf
# Default — unstructured, fast strategy
doc = load_pdf("research.pdf")
# Tune the unstructured strategy
doc = load_pdf("scanned.pdf", strategy="ocr_only", languages=["eng", "fra"])
doc = load_pdf("complex_layout.pdf", strategy="hi_res")
# Or switch to docling (IBM Research)
doc = load_pdf("research.pdf", backend="docling")
backend="unstructured" (default)
Wraps the unstructured library (Apache 2.0; the same engine
behind LangChain’s UnstructuredPDFLoader). It parses at the element
level, tagging each element with a category (Title / NarrativeText /
Table / ListItem / Image) and attaching per-page metadata. It has
been tested across thousands of RAG pipelines.
Three strategy modes:
| strategy= | Engine | Best for |
|---|---|---|
| "fast" (default) | pdfminer.six (pure Python) | Native text PDFs. Fast. No model downloads. |
| "hi_res" | YOLO layout detection (unstructured-inference) | Multi-column / table-heavy / mixed layouts. |
| "ocr_only" | Tesseract OCR | Scanned or image-only PDFs. |
languages=["eng", "fra"] sets the OCR / layout languages. It is only
meaningful with hi_res and ocr_only, and it defaults to English.
backend="docling"
Wraps docling (MIT, IBM Research). ML-based, structure-aware
extraction; the 2026 best-in-class benchmark winner for native
PDFs. Outputs clean markdown with hierarchy preserved (titles,
sections, tables, lists). The first run is slower because the layout
model is downloaded and then cached; later runs are comparable in speed.
The strategy= and languages= kwargs are ignored by the docling
backend. It always runs its full layout-aware pipeline.
Picking a backend
| You have… | Use |
|---|---|
| Native text PDFs, lots of them, want speed | "unstructured" + strategy="fast" (the default) |
| Multi-column research papers / financial reports | "unstructured" + strategy="hi_res" |
| Scanned / image-only PDFs | "unstructured" + strategy="ocr_only" + languages= |
| Want the best quality on native PDFs and don’t mind a one-time model download | "docling" |
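The table above can be encoded as a small lookup that produces keyword arguments for load_pdf. This is only a sketch: the profile labels ("native", "complex_layout", "scanned", "best_quality") and the pdf_kwargs helper are hypothetical, while backend, strategy, and languages are the keyword arguments documented on this page.

```python
# Keyword arguments for load_pdf, keyed by a hypothetical profile label.
PDF_PROFILES = {
    "native": {"backend": "unstructured", "strategy": "fast"},
    "complex_layout": {"backend": "unstructured", "strategy": "hi_res"},
    "scanned": {"backend": "unstructured", "strategy": "ocr_only"},
    "best_quality": {"backend": "docling"},
}


def pdf_kwargs(kind, languages=None):
    """Return a fresh kwargs dict for the given profile."""
    kwargs = dict(PDF_PROFILES[kind])  # copy so callers cannot mutate the table
    # languages only matters for hi_res / ocr_only; docling ignores it.
    if languages and kwargs.get("strategy") in {"hi_res", "ocr_only"}:
        kwargs["languages"] = list(languages)
    return kwargs


print(pdf_kwargs("scanned", ["eng", "fra"]))
# With loomflow installed, this would be used as:
#   doc = load_pdf("scanned.pdf", **pdf_kwargs("scanned", ["eng", "fra"]))
```

Keeping the choice in one table makes it easy to route different folders or file sources to different profiles without scattering keyword arguments through the pipeline.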
Failure handling
PDFs vary wildly. When extraction fails on a specific file, load_pdf:
- Emits a RuntimeWarning with the backend, strategy, and the underlying exception.
- Logs at WARNING to loomflow.loader.pdf.
- Returns a non-fatal empty Document with extraction_error in metadata so the pipeline keeps going on bad inputs.
import warnings
from loomflow.loader import load_pdf
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    doc = load_pdf("broken.pdf")
if w:
    print(w[0].message) # "unstructured failed to parse 'broken.pdf' ..."
if doc.metadata.get("extraction_error"):
    handle_unparseable(doc)

Migration from pypdf (pre-0.10). Older versions of the framework
shipped a pypdf-backed PDF loader. It silently produced empty
text on multi-column or table-heavy PDFs, which caused the classic
“questions about content near the end of the PDF go unanswered” symptom. The
new loader is a strict upgrade; the public API stayed the same
except for the new keyword args. Existing code calling
load_pdf(path) keeps working unchanged.
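Because a failed extraction returns an empty Document with extraction_error in metadata instead of raising, one possible pattern is to retry the same file with progressively heavier settings. The sketch below assumes that behavior. The helper load_with_fallback and the attempt order are hypothetical, and the loader is passed in as a callable so the example runs without loomflow installed.

```python
from types import SimpleNamespace

# Attempts in order, cheapest first. The ordering is an assumption.
DEFAULT_ATTEMPTS = (
    {"strategy": "fast"},
    {"strategy": "hi_res"},
    {"backend": "docling"},
)


def load_with_fallback(path, loader, attempts=DEFAULT_ATTEMPTS):
    """Call loader(path, **kwargs) until an attempt has no extraction_error.

    `loader` is any callable with load_pdf's signature. If every attempt
    fails, the last Document is returned so the caller can still inspect
    metadata["extraction_error"]. Returns None if attempts is empty.
    """
    doc = None
    for kwargs in attempts:
        doc = loader(path, **kwargs)
        if not doc.metadata.get("extraction_error"):
            break
    return doc


def fake_loader(path, **kwargs):
    # Stand-in for load_pdf: only the hi_res strategy "succeeds" here.
    if kwargs.get("strategy") == "hi_res":
        return SimpleNamespace(content="# ok", metadata={"source": path})
    return SimpleNamespace(
        content="", metadata={"source": path, "extraction_error": "boom"}
    )


doc = load_with_fallback("broken.pdf", fake_loader)
print(doc.content, doc.metadata)
```

In real use the loader argument would be load_pdf itself. Each failed attempt would still emit its RuntimeWarning and log entry, so the fallback does not hide failures.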
Optional dependencies
Default (unstructured)
pip install 'loomflow[loader-pdf]'
Pulls unstructured[pdf]>=0.15, which covers strategy="fast" (the
default). For hi_res / ocr_only, also install
Tesseract (system binary) and the model deps:
pip install 'unstructured[pdf,ocr]' # OCR extras
# macOS: brew install tesseract poppler
# Debian: apt-get install -y tesseract-ocr poppler-utils
load_pdf / load_docx / etc. raise ImportError with the right
pip install hint if a dependency isn’t available.
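The "ImportError with an install hint" behavior is a common optional-dependency pattern. A minimal sketch of how such a check can be written with the standard library follows. The require helper and its message wording are hypothetical and are not loomflow's actual implementation. Only the loader-pdf extra name comes from this page.

```python
import importlib.util


def require(module, extra):
    """Raise ImportError with an install hint if `module` is not importable.

    `module` should be a top-level module name; `extra` is the name of the
    optional-dependency group to suggest.
    """
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module!r} is required for this loader. "
            f"Install it with: pip install 'loomflow[{extra}]'"
        )


require("json", "loader-pdf")  # stdlib module, so this passes silently

try:
    require("surely_missing_module_xyz", "loader-pdf")
except ImportError as exc:
    print(exc)
```

Checking with find_spec avoids importing a heavy dependency just to confirm that it is installed.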
Loading a folder
from pathlib import Path
from loomflow.loader import load
docs = [load(str(p)) for p in Path("docs/").glob("**/*.pdf")]
metadata["source"] carries the path, so you can trace
chunks back to their files later.
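The example above globs only PDFs. For a folder with mixed formats, one option is to filter by the extensions listed at the top of this page before handing each path to load. The sketch below uses only pathlib. The supported_files helper is hypothetical, and matching extensions case-insensitively is an assumption about how load treats names such as REPORT.PDF.

```python
from pathlib import Path

# Extensions taken from the list of supported formats on this page.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".csv", ".tsv", ".md", ".txt", ".html"}


def supported_files(root):
    """Return files under root with a supported extension, in sorted order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


for path in supported_files("."):
    print(path)
# With loomflow installed:
#   docs = [load(str(p)) for p in supported_files("docs/")]
```

Sorting keeps the document order stable between runs, which helps when comparing index builds.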
For mixed-quality corpora, pass an explicit backend / strategy via
load_pdf instead of the auto-dispatch load:
from pathlib import Path
from loomflow.loader import load_pdf
docs = [
    load_pdf(str(p), strategy="hi_res")
    for p in Path("scans/").glob("*.pdf")
]

Why markdown? Every chunker downstream expects markdown. It’s the lingua franca that preserves structure (headings, tables, lists) while staying easy for the LLM to read. The loaders normalize once; chunkers and vector stores never need to know the source format.
Next
→ Chunkers. Splitting the markdown into LLM-friendly pieces.