Document loaders

The loaders read .pdf, .docx, .xlsx, .csv, .tsv, .md, .txt, and .html files into a normalized Document whose content is markdown text. All loaders live under loomflow.loader.

One-liner: auto-detect by extension


from loomflow.loader import load
 
doc = load("research.pdf")
print(doc.content[:500])     # markdown
print(doc.metadata)
# {"source": "research.pdf", "format": "pdf", "page_count": 42,
#  "title": "Research Report", "backend": "unstructured", "strategy": "fast"}
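
How load maps an extension to a loader is not spelled out here. A minimal sketch of the idea, assuming a plain suffix lookup; the detect_format helper and the SUFFIX_TO_FORMAT table are hypothetical names for illustration, not part of loomflow:

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the "format" value stored in
# Document.metadata; the real dispatch table inside loomflow may differ.
SUFFIX_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".tsv": "tsv",
    ".md": "md",
    ".txt": "txt",
    ".html": "html",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None


print(detect_format("research.pdf"))   # pdf
print(detect_format("Metrics.XLSX"))   # xlsx
```

Lower-casing the suffix keeps files such as REPORT.PDF from falling through to the unsupported branch.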

Per-format loaders


from loomflow.loader import (
    load_pdf, load_docx, load_excel,
    load_csv, load_tsv,
    load_markdown, load_text, load_html,
)
 
doc = load_pdf("research.pdf")
doc = load_docx("brief.docx")
doc = load_excel("metrics.xlsx")        # → markdown table per sheet
doc = load_csv("rows.csv")              # → markdown table
doc = load_tsv("rows.tsv")
doc = load_markdown("notes.md")         # passthrough
doc = load_text("plain.txt")            # passthrough
doc = load_html("page.html")            # → markdown via BeautifulSoup

Document shape


from dataclasses import dataclass
from typing import Any

@dataclass
class Document:
    content: str                        # full markdown
    metadata: dict[str, Any]            # source / format / format-specific

metadata always contains:

source. The source file path (str).
format. One of "pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html".

Format-specific keys are added when relevant: page_count / title / backend / strategy for PDFs, sheet_names for Excel, row_count for CSVs, etc.
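
Because only source and format are guaranteed, code that reads metadata should treat everything else as optional. A small sketch of that pattern, using a stand-in Document with the same shape so it runs without loomflow installed; the describe helper is hypothetical:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    # Stand-in mirroring the documented shape.
    content: str
    metadata: dict[str, Any]


def describe(doc: Document) -> str:
    """Build a one-line summary from guaranteed and optional metadata keys."""
    # "source" and "format" are always present, so index them directly.
    parts = [f"{doc.metadata['source']} ({doc.metadata['format']})"]
    # Format-specific keys may be absent; use .get() and skip missing ones.
    for key in ("page_count", "sheet_names", "row_count"):
        value = doc.metadata.get(key)
        if value is not None:
            parts.append(f"{key}={value}")
    return ", ".join(parts)


pdf = Document("# Research Report",
               {"source": "research.pdf", "format": "pdf", "page_count": 42})
rows = Document("| a |", {"source": "rows.csv", "format": "csv", "row_count": 3})
print(describe(pdf))    # research.pdf (pdf), page_count=42
print(describe(rows))   # rows.csv (csv), row_count=3
```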

Loader output

Source	Markdown output
PDF	Element-aware: titles, paragraphs, lists, tables, per-page sections. Two interchangeable backends. See below.
DOCX	Headings + paragraphs preserved with proper hierarchy.
Excel	One markdown table per sheet, prefixed with `## SheetName`.
CSV / TSV	One big markdown table.
HTML	Headings + paragraphs + lists preserved; scripts / styles stripped.
Markdown / text	Passed through unchanged.
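
To make the CSV / TSV row concrete, here is a rough sketch of turning delimited text into one markdown table with the standard library. It illustrates the output shape only; csv_to_markdown is a hypothetical helper, and the real loader may handle quoting, padding, and escaping differently:

```python
import csv
import io


def csv_to_markdown(text: str, delimiter: str = ",") -> str:
    """Render delimited text as a single markdown table (first row = header)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def fmt(cells: list[str]) -> str:
        # Pad short rows and escape pipes so cells cannot break the table.
        padded = cells + [""] * (len(header) - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in padded) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


print(csv_to_markdown("name,score\nada,9\nlin,7"))
# | name | score |
# | --- | --- |
# | ada | 9 |
# | lin | 7 |
```

Passing delimiter="\t" gives the TSV variant.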

PDF: two interchangeable backends

load_pdf() ships two backends, picked at load time via backend=. Both produce the same Document shape, so downstream code (chunkers, vector stores) doesn’t care which was used.


from loomflow.loader import load_pdf
 
# Default — unstructured, fast strategy
doc = load_pdf("research.pdf")
 
# Tune the unstructured strategy
doc = load_pdf("scanned.pdf", strategy="ocr_only", languages=["eng", "fra"])
doc = load_pdf("complex_layout.pdf", strategy="hi_res")
 
# Or switch to docling (IBM Research)
doc = load_pdf("research.pdf", backend="docling")

`backend="unstructured"` (default)

Wraps the unstructured library (Apache 2.0; the same engine behind LangChain’s UnstructuredPDFLoader). It parses at the element level, tagging each element with a category (Title, NarrativeText, Table, ListItem, Image) and attaching per-page metadata. It is widely used in RAG pipelines.

Three strategy modes:

`strategy=`	Engine	Best for
`"fast"` (default)	`pdfminer.six` (pure Python)	Native text PDFs. Fast. No model downloads.
`"hi_res"`	YOLO layout detection (`unstructured-inference`)	Multi-column / table-heavy / mixed layouts.
`"ocr_only"`	Tesseract OCR	Scanned or image-only PDFs.

languages=["eng", "fra"] sets the OCR / layout languages. It only has an effect with hi_res and ocr_only, and defaults to English.

`backend="docling"`

Wraps docling (MIT, IBM Research). ML-based, structure-aware extraction, rated best-in-class for native PDFs in 2026 benchmarks. It outputs clean markdown with hierarchy preserved (titles, sections, tables, lists). The first run is slower because it downloads the layout model, which is then cached; later runs are comparable in speed.

The strategy= and languages= kwargs are ignored by the docling backend. It always runs its full layout-aware pipeline.

Picking a backend

You have…	Use
Native text PDFs, lots of them, want speed	`"unstructured"` + `strategy="fast"` (the default)
Multi-column research papers / financial reports	`"unstructured"` + `strategy="hi_res"`
Scanned / image-only PDFs	`"unstructured"` + `strategy="ocr_only"` + `languages=`
Want the best quality on native PDFs and don’t mind a one-time model download	`"docling"`
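
The table can be encoded as a small helper that returns keyword arguments for load_pdf. This is a sketch only: pdf_kwargs and its kind labels ("native", "complex", "scanned", "best_quality") are hypothetical names invented for the example, while the backend=, strategy=, and languages= values are the ones documented above.

```python
from typing import Any, Optional


def pdf_kwargs(kind: str, languages: Optional[list[str]] = None) -> dict[str, Any]:
    """Map a rough description of a PDF to load_pdf keyword arguments."""
    if kind == "native":
        return {"backend": "unstructured", "strategy": "fast"}
    if kind == "complex":
        return {"backend": "unstructured", "strategy": "hi_res"}
    if kind == "scanned":
        kwargs: dict[str, Any] = {"backend": "unstructured", "strategy": "ocr_only"}
        if languages:
            kwargs["languages"] = languages
        return kwargs
    if kind == "best_quality":
        # docling ignores strategy= and languages=, so pass neither.
        return {"backend": "docling"}
    raise ValueError(f"unknown PDF kind: {kind!r}")


print(pdf_kwargs("scanned", ["eng", "fra"]))
# {'backend': 'unstructured', 'strategy': 'ocr_only', 'languages': ['eng', 'fra']}
```

With loomflow installed, the result would be unpacked into the call, for example load_pdf(path, **pdf_kwargs("complex")).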

Failure handling

PDFs vary wildly. When extraction fails on a specific file, load_pdf:

Emits a RuntimeWarning with the backend, strategy, and the underlying exception.
Logs at WARNING to loomflow.loader.pdf.
Returns an empty Document with extraction_error in metadata instead of raising, so the pipeline keeps going on bad inputs.


import warnings
from loomflow.loader import load_pdf
 
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    doc = load_pdf("broken.pdf")
    if w:
        print(w[0].message)        # "unstructured failed to parse 'broken.pdf' ..."
 
if doc.metadata.get("extraction_error"):
    handle_unparseable(doc)
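
In a batch job, the same check can separate usable documents from failed ones before chunking. A minimal sketch, using a stand-in Document so it runs without loomflow; split_failures is a hypothetical helper, and it assumes only that extraction_error holds a truthy value (such as an error string) when extraction failed:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    # Stand-in with the same shape as the loader's Document.
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def split_failures(docs):
    """Separate usable documents from ones flagged with extraction_error."""
    ok, failed = [], []
    for doc in docs:
        (failed if doc.metadata.get("extraction_error") else ok).append(doc)
    return ok, failed


docs = [
    Document("# Report", {"source": "good.pdf", "format": "pdf"}),
    Document("", {"source": "broken.pdf", "format": "pdf",
                  "extraction_error": "placeholder error text"}),
]
ok, failed = split_failures(docs)
print([d.metadata["source"] for d in ok])       # ['good.pdf']
print([d.metadata["source"] for d in failed])   # ['broken.pdf']
```

The failed list is a natural place to retry with another strategy or backend, or to log the source paths for manual review.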

Migration from pypdf (pre-0.10). Older versions of the framework shipped a pypdf-backed PDF loader. It silently produced empty text on multi-column or table-heavy PDFs, which caused the classic symptom of questions about content near the end of the PDF going unanswered. The new loader is a strict upgrade; the public API stayed the same except for the new keyword args. Existing code calling load_pdf(path) keeps working unchanged.

Optional dependencies

Default (unstructured)


pip install 'loomflow[loader-pdf]'

Pulls unstructured[pdf]>=0.15, which covers strategy="fast" (the default) out of the box. For hi_res / ocr_only, also install Tesseract (system binary) and the model deps:


pip install 'unstructured[pdf,ocr]'   # OCR extras
# macOS:    brew install tesseract poppler
# Debian:   apt-get install -y tesseract-ocr poppler-utils

Add docling backend


pip install 'loomflow[loader-pdf-docling]'

Adds docling>=2.0 and its layout-model dependencies. Use with load_pdf(path, backend="docling"). First run downloads the layout model (~250 MB); cached afterwards.

You can install both backends side-by-side and pick per call.

All loaders


pip install 'loomflow[loader]'

Installs all non-PDF loaders plus unstructured[pdf] for PDFs. For docling, add loader-pdf-docling separately.

Per-format extras for non-PDF formats:


pip install 'loomflow[loader-docx]'         # python-docx
pip install 'loomflow[loader-excel]'        # openpyxl
pip install 'loomflow[loader-html]'         # beautifulsoup4
pip install 'loomflow[loader-token]'        # tiktoken (for TokenChunker)

load_pdf / load_docx / etc. raise ImportError with the right pip install hint if a dependency isn’t available.
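
One common way to produce that kind of error is a lazy import wrapper. The sketch below shows the general pattern, not loomflow's actual code; the require helper and the wording of its message are assumptions made for illustration:

```python
import importlib


def require(module: str, extra: str):
    """Import `module`, or raise ImportError naming the extra that provides it."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"{module!r} is required for this loader. "
            f"Install it with: pip install 'loomflow[{extra}]'"
        ) from exc


# A stdlib module stands in for an installed dependency.
json_mod = require("json", "loader")

# A module that does not exist stands in for a missing optional dependency.
try:
    require("module_that_does_not_exist_xyz", "loader-docx")
except ImportError as err:
    print(err)
```

Deferring the import to call time means importing loomflow.loader itself never fails just because one optional format is not installed.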

Loading a folder


from pathlib import Path
from loomflow.loader import load
 
docs = [load(str(p)) for p in Path("docs/").glob("**/*.pdf")]

metadata["source"] carries the path, so you can trace chunks back to their files later.

For mixed-quality corpora, pass an explicit backend / strategy via load_pdf instead of the auto-dispatch load:


from pathlib import Path
from loomflow.loader import load_pdf
 
docs = [
    load_pdf(str(p), strategy="hi_res")
    for p in Path("scans/").glob("*.pdf")
]
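
For folders that mix formats, it can help to filter to supported extensions before calling load. A sketch under stated assumptions: loadable_files is a hypothetical helper, and the SUPPORTED set simply repeats the extensions listed at the top of this page.

```python
import tempfile
from pathlib import Path

# Extensions the loaders are documented to handle.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".csv", ".tsv", ".md", ".txt", ".html"}


def loadable_files(root: Path) -> list[Path]:
    """Recursively list files under root whose extension a loader handles."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "notes.md", "image.png", "sub/rows.csv"):
        (root / name).touch()
    print([p.relative_to(root).as_posix() for p in loadable_files(root)])
    # ['a.pdf', 'notes.md', 'sub/rows.csv']
```

With loomflow installed, the result feeds straight into the earlier pattern: docs = [load(str(p)) for p in loadable_files(Path("docs/"))].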

Why markdown? Every chunker downstream expects markdown. It’s the lingua franca that preserves structure (headings, tables, lists) while staying easy for the LLM to read. The loaders normalize once; chunkers and vector stores never need to know the source format.

→ Chunkers. Splitting the markdown into LLM-friendly pieces.