
Document loaders

The loaders read .pdf, .docx, .xlsx, .csv, .tsv, .md, .txt, and .html files into a normalized Document whose content is markdown text. All loaders live under loomflow.loader.

One-liner: auto-detect by extension

    from loomflow.loader import load

    doc = load("research.pdf")
    print(doc.content[:500])  # markdown
    print(doc.metadata)
    # {"source": "research.pdf", "format": "pdf", "page_count": 42,
    #  "title": "Research Report", "backend": "unstructured", "strategy": "fast"}
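To make the auto-detection idea concrete, here is a minimal sketch of extension-based dispatch. It is not loomflow's implementation: `EXTENSION_TO_FORMAT` and `detect_format` are hypothetical names, and the mapping only mirrors the format strings listed on this page.

```python
from pathlib import Path

# Hypothetical table: file suffix -> the "format" value stored in metadata.
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".xlsx": "xlsx",
    ".csv": "csv",
    ".tsv": "tsv",
    ".md": "md",
    ".txt": "txt",
    ".html": "html",
}


def detect_format(path: str) -> str:
    """Return the format key for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None


print(detect_format("research.pdf"))
print(detect_format("Metrics.XLSX"))
```

A real dispatcher would then call the matching per-format loader; the sketch stops at the lookup because that is the only behavior this page describes.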

Per-format loaders

    from loomflow.loader import (
        load_pdf, load_docx, load_excel, load_csv, load_tsv,
        load_markdown, load_text, load_html,
    )

    doc = load_pdf("research.pdf")
    doc = load_docx("brief.docx")
    doc = load_excel("metrics.xlsx")  # → markdown table per sheet
    doc = load_csv("rows.csv")        # → markdown table
    doc = load_tsv("rows.tsv")
    doc = load_markdown("notes.md")   # passthrough
    doc = load_text("plain.txt")      # passthrough
    doc = load_html("page.html")      # → markdown via BeautifulSoup

Document shape

    @dataclass
    class Document:
        content: str               # full markdown
        metadata: dict[str, Any]   # source / format / format-specific

metadata always contains:

  • source: the source file path (str).
  • format: one of "pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html".

Format-specific keys are added when relevant: page_count / title / backend / strategy for PDFs, sheet_names for Excel, row_count for CSVs, etc.
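Because downstream code relies on the two guaranteed keys, a small check can catch malformed documents early. The sketch below uses a stand-in `Document` dataclass that mirrors the shape described above (it is not imported from loomflow), and `check_metadata` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in with the same two fields as the loader's Document."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


REQUIRED_KEYS = ("source", "format")
KNOWN_FORMATS = {"pdf", "docx", "xlsx", "csv", "tsv", "md", "txt", "html"}


def check_metadata(doc: Document) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks fine."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS
                if key not in doc.metadata]
    fmt = doc.metadata.get("format")
    if fmt is not None and fmt not in KNOWN_FORMATS:
        problems.append(f"unknown format: {fmt}")
    return problems


good = Document("# Title", {"source": "research.pdf", "format": "pdf", "page_count": 42})
bad = Document("text", {"format": "rtf"})
print(check_metadata(good))
print(check_metadata(bad))
```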

Loader output

| Source | Markdown output |
| --- | --- |
| PDF | Element-aware: titles, paragraphs, lists, tables, per-page sections. Two interchangeable backends (see below). |
| DOCX | Headings + paragraphs preserved with proper hierarchy. |
| Excel | One markdown table per sheet, prefixed with ## SheetName. |
| CSV / TSV | One big markdown table. |
| HTML | Headings + paragraphs + lists preserved; scripts / styles stripped. |
| Markdown / text | Passed through unchanged. |
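As an illustration of the CSV / TSV row in the table, the following sketch turns delimited text into a single markdown table using only the standard library. It shows the general technique, not loomflow's code; `rows_to_markdown_table` and `csv_text_to_markdown` are hypothetical helpers, and treating the first row as the header is an assumption.

```python
import csv
import io


def rows_to_markdown_table(rows: list[list[str]]) -> str:
    """Render rows as a markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(row: list[str]) -> str:
        cells = [cell.replace("|", "\\|") for cell in row]  # escape pipes
        cells += [""] * (width - len(cells))                # pad short rows
        return "| " + " | ".join(cells[:width]) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


def csv_text_to_markdown(text: str, delimiter: str = ",") -> str:
    """Parse delimited text and return it as one markdown table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows_to_markdown_table(rows)


print(csv_text_to_markdown("name,score\nada,9\nlin,7\n"))
```

Passing `delimiter="\t"` handles the TSV case with the same function.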

PDF: two interchangeable backends

load_pdf() ships two backends, picked at load time via backend=. Both produce the same Document shape, so downstream code (chunkers, vector stores) doesn’t care which was used.

from loomflow.loader import load_pdf # Default — unstructured, fast strategy doc = load_pdf("research.pdf") # Tune the unstructured strategy doc = load_pdf("scanned.pdf", strategy="ocr_only", languages=["eng", "fra"]) doc = load_pdf("complex_layout.pdf", strategy="hi_res") # Or switch to docling (IBM Research) doc = load_pdf("research.pdf", backend="docling")

backend="unstructured" (default)

Wraps the unstructured library (Apache 2.0; the same engine behind LangChain’s UnstructuredPDFLoader). It parses at the element level, assigning categories (Title / NarrativeText / Table / ListItem / Image) and per-page metadata, and is widely used in RAG pipelines.

Three strategy modes:

| strategy= | Engine | Best for |
| --- | --- | --- |
| "fast" (default) | pdfminer.six (pure Python) | Native text PDFs. Fast. No model downloads. |
| "hi_res" | YOLO layout detection (unstructured-inference) | Multi-column / table-heavy / mixed layouts. |
| "ocr_only" | Tesseract OCR | Scanned or image-only PDFs. |

languages=["eng", "fra"] sets OCR / layout languages. Only meaningful with hi_res and ocr_only. Defaults to English.

backend="docling"

Wraps docling (MIT, IBM Research). ML-based, structure-aware extraction; the 2026 best-in-class benchmark winner for native PDFs. Outputs clean markdown with hierarchy preserved (titles, sections, tables, lists). Slower first run (downloads layout model on first use, then cached); comparable speed afterwards.

The strategy= and languages= kwargs are ignored by the docling backend. It always runs its full layout-aware pipeline.

Picking a backend

| You have… | Use |
| --- | --- |
| Native text PDFs, lots of them, want speed | "unstructured" + strategy="fast" (the default) |
| Multi-column research papers / financial reports | "unstructured" + strategy="hi_res" |
| Scanned / image-only PDFs | "unstructured" + strategy="ocr_only" + languages= |
| Want the best quality on native PDFs and don’t mind a one-time model download | "docling" |
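The decision table can be encoded as a small helper that returns keyword arguments for load_pdf. This is a sketch under stated assumptions: `pick_pdf_kwargs` is a hypothetical function, and the precedence (scanned first, then complex layout, then quality) is one reasonable reading of the table rather than documented behavior.

```python
from typing import Optional


def pick_pdf_kwargs(
    *,
    scanned: bool = False,
    complex_layout: bool = False,
    prefer_quality: bool = False,
    languages: Optional[list[str]] = None,
) -> dict:
    """Map document traits to backend= / strategy= / languages= kwargs."""
    if scanned:
        # OCR is the only option for image-only PDFs; languages matter here.
        return {
            "backend": "unstructured",
            "strategy": "ocr_only",
            "languages": languages or ["eng"],
        }
    if complex_layout:
        return {"backend": "unstructured", "strategy": "hi_res"}
    if prefer_quality:
        # docling ignores strategy= and languages=, so neither is passed.
        return {"backend": "docling"}
    return {"backend": "unstructured", "strategy": "fast"}


print(pick_pdf_kwargs(scanned=True, languages=["eng", "fra"]))
print(pick_pdf_kwargs(prefer_quality=True))
```

The result would be splatted into the loader call, for example `load_pdf(path, **pick_pdf_kwargs(complex_layout=True))`.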

Failure handling

PDFs vary wildly. When extraction fails on a specific file, load_pdf:

  1. Emits a RuntimeWarning with the backend, strategy, and the underlying exception.
  2. Logs at WARNING to loomflow.loader.pdf.
  3. Returns a non-fatal empty Document with extraction_error in metadata so the pipeline keeps going on bad inputs.

    import warnings

    from loomflow.loader import load_pdf

    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        doc = load_pdf("broken.pdf")

    if w:
        print(w[0].message)  # "unstructured failed to parse 'broken.pdf' ..."

    if doc.metadata.get("extraction_error"):
        handle_unparseable(doc)  # placeholder for your own handler
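The warn, log, and return-empty pattern described in the three steps can be sketched in isolation. Nothing here is loomflow's actual code: `load_with_fallback` and `broken_parser` are hypothetical, the `Document` class is a stand-in, and storing the exception message as a string under extraction_error is an assumption about the value's shape.

```python
import logging
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable

# Logger name taken from the text above.
logger = logging.getLogger("loomflow.loader.pdf")


@dataclass
class Document:
    """Stand-in with the same two fields as the loader's Document."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def load_with_fallback(
    path: str,
    parse: Callable[[str], str],
    backend: str = "unstructured",
    strategy: str = "fast",
) -> Document:
    """Run parse(path); on failure warn, log, and return an empty Document."""
    metadata: dict[str, Any] = {
        "source": path, "format": "pdf", "backend": backend, "strategy": strategy,
    }
    try:
        content = parse(path)
    except Exception as exc:
        message = f"{backend} ({strategy}) failed to parse {path!r}: {exc}"
        warnings.warn(message, RuntimeWarning, stacklevel=2)  # step 1
        logger.warning(message)                               # step 2
        metadata["extraction_error"] = str(exc)               # step 3
        content = ""
    return Document(content=content, metadata=metadata)


def broken_parser(path: str) -> str:
    """Simulates a backend that cannot handle the file."""
    raise ValueError("no text layer found")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    doc = load_with_fallback("broken.pdf", broken_parser)

print(doc.metadata["extraction_error"])
```

Returning an empty Document instead of raising keeps batch jobs running, at the cost that callers must check for extraction_error themselves.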

Migration from pypdf (pre-0.10). Older versions of the framework shipped a pypdf-backed PDF loader that silently produced empty text on multi-column or table-heavy PDFs, which showed up as the classic “questions about content near the end of the PDF go unanswered” symptom. The new loader is a strict upgrade: the public API is unchanged apart from the new keyword arguments, so existing code calling load_pdf(path) keeps working.

Optional dependencies

pip install 'loomflow[loader-pdf]'

This pulls in unstructured[pdf]>=0.15, which covers strategy="fast" (the default). For hi_res / ocr_only, also install Tesseract (system binary) and the model deps:

    pip install 'unstructured[pdf,ocr]'   # OCR extras
    # macOS:  brew install tesseract poppler
    # Debian: apt-get install -y tesseract-ocr poppler-utils

load_pdf / load_docx / etc. raise ImportError with the right pip install hint if a dependency isn’t available.
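A common way to produce this kind of helpful ImportError is a guarded import that re-raises with an install hint. The sketch below shows that general pattern; `require` is a hypothetical helper, and the exact wording of loomflow's message is not documented here.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import a module, or raise ImportError with a pip install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this loader. "
            f"Install it with: pip install 'loomflow[{extra}]'"
        ) from exc


# An installed module is returned as-is.
json_module = require("json", "loader-pdf")

# A missing module produces the hint instead of a bare ImportError.
try:
    require("surely_not_an_installed_module_xyz", "loader-pdf")
except ImportError as exc:
    print(exc)
```

Deferring the import to call time means users who never load PDFs do not need the PDF dependencies installed.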

Loading a folder

    from pathlib import Path

    from loomflow.loader import load

    docs = [load(str(p)) for p in Path("docs/").glob("**/*.pdf")]

metadata["source"] carries the path, so you can trace chunks back to their files later.

For mixed-quality corpora, pass an explicit backend / strategy via load_pdf instead of the auto-dispatch load:

    from pathlib import Path

    from loomflow.loader import load_pdf

    docs = [
        load_pdf(str(p), strategy="hi_res")
        for p in Path("scans/").glob("*.pdf")
    ]
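After loading a folder, it can help to separate documents that parsed cleanly from those that carry extraction_error, so the failures can be retried with a different strategy or backend. The sketch uses a stand-in `Document` and a hypothetical `split_by_outcome` helper; the sample metadata is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in with the same two fields as the loader's Document."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def split_by_outcome(docs: list[Document]) -> tuple[list[Document], list[Document]]:
    """Return (ok, failed), where failed docs carry extraction_error."""
    ok: list[Document] = []
    failed: list[Document] = []
    for doc in docs:
        (failed if doc.metadata.get("extraction_error") else ok).append(doc)
    return ok, failed


# Illustrative inputs standing in for the result of a folder load.
docs = [
    Document("# Report", {"source": "scans/a.pdf", "format": "pdf"}),
    Document("", {"source": "scans/b.pdf", "format": "pdf",
                  "extraction_error": "no text layer found"}),
]

ok, failed = split_by_outcome(docs)
retry_paths = [doc.metadata["source"] for doc in failed]
print(retry_paths)
```

The paths in `retry_paths` could then be passed back through load_pdf with, for example, strategy="ocr_only".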

Why markdown? Every chunker downstream expects markdown. It’s the lingua franca that preserves structure (headings, tables, lists) while staying easy for the LLM to read. The loaders normalize once; chunkers and vector stores never need to know the source format.

Next

Chunkers: splitting the markdown into LLM-friendly pieces.
