Chunkers
Four strategies for breaking Document.content into LLM-friendly
chunks. Each produces a list[Chunk], and every chunk's metadata
points back at the source.
| Chunker | Use it when |
|---|---|
| RecursiveChunker | The production workhorse. LangChain-compatible behaviour. Default. |
| MarkdownChunker | Source has clear # heading structure (PDFs, DOCX, Excel via the loaders). Preserves the header trail in chunk metadata. |
| SentenceChunker | Sentence-boundary chunks for QA-style RAG over prose. |
| TokenChunker | Chunk by token count via tiktoken (lazy import). |
One-liner
from loomflow.loader import load, chunk
doc = load("research.pdf")
chunks = chunk(doc.content)  # default: RecursiveChunker

Pick one explicitly
from loomflow.loader import load_pdf, MarkdownChunker
doc = load_pdf("research.pdf")
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)

All chunkers share the same .split(text, source=None) interface.
Pass source= so the chunk’s metadata records its provenance:
chunks = MarkdownChunker().split(doc.content, source=str(doc.metadata["source"]))

Chunk shape
from dataclasses import dataclass
from typing import Any

@dataclass
class Chunk:
    content: str
    metadata: dict[str, Any]  # source, chunk_index, headings, ...
    id: str | None = None     # populated by VectorStore.add()

RecursiveChunker
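Splits on a hierarchy of separators (paragraph boundary → newline → sentence → space), falling back to the next finer separator only when a piece is still too large. This is the most general-purpose chunker and handles arbitrary text well.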
from loomflow.loader import RecursiveChunker
chunker = RecursiveChunker(
    chunk_size=600,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # default
)

Defaults match LangChain’s RecursiveCharacterTextSplitter, so
chunks port across systems.
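To make the separator hierarchy concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not loomflow's implementation: the function name recursive_split is hypothetical, and chunk overlap is left out to keep the fallback logic visible.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size chars, coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


separators = ["\n\n", "\n", ". ", " ", ""]
text = "alpha beta gamma.\n\ndelta epsilon zeta eta theta iota kappa"
pieces = recursive_split(text, chunk_size=20, separators=separators)
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and has no newline or sentence break, so the sketch falls through to splitting on spaces.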
MarkdownChunker
Splits on # heading boundaries. Each chunk records the header trail
in its metadata: metadata["headings"] is the path of headings the
chunk lives under.
from loomflow.loader import MarkdownChunker
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)
for c in chunks[:3]:
    print(c.metadata["headings"])  # e.g. ["Methods", "Pre-processing"]

Pair with the PDF / DOCX / Excel loaders. They produce well-structured
markdown, and MarkdownChunker preserves that structure as metadata
the LLM can reason about.
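The header trail can be pictured as a stack that is pushed and popped as headings go by. The sketch below is a rough illustration under stated assumptions, not loomflow's code: split_by_headings is a hypothetical name, it ignores chunk_size and overlap, and it does not special-case # characters inside fenced code.

```python
import re

# ATX-style heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one section per heading body, each tagged with its heading path."""
    trail: list[str] = []   # heading titles, outermost first
    levels: list[int] = []  # heading depth for each entry in trail
    body: list[str] = []
    sections: list[dict] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"content": content, "headings": list(trail)})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the section that was open under the previous trail
        level = len(match.group(1))
        # Drop headings at the same or deeper level before pushing the new one.
        while levels and levels[-1] >= level:
            levels.pop()
            trail.pop()
        levels.append(level)
        trail.append(match.group(2))
    flush()
    return sections


doc_text = "# Methods\nintro\n## Pre-processing\nclean data\n## Training\nfit model\n# Results\nnumbers"
sections = split_by_headings(doc_text)
```

Here the "clean data" section carries the trail ["Methods", "Pre-processing"], the same shape as the metadata["headings"] example above.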
SentenceChunker
Splits on sentence boundaries (. , ? , ! ). Best for prose where
keeping individual sentences atomic matters (factual QA,
citation-backed answers).
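As a rough picture of sentence-atomic chunking with a one-sentence overlap, consider the sketch below. It rests on assumptions: the function names are hypothetical, the size limit is measured in characters, and the boundary rule is a simple regex on ., ? and ! followed by whitespace, which will mis-split abbreviations such as "e.g. this".

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def sentence_chunks(text: str, chunk_size: int, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks, repeating `overlap` sentences between chunks."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last `overlap` sentences into the next chunk.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


prose = "Cats purr. Dogs bark! Do birds sing? Fish swim."
result = sentence_chunks(prose, chunk_size=25, overlap=1)
```

No sentence is ever cut in half, so a chunk can exceed chunk_size when a single sentence (or the carried overlap plus one sentence) is longer than the limit.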
from loomflow.loader import SentenceChunker
chunker = SentenceChunker(chunk_size=400, chunk_overlap=1)  # 1-sentence overlap

TokenChunker
Chunks by token count via tiktoken. Useful when the model has a hard
context budget and you want to pack chunks to a precise size.
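Packing to a token budget amounts to sliding a fixed-size window over the token sequence. The sketch below shows that windowing on a plain list of token ids; token_windows is a hypothetical helper, and in practice the ids would come from a tokenizer (for example tiktoken's encode) and each window would be decoded back to text.

```python
def token_windows(tokens: list[int], chunk_size: int, chunk_overlap: int) -> list[list[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


small = token_windows(list(range(10)), chunk_size=4, chunk_overlap=1)
large = token_windows(list(range(1000)), chunk_size=512, chunk_overlap=64)
```

Every window except possibly the last holds exactly chunk_size tokens, which is what makes the budget predictable.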
from loomflow.loader import TokenChunker
chunker = TokenChunker(
    chunk_size=512,          # tokens, not chars
    chunk_overlap=64,
    encoding="cl100k_base",  # OpenAI GPT-4 / GPT-3.5 encoding; only an approximation for other models
)

Requires tiktoken:
pip install 'loomflow[loader-token]'

Picking chunk_size. Default 600 chars (~150 tokens) balances recall (smaller chunks → more diverse top-k) against context (each hit pays the chunk’s tokens). Bump to 1000–1500 if your top-k=4 hits keep getting truncated mid-thought.
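The trade-off can be put in numbers with back-of-the-envelope arithmetic. This sketch assumes roughly 4 characters per token, the ratio implied by "600 chars (~150 tokens)"; real ratios vary by tokenizer and language, and the helper names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English text


def approx_tokens(chars: int) -> int:
    """Estimate the token count of a chunk from its character length."""
    return chars // CHARS_PER_TOKEN


def retrieval_cost(chunk_size_chars: int, top_k: int) -> int:
    """Estimate the prompt tokens spent on top_k retrieved chunks."""
    return top_k * approx_tokens(chunk_size_chars)


default_cost = retrieval_cost(600, top_k=4)   # default chunk size
larger_cost = retrieval_cost(1500, top_k=4)   # upper end of the suggested bump
```

Under this estimate, moving from 600 to 1500 characters raises the retrieval share of the prompt from about 600 to about 1500 tokens at top-k=4.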