Chunkers

Four strategies for splitting Document.content into LLM-friendly chunks. Each returns a list[Chunk] whose metadata points back to the source.

Chunker	Use it when
`RecursiveChunker`	The production workhorse. LangChain-compatible behaviour. Default.
`MarkdownChunker`	Source has a clear `#` heading structure (PDF, DOCX, and Excel files converted to markdown by the loaders). Preserves the header trail in chunk metadata.
`SentenceChunker`	Sentence-boundary chunks for QA-style RAG over prose.
`TokenChunker`	Chunk by token count via `tiktoken` (lazy import).

One-liner


from loomflow.loader import load, chunk
 
doc = load("research.pdf")
chunks = chunk(doc.content)             # default: RecursiveChunker

Pick one explicitly


from loomflow.loader import load_pdf, MarkdownChunker
 
doc = load_pdf("research.pdf")
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)

All chunkers share the same .split(text, source=None) interface. Pass source= so the chunk’s metadata records its provenance:


chunks = MarkdownChunker().split(doc.content, source=str(doc.metadata["source"]))

Chunk shape


from dataclasses import dataclass
from typing import Any

@dataclass
class Chunk:
    content: str                        # the chunk text
    metadata: dict[str, Any]            # source, chunk_index, headings, ...
    id: str | None = None               # populated by VectorStore.add()

RecursiveChunker

Splits on a hierarchy of separators (paragraph boundary → newline → sentence → space). The most general-purpose option; it handles arbitrary text well.


from loomflow.loader import RecursiveChunker
 
chunker = RecursiveChunker(
    chunk_size=600,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],   # default
)

The splitting behaviour follows LangChain’s RecursiveCharacterTextSplitter, so a configuration ports across systems when the same chunk_size, chunk_overlap, and separators are used.
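
To make the separator hierarchy concrete, here is a minimal sketch of recursive splitting in plain Python. It is an illustration of the general technique, not loomflow's implementation: `recursive_split` is a hypothetical name, and overlap is left out to keep the idea visible.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", ""),
) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to a finer
    one for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep each separator attached to the piece it terminates,
    # so concatenating the chunks reproduces the input exactly.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Too long even on its own: flush, then recurse with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
        elif len(current) + len(piece) <= chunk_size:
            current += piece          # still fits, keep merging
        else:
            chunks.append(current)    # full, start a new chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta. Iota kappa.\n\nLambda."
chunks = recursive_split(text, chunk_size=20)
print(chunks)
```

Every chunk stays within the size limit, and paragraph boundaries are only broken when a paragraph cannot fit on its own.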

MarkdownChunker

Splits on # heading boundaries. Each chunk records its header trail in metadata: metadata["headings"] is the path of headings the chunk lives under.


from loomflow.loader import MarkdownChunker
 
chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.split(doc.content)
 
for c in chunks[:3]:
    print(c.metadata["headings"])     # e.g. ["Methods", "Pre-processing"]

Pair with the PDF / DOCX / Excel loaders. They produce well-structured markdown, and MarkdownChunker preserves that structure as metadata the LLM can reason about.
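
The header trail can be pictured as a stack that is popped whenever a heading of the same or a shallower level appears. The sketch below shows that idea in isolation; `split_by_headings` is a hypothetical function, it ignores chunk_size and overlap, and it does not special-case `#` lines inside fenced code, all of which a real chunker would need to handle.

```python
import re

# ATX-style heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[tuple[list[str], str]]:
    """Return (heading_trail, body) pairs, one per non-empty section."""
    sections: list[tuple[list[str], str]] = []
    trail: list[tuple[int, str]] = []   # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append(([title for _, title in trail], text))
        body.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            flush()                      # close the section under the old trail
            level = len(m.group(1))
            # Drop headings at the same or a deeper level than the new one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, m.group(2)))
        else:
            body.append(line)
    flush()
    return sections


md = (
    "# Methods\nOverview.\n"
    "## Pre-processing\nClean the data.\n"
    "## Training\nFit the model.\n"
    "# Results\nNumbers."
)
sections = split_by_headings(md)
for headings, text in sections:
    print(headings, "->", text)
```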

SentenceChunker

Splits on sentence boundaries (., ?, !). Best for prose where keeping individual sentences atomic matters (factual QA, citation-backed answers).


from loomflow.loader import SentenceChunker
 
chunker = SentenceChunker(chunk_size=400, chunk_overlap=1)  # 1-sentence overlap

TokenChunker

Chunks by token count via tiktoken. Useful when the model has a hard context budget and you want to pack chunks to a precise size.


from loomflow.loader import TokenChunker
 
chunker = TokenChunker(
    chunk_size=512,         # tokens, not chars
    chunk_overlap=64,
    encoding="cl100k_base", # GPT-4 / GPT-3.5 encoding; counts are approximate for other models
)

Requires tiktoken:


pip install 'loomflow[loader-token]'
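
Token-based chunking boils down to a sliding window over token ids. The sketch below shows only the windowing step on a plain list of integers so it runs without tiktoken; `token_windows` is a hypothetical helper, not the library's code.

```python
from typing import Sequence


def token_windows(tokens: Sequence[int], chunk_size: int, chunk_overlap: int) -> list[list[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                     # this window already reached the end
    return windows


# Stand-in token ids. With tiktoken installed they would come from a real encoding:
#   enc = tiktoken.get_encoding("cl100k_base")
#   texts = [enc.decode(w) for w in token_windows(enc.encode(text), 512, 64)]
tokens = list(range(10))
windows = token_windows(tokens, chunk_size=4, chunk_overlap=1)
print(windows)
```

Because the window is defined on tokens rather than characters, no chunk exceeds chunk_size tokens, although decoding a window may start or end mid-word.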

Picking chunk_size. The default of 600 characters (~150 tokens) balances recall (smaller chunks → more diverse top-k) against context cost (every hit spends its chunk’s tokens). Raise it to 1000–1500 if your top-k=4 hits keep getting cut off mid-thought.
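
The trade-off can be estimated with back-of-envelope arithmetic. The sketch below assumes roughly 4 characters per token, the ratio implied by "600 chars (~150 tokens)"; the real ratio varies with language and tokenizer, and `retrieval_token_cost` is a hypothetical helper.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average, implied by 600 chars ~ 150 tokens


def retrieval_token_cost(chunk_size_chars: int, top_k: int) -> int:
    """Approximate prompt tokens spent on retrieved chunks."""
    return top_k * chunk_size_chars // CHARS_PER_TOKEN


for size in (600, 1000, 1500):
    print(f"chunk_size={size}: ~{retrieval_token_cost(size, top_k=4)} tokens for top-k=4")
```

Under this assumption, moving from 600 to 1500 characters at top-k=4 raises the retrieval share of the prompt from about 600 to about 1500 tokens.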