Chunkers

Four strategies for breaking a Document.content into LLM-friendly chunks. All produce a list[Chunk] whose metadata points back at the source.

| Chunker | Use it when |
| --- | --- |
| RecursiveChunker | The production workhorse. LangChain-compatible behaviour. Default. |
| MarkdownChunker | Source has clear # heading structure (PDFs, DOCX, Excel via the loaders). Preserves the header trail in chunk metadata. |
| SentenceChunker | Sentence-boundary chunks for QA-style RAG over prose. |
| TokenChunker | Chunk by token count via tiktoken (lazy import). |

One-liner

    from loomflow.loader import load, chunk

    doc = load("research.pdf")
    chunks = chunk(doc.content)  # default: RecursiveChunker

Pick one explicitly

    from loomflow.loader import load_pdf, MarkdownChunker

    doc = load_pdf("research.pdf")
    chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
    chunks = chunker.split(doc.content)

All chunkers share the same .split(text, source=None) interface. Pass source= so the chunk’s metadata records its provenance:

chunks = MarkdownChunker().split(doc.content, source=str(doc.metadata["source"]))

Chunk shape

    @dataclass
    class Chunk:
        content: str
        metadata: dict[str, Any]  # source, chunk_index, headings, ...
        id: str | None = None     # populated by VectorStore.add()
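
To make the shape concrete, here is a minimal sketch that builds chunks by hand. The `Chunk` class below is a local stand-in that mirrors the fields shown above rather than an import from loomflow, and `make_chunks` is a hypothetical helper invented for illustration. The metadata keys `source` and `chunk_index` are taken from the comment in the dataclass; how the library fills them in practice is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Chunk:
    """Local stand-in mirroring the documented fields (not loomflow's class)."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None  # left unset here; the docs say VectorStore.add() fills it


def make_chunks(pieces: list[str], source: Optional[str] = None) -> list[Chunk]:
    """Hypothetical helper: wrap text pieces and record provenance in metadata."""
    chunks = []
    for index, piece in enumerate(pieces):
        metadata: dict[str, Any] = {"chunk_index": index}
        if source is not None:
            metadata["source"] = source
        chunks.append(Chunk(content=piece, metadata=metadata))
    return chunks


chunks = make_chunks(["First part.", "Second part."], source="research.pdf")
print(chunks[1].metadata)
```

Keeping provenance in `metadata` rather than in the text itself means a retrieved chunk can be cited back to its source without polluting the content that gets embedded.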

RecursiveChunker

Splits on a hierarchy of separators (paragraph boundary → newline → sentence → space). It is the most general-purpose chunker and handles arbitrary text well.

    from loomflow.loader import RecursiveChunker

    chunker = RecursiveChunker(
        chunk_size=600,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],  # default
    )

Defaults match LangChain’s RecursiveCharacterTextSplitter, so chunks port across systems.
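
The separator hierarchy can be illustrated with a small stand-alone sketch. This is not loomflow's implementation: `recursive_split` is a hypothetical function, chunk overlap is left out for brevity, and the merging rule (greedily pack pieces until the next one would exceed `chunk_size`) is an assumption about how such splitters usually behave.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer
    ones for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta."
pieces = recursive_split(text, 20, ["\n\n", "\n", ". ", " ", ""])
print(pieces)
```

The first paragraph fits in one chunk and stays whole; only the second, longer paragraph is broken further, and then at word boundaries rather than mid-word.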

MarkdownChunker

Splits on # heading boundaries. Each chunk records the header trail in its metadata: metadata["headings"] is the path of headings the chunk lives under.

    from loomflow.loader import MarkdownChunker

    chunker = MarkdownChunker(chunk_size=800, chunk_overlap=100)
    chunks = chunker.split(doc.content)

    for c in chunks[:3]:
        print(c.metadata["headings"])  # e.g. ["Methods", "Pre-processing"]

Pair with the PDF / DOCX / Excel loaders. They produce well-structured markdown, and MarkdownChunker preserves that structure as metadata the LLM can reason about.
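
How a header trail can be tracked is sketched below. This is an illustration only, not the library's code: `split_by_headings` is a hypothetical function, it ignores `chunk_size` and `chunk_overlap`, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

# ATX-style heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[tuple[list[str], str]]:
    """Return (heading_trail, body_text) pairs, one per non-empty section."""
    sections: list[tuple[list[str], str]] = []
    trail: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append(([title for _, title in trail], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # close the section that ran up to this heading
        level = len(match.group(1))
        # A new heading replaces any open heading at the same or deeper level.
        while trail and trail[-1][0] >= level:
            trail.pop()
        trail.append((level, match.group(2)))
    flush()
    return sections


doc_text = (
    "# Methods\nIntro text.\n"
    "## Pre-processing\nWe normalise.\n"
    "## Training\nWe train.\n"
    "# Results\nIt works."
)
sections = split_by_headings(doc_text)
for headings, text in sections:
    print(headings, "->", text)
```

The stack of `(level, title)` pairs is what turns a flat stream of headings into a path such as `["Methods", "Pre-processing"]`.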

SentenceChunker

Splits on sentence boundaries (., ?, !). Best for prose where keeping individual sentences atomic matters (factual QA, citation-backed answers).

    from loomflow.loader import SentenceChunker

    chunker = SentenceChunker(chunk_size=400, chunk_overlap=1)  # 1-sentence overlap
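
Sentence-level overlap can be pictured with the following sketch. It is an assumption about the general technique, not the library's algorithm: `sentence_chunks` is a hypothetical function, the regex is a naive splitter that will mis-handle abbreviations such as "e.g.", and a chunk may exceed `chunk_size` when a carried-over sentence plus the next one is too long, because sentences are never cut.

```python
import re


def sentence_chunks(text: str, chunk_size: int, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly chunk_size characters."""
    # Naive boundary rule: whitespace that follows '.', '?' or '!'.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) over so neighbouring chunks share context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


result = sentence_chunks(
    "Cats purr. Dogs bark! Do birds sing? Yes they do.",
    chunk_size=30,
    overlap_sentences=1,
)
for chunk in result:
    print(chunk)
```

Each chunk after the first begins with the final sentence of the previous one, which is what a 1-sentence overlap means here.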

TokenChunker

Chunks by token count via tiktoken. Useful when the model has a hard context budget and you want to pack chunks to a precise size.

    from loomflow.loader import TokenChunker

    chunker = TokenChunker(
        chunk_size=512,          # tokens, not characters
        chunk_overlap=64,
        encoding="cl100k_base",  # OpenAI encoding (GPT-4, GPT-3.5-turbo); counts are approximate for other model families
    )

Requires tiktoken:

pip install 'loomflow[loader-token]'
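
The windowing behind token-based chunking can be sketched without tiktoken by working on a plain list of token ids. `token_windows` is a hypothetical function written for illustration, and the stride rule (`chunk_size - chunk_overlap`) is an assumption about how the overlap is applied.

```python
from typing import Sequence


def token_windows(tokens: Sequence[int], chunk_size: int, chunk_overlap: int) -> list[list[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    windows: list[list[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


# Stand-in token ids; with tiktoken installed they would come from
#   enc = tiktoken.get_encoding("cl100k_base")
#   ids = enc.encode(text)
# and each window would be turned back into text with enc.decode(window).
windows = token_windows(list(range(10)), chunk_size=4, chunk_overlap=1)
print(windows)
```

Because the window is measured in tokens, every chunk except possibly the last has exactly `chunk_size` tokens, which is what makes the packing precise.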

Picking chunk_size. The default of 600 characters (~150 tokens, at roughly 4 characters per token) balances recall (smaller chunks → more diverse top-k) against context cost (every retrieved hit spends its chunk's tokens). Raise it to 1000–1500 if, with top-k=4, the retrieved chunks keep getting cut off mid-thought.
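
The trade-off can be put into numbers with a quick estimate. The helper below is hypothetical, and the 4-characters-per-token ratio is the rough figure implied by "600 chars (~150 tokens)"; real ratios vary by language and tokenizer.

```python
def estimated_context_tokens(chunk_size_chars: int, top_k: int, chars_per_token: float = 4.0) -> int:
    """Rough token cost of stuffing top_k retrieved chunks into the prompt."""
    return round(top_k * chunk_size_chars / chars_per_token)


for size in (600, 1000, 1500):
    print(size, "chars per chunk ->", estimated_context_tokens(size, top_k=4), "tokens of context")
```

With top-k=4 the estimate comes to about 600, 1000, and 1500 tokens respectively, so moving from the default to 1500-character chunks multiplies the retrieval share of the prompt by 2.5.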
