Going from PDF to Chunks the smart way
I got asked on yesterday's call how to turn a PDF into chunks for RAG in a more consistent way. The first challenge with converting any PDF is the way the document is formatted under the hood, which differs from file to file. Much of that formatting has no impact on the printed output, but it does matter when you extract the text with Python and LangChain: the output is often inconsistent, and sections often get wrongly merged during chunking.
A better approach that has worked consistently for me is to first convert the PDF into Markdown, then split the Markdown into chunks:
Step One:
import pymupdf4llm
import pathlib
# Convert PDF to markdown
md_text = pymupdf4llm.to_markdown("input.pdf")
# Save markdown to file (optional); you could also just keep it in memory as a string
pathlib.Path("output.md").write_bytes(md_text.encode())
Step Two:
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Define headers to split on
headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), ]
# Initialize splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
# Split the markdown text
md_header_splits = markdown_splitter.split_text(md_text)
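To get a feel for why the Markdown step helps, here is a rough standard-library sketch of what splitting on headers does conceptually. It is not how LangChain implements MarkdownHeaderTextSplitter; the function name split_by_headers, the dict layout, and the sample text are all made up for illustration, and it ignores edge cases such as "#" lines inside fenced code.

```python
import re

# Matches "#", "##" or "###" followed by the header text (levels 1 to 3 only).
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


def split_by_headers(md_text):
    """Group Markdown lines into chunks, one per header-delimited section.

    Hypothetical helper for illustration only; a simplified stand-in,
    not the library's implementation.
    """
    chunks = []
    current_headers = {}  # e.g. {"Header 1": "Guide", "Header 2": "Setup"}
    buffer = []

    def flush():
        # Emit the text collected so far, tagged with the active headers.
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"metadata": dict(current_headers), "content": content})
        buffer.clear()

    for line in md_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for key in [k for k in current_headers if int(k.split()[-1]) >= level]:
                del current_headers[key]
            current_headers[f"Header {level}"] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n"
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the headers it sits under, so a section never bleeds into its neighbour, which is the consistency problem described at the top of this post.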
Paul Miller