PDF and Document Ingestion for LLMs
Why document processing matters for LLMs
LLMs consume text, not PDFs. The gap between a PDF on disk and useful LLM context is wider than it looks: scanned documents need OCR, tables need structural extraction, multi-column layouts confuse naive text extractors, and long documents need chunking strategies that preserve semantic coherence. Getting this wrong means your RAG pipeline retrieves garbage and your summarization hallucinates structure that isn't there.
The pipeline is always the same shape: extract raw content from the source format, clean it (normalize encoding, remove boilerplate, fix reading order), chunk it into LLM-sized pieces, and enrich each chunk with metadata (source, page, section). Each step has multiple tool options with different accuracy/speed/cost tradeoffs.
Text extraction with pdfplumber
Use pdfplumber for born-digital PDFs (created from text editors, not scanners) with complex layouts, tables, or multi-column text. It parses the PDF's internal text objects and their positions, giving you fine-grained control over what gets extracted and from where.
Basic text extraction
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
if text:
print(f"Page {page.page_number}: {len(text)} chars")Table extraction
extract_tables() returns a list of tables, each as a list of rows (lists of cell strings). This is far more reliable than trying to parse table structure from raw text.
with pdfplumber.open("report.pdf") as pdf:
for page in pdf.pages:
tables = page.extract_tables()
for i, table in enumerate(tables):
# table is a list of rows; each row is a list of cell strings
headers = table[0]
rows = table[1:]
print(f"Table {i}: {len(rows)} rows, columns: {headers}")Word-level extraction and column detection
When multi-column layouts scramble reading order, drop down to word-level bounding boxes and sort by position:
with pdfplumber.open("two_column.pdf") as pdf:
page = pdf.pages[0]
words = page.extract_words()
# Words have x0, x1, top, bottom attributes
# Split into left/right columns at the page midpoint
mid_x = page.width / 2
left_col = sorted(
[w for w in words if w["x0"] < mid_x],
key=lambda w: (w["top"], w["x0"]),
)
right_col = sorted(
[w for w in words if w["x0"] >= mid_x],
key=lambda w: (w["top"], w["x0"]),
)
left_text = " ".join(w["text"] for w in left_col)
right_text = " ".join(w["text"] for w in right_col)Region-based extraction
Extract text from a specific bounding box (useful for pulling out specific sections or ignoring headers/footers):
with pdfplumber.open("doc.pdf") as pdf:
page = pdf.pages[0]
# Crop to a specific region: (x0, y0, x1, y1) in PDF points
cropped = page.within_bbox((50, 100, 550, 700))
text = cropped.extract_text()pdfplumber fails on scanned PDFs (images without text layers) and heavily styled documents where visual layout relies on absolute positioning of individual characters rather than text runs.
Text extraction with PyMuPDF (fitz)
Use PyMuPDF when speed matters or when you need both text and images from the same document. It is 5-10x faster than pdfplumber for plain text extraction but less accurate for table structure.
Basic text extraction
import fitz # PyMuPDF
doc = fitz.open("doc.pdf")
for page in doc:
text = page.get_text("text") # plain text
print(f"Page {page.number}: {len(text)} chars")Output formats
PyMuPDF supports multiple output formats from the same page:
page = doc[0]
# Plain text -- fastest, loses all structure
plain = page.get_text("text")
# Structured dict -- blocks with positions, useful for layout analysis
blocks = page.get_text("dict")
for block in blocks["blocks"]:
if block["type"] == 0: # text block
for line in block["lines"]:
for span in line["spans"]:
print(f"Font: {span['font']}, Size: {span['size']}, Text: {span['text']}")
# HTML -- styled HTML, useful for preserving formatting
html = page.get_text("html")Image extraction
Extract embedded images from PDF pages (useful for multimodal pipelines that process both text and images):
doc = fitz.open("doc.pdf")
for page_num, page in enumerate(doc):
for img_index, img in enumerate(page.get_images(full=True)):
xref = img[0]
pix = fitz.Pixmap(doc, xref)
if pix.n >= 5: # CMYK -- convert to RGB
pix = fitz.Pixmap(fitz.csRGB, pix)
pix.save(f"page{page_num}_img{img_index}.png")OCR with Tesseract
Use Tesseract for scanned PDFs where there is no text layer, only rasterized images. Tesseract is free, runs locally, and supports 100+ languages. The tradeoff is accuracy: it struggles with poor scan quality, unusual fonts, and complex layouts.
Pipeline: rasterize then OCR
The standard pipeline converts each PDF page to an image, then runs OCR on the image:
from PIL import Image
import pytesseract
import pdf2image
# Rasterize PDF pages to images
# DPI matters: 300 is the minimum for reliable OCR; 150 works for large clean fonts
images = pdf2image.convert_from_path("scanned.pdf", dpi=300)
texts = []
for i, img in enumerate(images):
text = pytesseract.image_to_string(img, lang="eng")
texts.append(text)
print(f"Page {i}: {len(text)} chars extracted")Preprocessing for better OCR
Low-quality scans benefit from preprocessing before OCR. The most impactful steps are grayscale conversion, adaptive thresholding, and deskewing:
import cv2
import numpy as np
from PIL import Image
def preprocess_for_ocr(image: Image.Image) -> Image.Image:
"""Preprocess a scanned page image for better Tesseract accuracy."""
img_array = np.array(image)
# Convert to grayscale
gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
# Adaptive thresholding -- handles uneven lighting across the page
thresh = cv2.adaptiveThreshold(
gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
)
# Deskew: detect rotation angle and correct
coords = np.column_stack(np.where(thresh > 0))
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
angle = -(90 + angle)
else:
angle = -angle
h, w = thresh.shape
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
deskewed = cv2.warpAffine(
thresh, matrix, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE
)
return Image.fromarray(deskewed)Language packs
Install additional language packs for non-English documents. On Debian/Ubuntu: apt install tesseract-ocr-fra tesseract-ocr-deu. Then specify the language:
# Single language
text = pytesseract.image_to_string(img, lang="fra")
# Multiple languages (Tesseract tries each)
text = pytesseract.image_to_string(img, lang="eng+fra+deu")OCR with AWS Textract
Use Textract in production pipelines where accuracy matters more than cost, or when you need structured table and form extraction. Textract uses deep learning models trained on document layouts and consistently outperforms Tesseract on real-world documents.
Simple text detection
import boto3
client = boto3.client("textract", region_name="us-east-1")
# Synchronous -- single page, bytes input
with open("page.png", "rb") as f:
response = client.detect_document_text(Document={"Bytes": f.read()})
for block in response["Blocks"]:
if block["BlockType"] == "LINE":
print(f"Text: {block['Text']} (Confidence: {block['Confidence']:.1f}%)")Table and form extraction
with open("form.pdf", "rb") as f:
response = client.analyze_document(
Document={"Bytes": f.read()},
FeatureTypes=["TABLES", "FORMS"],
)
# Extract tables
tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
# Extract key-value pairs from forms
keys = [b for b in response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", [])]Async processing for multi-page PDFs
For documents stored in S3 with more than one page, use the async API:
# Start async job
job = client.start_document_text_detection(
DocumentLocation={
"S3Object": {"Bucket": "my-bucket", "Name": "docs/large.pdf"}
}
)
job_id = job["JobId"]
# Poll for completion
import time
while True:
result = client.get_document_text_detection(JobId=job_id)
if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
break
time.sleep(5)
# Paginate results
pages = []
next_token = None
while True:
kwargs = {"JobId": job_id}
if next_token:
kwargs["NextToken"] = next_token
result = client.get_document_text_detection(**kwargs)
pages.append(result)
next_token = result.get("NextToken")
if not next_token:
breakCost
Textract pricing (us-east-1): ~$1.50 per 1000 pages for text detection, ~$15 per 1000 pages for table/form analysis. Filter low-confidence blocks (below 90%) to avoid feeding noisy extractions to your LLM.
Chunking strategies
LLMs have context window limits, and RAG retrieval works best with focused chunks that contain a single topic or answer. Naive splitting (every N characters) breaks sentences and loses context. The right chunking strategy depends on your document structure and downstream task.
Fixed-size with overlap
Split every chunk_size characters with overlap characters carried over. Simple, predictable, and sufficient for homogeneous text:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
chunks = splitter.split_text(document_text)
# Overlap ensures sentences at boundaries aren't lostRecursive character splitting
Tries to split on paragraph breaks first, then sentences, then words, then characters. This is the default and best general-purpose strategy:
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)The separator hierarchy means the splitter preserves paragraph boundaries wherever possible, only falling back to sentence or word breaks when a paragraph exceeds chunk_size.
Semantic chunking
Split based on embedding similarity. Compute embeddings for sentences, group consecutive sentences whose embeddings are similar, split where similarity drops below a threshold:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings()
splitter = SemanticChunker(
embeddings_model,
breakpoint_threshold_type="percentile",
)
chunks = splitter.create_documents([document_text])Semantic chunking is more expensive (requires embedding each sentence) but produces more coherent chunks for heterogeneous documents -- reports that shift between financial data, narrative, and technical appendices.
Markdown/HTML-aware splitting
When documents have clear heading hierarchy (technical docs, reports with sections), split on headers to preserve document structure:
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)
# Each chunk includes metadata about which headers it falls underChunk size guidance
The right chunk size depends on the downstream task:
- RAG retrieval: 500-1000 tokens. Small enough to be specific to a single topic, large enough to contain sufficient context for the LLM to generate a useful answer.
- Summarization: 2000-4000 tokens. Larger chunks let the model see more context at once.
- Always include metadata with each chunk: source document path, page number, section header. This metadata is critical for attribution and debugging retrieval quality.
Gotchas
PDF text extraction order is not guaranteed. Multi-column PDFs may interleave columns when you call extract_text(). Use word-level bounding boxes (extract_words() in pdfplumber or get_text("dict") in PyMuPDF) to detect columns and sort by reading order.
Scanned PDFs with a text layer. Some PDFs have both rasterized images and a text layer. The text layer may be low-quality OCR from the original scanner. Always check extraction quality before assuming the text layer is authoritative -- compare a sample page's extracted text against the visual content.
Character encoding: fi ligatures. The character fi (fi ligature, U+FB01) appears in many PDFs instead of the two characters "fi". This breaks keyword search and confuses tokenizers. Normalize with:
import unicodedata
text = unicodedata.normalize("NFKD", text)
# "finance" becomes "finance"Empty pages. Always filter out chunks that are empty or contain only whitespace before sending to embedding models or LLMs. Empty chunks waste tokens and pollute retrieval results.
Memory with large documents. Processing a 500-page PDF with pdfplumber or PyMuPDF keeps the entire document in memory. For very large documents, process pages in batches:
import fitz
doc = fitz.open("large.pdf")
batch_size = 20
for start in range(0, len(doc), batch_size):
batch_pages = [doc[i] for i in range(start, min(start + batch_size, len(doc)))]
for page in batch_pages:
text = page.get_text("text")
# process text...
# Explicit cleanup if memory is tight
import gc
gc.collect()