I Spent Weeks Comparing PDF Extraction Architectures So You Don't Have To
Here's the thing about extracting financial data from PDFs: everyone thinks it's a solved problem until they actually try it.
I was building an AI-powered financial data extraction pipeline — the kind where you're pulling structured numbers from multi-entity consolidated statements, cross-referencing against Excel files, and doing it all under NDA. Real institutional audit workflows. The kind of thing where getting a number wrong isn't "oops", it's a compliance incident.
Our initial approach? Gemini API's File Search with Corpora RAG. Seemed reasonable. Vector store, semantic search, structured extraction. Should work, right?
It was terrible. Not "needs-some-tuning" terrible. Fundamentally, architecturally wrong terrible.
The vector store chunking was destroying table structure. Merged cells, multi-column layouts, hierarchical row groupings — all of it shredded into meaningless text fragments. We were getting numbers back that looked plausible but were pulled from the wrong rows, the wrong columns, sometimes the wrong page entirely.
So I did what any reasonable person would do: I tested 14 different approaches and built a comparison matrix. Because apparently that's who I am now.
The Big Table (a.k.a. The Reason This Article Exists)
| # | Option | Platform | Storage | How It Reads | Table Integrity | Cost | Privacy | Latency | Best For |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Local OCR + LLM | Docling + OpenRouter | Local only | Deterministic OCR → Markdown → LLM | High (97.9%) | ~$3/M tokens | Highest — PDF local | ~2 min/PDF | Quick local dev |
| 2 | Local OCR + Tree Index + LLM | Docling + PageIndex + Pydantic AI | Local; cached tree JSONs | Docling OCR + PageIndex targets ~12K chars to LLM | High | $10–15/run | Highest — PDF local | 30–45 min first; 5–10 min cached | Complex multi-institution PDFs |
| 3 | Local Tree Index + Vision LLM | PageIndex + Vision via OpenRouter | Local only | Agentic tree nav → page images to Vision LLM | Perfect | $0.005–0.05/page | Highest — PDF local | Slow (vision API) | Complex visual layouts, charts |
| 4 | Google AI Studio — Native PDF | Gemini Developer API | Google temp (48h auto-delete) | Native vision — unchunked PDF in 1M context window | Perfect | Lowest (~$0.15/M tokens, cacheable) | Low/Medium | Fast — single call | Rapid prototyping |
| 5 | Gemini Flash via OpenRouter | Gemini 2.5 Flash via OpenRouter | No storage | Local text extraction → Gemini Flash | High | Low ($0.15/M + caching) | Medium | Fast | Budget-friendly production |
| 6 | Google Vertex AI — Native PDF | Vertex AI on GCP | Private GCS bucket | Native vision — same engine as #4, enterprise infra | Perfect | Lowest | Highest (SOC/HIPAA/FedRAMP) | Fast — enterprise SLAs | Production enterprise |
| 7 | Gemini API — File Search (Corpora RAG) | Gemini API File Search | Google Corpora | Managed RAG — chops PDF into chunks → vector DB | Terrible | Low | Low/Medium | Medium | General Q&A — NOT financial extraction |
| 8 | OpenAI Responses API — File Search | OpenAI Responses API | OpenAI Vector Store | 800-token chunked RAG | Terrible | Variable | Medium (enterprise tiers) | Slow — multi-turn loops | Conversational Q&A — NOT batch validation |
| 9 | Claude — Direct PDF Upload | Claude API / Bedrock / Vertex AI | Ephemeral (per-call) | Native vision — ~100 pages / 30MB limit | High | High ($3/M input) | Medium–Highest (Bedrock/Vertex) | Medium | Single-doc analysis, under 100 pages |
| 10 | LlamaParse + LLM | LlamaParse | LlamaIndex Cloud (48h cache) | Cloud OCR + layout analysis (~6s/doc) | Medium (struggles with currency/footnotes) | Low ($0.003–0.09/page) | Low — PDF uploaded | Fast parsing | Quick cloud OCR |
| 11 | Amazon Textract + Bedrock | Textract + Bedrock | Private S3 | AnalyzeTables API — cell relationship mapping | High | Medium ($15/1K pages) | Highest (GovCloud) | Medium | AWS-native orgs |
| 12 | Azure Doc Intelligence + Azure OpenAI | Azure AI + Azure OpenAI | Private Azure Blob | Layout model ($1.50/1K pages); custom training | High | Medium | Highest (Azure Gov) | Medium | Microsoft-native orgs |
| 13 | Unstructured.io + LLM | Unstructured (OSS/SaaS) | Local or Cloud | Partitioning + chunking strategies | Medium | Low ($0.03/page SaaS) | High (self-hosted) | Medium | ETL pipelines |
| 14 | Self-hosted RAG with SurrealDB | SurrealDB + HNSW vector search + LangChain / LlamaIndex | Self-hosted or Cloud | Custom RAG — you control chunking, graph relationships | Depends on chunking | Low (self-hosted free; embedding costs) | Highest (self-hosted) | Medium — setup required | Full control, graph relationships |
Let me save you some time on the two most important rows:
Option 6 (Vertex AI with native PDF) is highlighted for a reason. Perfect table integrity. Lowest cost. Highest compliance posture. If you're building for enterprise and you're on GCP, stop reading and go implement this.
Options 7 and 8 are highlighted for a very different reason. They are terrible for financial extraction. Not mediocre. Not "fine for most cases." Terrible. And I say this as someone who burned real time and money discovering it the hard way.
Why Vector-Store RAG Fails for Financial Tables
This deserves its own section because it's the most counterintuitive finding — and the one most likely to waste your time if you're coming from a general RAG background.
Vector-store RAG works brilliantly for Q&A over prose documents. It is the wrong architecture for structured financial data.
Here's what happens: the chunking step — the very foundation of the RAG pipeline — tears apart the spatial relationships that give financial tables their meaning. A cell that says "$1,247,893" is meaningless without knowing it belongs to "Net Premium Written" for "Q3 2024" under "Subsidiary A." That context is spatial, not semantic. It lives in the row headers, column headers, and page layout — all of which get destroyed when you chunk text into 512-token windows and embed them into a vector space.
The model then retrieves "relevant" chunks, confidently assembles an answer, and gives you a number that looks right but is from the wrong entity on a different page. If you're lucky, the number is obviously wrong. If you're unlucky (and in financial data, you usually are), it's close enough to pass a cursory check.
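To make the failure concrete, here's a toy chunker (not any vendor's actual implementation) run over a four-line markdown table:

```python
# Toy illustration: fixed-size chunking severs a value from its headers.
# Real chunkers are smarter about boundaries, but the failure mode is the same.

table = (
    "| Line Item            | Q2 2024    | Q3 2024    |\n"
    "|----------------------|------------|------------|\n"
    "| Net Premium Written  | $1,198,402 | $1,247,893 |\n"
    "| Net Premium Earned   | $1,150,330 | $1,201,558 |\n"
)

CHUNK_SIZE = 40  # characters, standing in for a small token window

chunks = [table[i : i + CHUNK_SIZE] for i in range(0, len(table), CHUNK_SIZE)]
for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---\n{chunk}")

# No single chunk holds both "Net Premium Written" and "$1,247,893", and the
# "Q3 2024" column header sits in a different chunk entirely. Once these
# fragments are embedded separately, the number is spatially orphaned.
```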
This is why Options 4, 6, and 9 work so much better — they process the rendered page as an image. The model "sees" the table as a human would. No chunking, no spatial destruction.
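That loop is simple to reproduce. Here's a minimal sketch of the render-then-look approach, assuming pdf2image (which requires Poppler installed) and OpenRouter's OpenAI-compatible endpoint; the model slug, page number, and prompt are placeholders:

```python
import base64
import io

from openai import OpenAI  # OpenRouter speaks the OpenAI wire protocol
from pdf2image import convert_from_path  # requires Poppler on the system

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# Render one page of the statement as an image: no text extraction, no chunking.
pages = convert_from_path("statement.pdf", dpi=200, first_page=5, last_page=5)
buf = io.BytesIO()
pages[0].save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # any vision-capable model on OpenRouter
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table on this page as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```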
The One I'd Actually Recommend Building With: Option 2
If you're doing what I'm doing — multi-entity consolidated statements, NDA-sensitive documents, cross-validation against source spreadsheets — Option 2 is the move.
How It Works
- Docling (IBM Research's document parser) handles OCR locally. No PDF ever leaves your machine. It converts complex layouts into structured markdown with table preservation.
- PageIndex builds a tree-structured index — not a vector store. This preserves the page-level and section-level hierarchy that financial documents rely on. The LLM sees only the relevant ~12K characters, not the whole 170-page PDF.
- Pydantic AI orchestrates the LLM calls through OpenRouter, enforcing strict output schemas. You define exactly what the extracted data should look like, and the model fills it in (a sketch of the full loop follows this list).
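A minimal sketch of that loop, under a few stated assumptions: Docling's DocumentConverter and export_to_markdown are the real API; the section-selection line is a stand-in for PageIndex's tree navigation; and the call goes through OpenRouter's OpenAI-compatible endpoint with plain Pydantic validation, since Pydantic AI's Agent parameter names have shifted between versions:

```python
import json

from docling.document_converter import DocumentConverter
from openai import OpenAI
from pydantic import BaseModel

class PremiumLine(BaseModel):
    entity: str
    line_item: str
    period: str
    amount_usd: float

class Extraction(BaseModel):
    lines: list[PremiumLine]

# 1. Local OCR: the PDF never leaves the machine.
doc = DocumentConverter().convert("consolidated_q3.pdf").document
markdown = doc.export_to_markdown()  # tables preserved as markdown

# 2. Stand-in for PageIndex tree navigation: send only the relevant
#    ~12K-character section, not the whole 170-page document.
section = markdown[:12_000]  # a real build walks the tree index here

# 3. Structured extraction via OpenRouter, validated against the schema.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[
        {"role": "system",
         "content": "Return only JSON matching this schema: "
                    + json.dumps(Extraction.model_json_schema())},
        {"role": "user", "content": section},
    ],
    response_format={"type": "json_object"},  # support varies by model
)
result = Extraction.model_validate_json(resp.choices[0].message.content)
print(result.lines[0])
```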
The Numbers
| Dimension | Rating | Detail |
|---|---|---|
| Table Integrity | STRONG | 97.9% on internal benchmark suite — deterministic OCR preserves tables before text conversion |
| Navigation Accuracy | STRONG | 98.7% on FinanceBench page-level retrieval — multi-signal institution name resolver (vision + text + page continuity) |
| Cost | MODERATE | ~$2–4 tree building (GPT-4o) + ~$5–8 extraction (Claude Sonnet 4 × N institutions) + ~$0.50 resolution (Gemini 2.5 Flash). Total ~$10–15/full run |
| Privacy | HIGHEST | PDFs never leave local — only extracted text hits LLM APIs |
| Latency (first run) | SLOW | 30–45 min (Docling + PageIndex ETL in parallel + full extraction) |
| Latency (cached re-run) | FAST | 5–10 min (index cached, only LLM calls — re-runs skip tree cost entirely) |
That first-run latency is rough, I won't sugarcoat it. 30–45 minutes for a full run is a lot of staring at progress bars. But the cache changes everything — subsequent runs with the same documents drop to 5–10 minutes, which is entirely workable for iterative development.
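The cache itself doesn't need to be clever: key on a hash of the PDF bytes, store the tree JSON, rebuild only on change. A sketch, with a hypothetical build_tree() standing in for the PageIndex build step:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".pageindex_cache")
CACHE_DIR.mkdir(exist_ok=True)

def tree_for(pdf_path: str) -> dict:
    """Return the tree index, rebuilding only if the PDF bytes changed."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cached = CACHE_DIR / f"{digest}.json"
    if cached.exists():
        return json.loads(cached.read_text())  # 5-10 min path: skip the build
    tree = build_tree(pdf_path)  # hypothetical: the 30-45 min ETL step
    cached.write_text(json.dumps(tree))
    return tree
```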
The key insight: tree-structured indexing preserves the spatial hierarchy that vector stores destroy. It's slower, it's more complex to set up, but for financial data — it's the difference between 97.9% accuracy and garbage.
What Other Options Would Add
| Comparison | Benefit | Trade-off |
|---|---|---|
| Option 4/6 instead | Eliminates ETL entirely. Native vision = zero OCR errors. 90% caching discount. | Lose local PDF privacy (uploaded to Google). Lose fine-grained section navigation — whole PDF in context; LLM may pick wrong section for similarly-named entities. |
| Option 14 instead | Graph relationships between institutions, PDFs, and extractions. Persistent queryable index. Custom chunking control. | Significant setup overhead. Overkill for quarterly batch runs unless scaling to many clients. |
Platform Access: Who's Available Where
This table matters more than you'd think — especially when your compliance team tells you "it has to run on [specific cloud]."
| Model Family | Direct API | OpenRouter | AWS Bedrock | Google Vertex AI | Azure |
|---|---|---|---|---|---|
| Gemini (Google) | AI Studio | ✓ | — | ✓ | — |
| GPT-4o / GPT-4.1 (OpenAI) | OpenAI API | ✓ | — | — | Azure OpenAI |
| Claude (Anthropic) | Anthropic API | ✓ | ✓ | ✓ | — |
| Llama (Meta) | Meta | ✓ | ✓ | ✓ | Azure AI |
| Mistral | La Plateforme | ✓ | ✓ | ✓ | Azure AI |
Notice that Claude is the most available across enterprise platforms (Bedrock + Vertex AI), while Gemini is locked to Google's ecosystem. This has real implications for multi-cloud strategies and vendor diversification.
Decision Framework (The Opinionated Version)
"I need this working yesterday"
Options 4 or 6 (Gemini native PDF). Upload the PDF, get structured data back. Perfect table integrity. Under a dollar per run. The catch: your PDFs go to Google. If that's fine, this is the fastest path to a working demo.
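If you want to see how little code that is, here's a minimal sketch assuming the google-genai Python SDK (upload keyword names have shifted between SDK versions); the prompt is a placeholder:

```python
from google import genai

client = genai.Client(api_key="...")  # Gemini Developer API key from AI Studio

# Upload the whole PDF: no chunking, Gemini reads rendered pages natively.
# Files uploaded this way are auto-deleted after 48 hours.
pdf = client.files.upload(file="consolidated_q3.pdf")

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[pdf, "Extract every table on pages 12-14 as JSON, "
                   "preserving row and column headers exactly."],
)
print(resp.text)
```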
"NDAs are involved and the data is sensitive"
Option 2 (Docling + PageIndex + OpenRouter). PDFs stay local. Only extracted text (not the original documents) hits LLM APIs. This is the architecture I'd trust with institutional audit data.
"I want full control over everything"
Option 14 (SurrealDB + custom pipeline). Self-hosted, graph-native data model, total flexibility. High build effort. Worth it if you're building a platform, not a feature.
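For orientation, a rough sketch of what the SurrealDB side looks like, assuming the official surrealdb Python client (connection and signin details vary across SDK versions) and SurrealDB 2.x index syntax; the record IDs and embedding are placeholders:

```python
from surrealdb import Surreal

# Placeholder inputs: in a real pipeline these come from your ingestion
# and embedding steps.
pdf_id = "pdf:consolidated_q3"
sec_id = "section:balance_sheet"
query_embedding = [0.0] * 768

with Surreal("ws://localhost:8000/rpc") as db:
    db.signin({"username": "root", "password": "root"})
    db.use("audit", "extractions")

    # HNSW vector index over section embeddings.
    db.query(
        "DEFINE INDEX section_vec ON section "
        "FIELDS embedding HNSW DIMENSION 768 DIST COSINE;"
    )

    # Graph edge linking a section back to its source PDF: the kind of
    # relationship a plain vector store can't express.
    db.query("RELATE $pdf->contains->$sec;", {"pdf": pdf_id, "sec": sec_id})

    # K-nearest-neighbour retrieval via the <|k,ef|> operator.
    rows = db.query(
        "SELECT id, institution FROM section WHERE embedding <|5,40|> $q;",
        {"q": query_embedding},
    )
    print(rows)
```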
"Can I combine approaches?"
Yes, and you probably should.
- Option 2 + Option 9 (Claude second-pass): Run the Docling pipeline for primary extraction, then use Claude's direct PDF upload via Bedrock for targeted verification of flagged discrepancies. Claude is expensive ($1.50–3 per read) but excellent at single-document deep analysis. Use it surgically, not as your primary extraction engine.
- Option 2 + Option 5 (Gemini Flash first-pass): Use Gemini Flash via OpenRouter for cheap initial triage — classify documents, identify page ranges, flag anomalies — then feed the interesting ones into the full Docling pipeline. Cuts cost significantly on high-volume workflows (see the sketch below).
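The triage-then-escalate combo is ultimately just a dispatch loop. A sketch with hypothetical classify_with_flash() and run_docling_pipeline() helpers standing in for Options 5 and 2:

```python
def process_batch(pdf_paths: list[str]) -> list[dict]:
    """Cheap first pass on everything; expensive pipeline only where needed."""
    results = []
    for path in pdf_paths:
        # Hypothetical: Gemini Flash via OpenRouter returns doc type,
        # page ranges, and anomaly flags for a few cents.
        triage = classify_with_flash(path)
        if triage["anomalies"] or triage["doc_type"] == "consolidated":
            # Hypothetical: the full Docling + PageIndex extraction.
            results.append(run_docling_pipeline(path, pages=triage["page_ranges"]))
        else:
            results.append({"path": path, "status": "triage-only", **triage})
    return results
```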
What I Actually Learned
After weeks of testing, benchmarking, and occasionally cursing at my terminal, here's what stuck:
1. The "RAG everything" instinct is wrong for structured data. I came in thinking our vector-store approach just needed better chunking or a smarter embedding model. It didn't. The architecture itself was fundamentally mismatched for the problem. Sometimes the answer isn't "tune harder" — it's "choose a different approach."
2. Vision models changed the game. The moment I saw Gemini process a rendered PDF page and correctly extract a 47-row table with merged cells and hierarchical groupings, I realized how much time I'd wasted trying to make text-based extraction work on visually complex layouts.
3. Privacy constraints are actually a useful design forcing function. Being forced to keep PDFs local pushed me toward the Docling + PageIndex architecture, which turned out to be more accurate than the cloud alternatives for complex multi-entity documents. The constraint made the solution better.
4. Cost varies by 30x and it's not correlated with quality. Gemini native PDF at ~$0.50/run achieves perfect table integrity. Textract at $15/1K pages achieves "high." More expensive ≠ better. Do your own benchmarks (a minimal harness sketch follows this list).
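"Do your own benchmarks" can be as lightweight as a cell-level diff against hand-labeled ground truth. A sketch, assuming both sides are flat dicts keyed by (entity, line item, period):

```python
def cell_accuracy(extracted: dict, truth: dict, tol: float = 0.005) -> float:
    """Fraction of ground-truth cells the pipeline got right.

    Keys are (entity, line_item, period) tuples; values are floats.
    A cell counts as correct within `tol` relative error, since
    exact-match scoring unfairly penalizes rounding in the source PDF.
    """
    hits = 0
    for key, expected in truth.items():
        got = extracted.get(key)
        if got is not None and abs(got - expected) <= tol * abs(expected):
            hits += 1
    return hits / len(truth)

# One wrong-entity pull (the classic RAG failure) scores zero even though
# the number itself looks plausible.
truth = {("Subsidiary A", "Net Premium Written", "Q3 2024"): 1_247_893.0}
extracted = {("Subsidiary A", "Net Premium Written", "Q3 2024"): 1_198_402.0}
print(cell_accuracy(extracted, truth))  # 0.0
```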
The right architecture depends on your documents, your constraints, and your existing infrastructure. But if you're doing financial extraction with a vector-store RAG pipeline — please stop. I'm saying this with love.
Sources
- Docling — IBM Research document parsing library
- OpenRouter — unified LLM gateway API
- Pydantic AI — structured output framework for LLM agents
- Google Vertex AI Gemini API documentation
- Google AI Studio (Gemini Developer API) documentation
- Gemini 2.5 Flash model card and pricing
- OpenAI Responses API — File Search documentation
- Anthropic Claude PDF support documentation
- LlamaParse documentation and benchmarks
- Amazon Textract pricing and feature documentation
- Azure Document Intelligence documentation
- Unstructured.io documentation (OSS and SaaS)
- SurrealDB documentation — multi-model database
- PDF Data Extraction Benchmark 2025 (Procycons)
- PDF Table Extraction Showdown: Docling vs LlamaParse vs Unstructured (BoringBot)
- PageIndex: Beyond Vectors (TypeVar)
- Google Cloud compliance certifications
- Claude on Amazon Bedrock
- OpenRouter model directory
- PageIndex tree-structured document navigation
- Vertex AI FedRAMP High (Google Cloud Blog)
- OpenAI Assistants vs Responses API comparison (Ragwalla)