I Spent Weeks Comparing PDF Extraction Architectures So You Don't Have To
Here's the thing about extracting financial data from PDFs: everyone thinks it's a solved problem until they actually try it.
I was building an AI-powered financial data extraction pipeline — the kind where you're pulling structured numbers from multi-entity consolidated statements, cross-referencing against Excel files, and doing it all under NDA. Real institutional audit workflows. The kind of thing where getting a number wrong isn't "oops", it's a compliance incident.
Our initial approach? Gemini API's File Search with Corpora RAG. Seemed reasonable. Vector store, semantic search, structured extraction. Should work, right?
It was terrible. Not "needs-some-tuning" terrible. Fundamentally, architecturally wrong terrible.
The vector store chunking was destroying table structure. Merged cells, multi-column layouts, hierarchical row groupings — all of it shredded into meaningless text fragments. We were getting numbers back that looked plausible but were pulled from the wrong rows, the wrong columns, sometimes the wrong page entirely.
So I did what any reasonable person would do: I tested 14 different approaches and built a comparison matrix. Because apparently that's who I am now.
The Big Table (a.k.a. The Reason This Article Exists)
| # | Option | Platform | Storage | How It Reads | Table Integrity | Cost | Privacy | Latency | Best For |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Local OCR + LLM | Docling + OpenRouter | Local only | Deterministic OCR → Markdown → LLM | High (97.9%) | ~$3/M tokens | Highest — PDF local | ~2 min/PDF | Quick local dev |
| 2 | Local OCR + Tree Index + LLM | Docling + PageIndex + Pydantic AI | Local; cached tree JSONs | Docling OCR + PageIndex targets ~12K chars to LLM | High | $10–15/run | Highest — PDF local | 30–45 min first; 5–10 min cached | Complex multi-institution PDFs |
| 3 | Local Tree Index + Vision LLM | PageIndex + Vision via OpenRouter | Local only | Agentic tree nav → page images to Vision LLM | Perfect | $0.005–0.05/page | Highest — PDF local | Slow (vision API) | Complex visual layouts, charts |
| 4 | Google AI Studio — Native PDF | Gemini Developer API | Google temp (48h auto-delete) | Native vision — unchunked PDF in 1M context window | Perfect | Lowest (~$0.15/M tokens, cacheable) | Low/Medium | Fast — single call | Rapid prototyping |
| 5 | Gemini Flash via OpenRouter | Gemini 2.5 Flash via OpenRouter | No storage | Local text extraction → Gemini Flash | High | Low ($0.15/M + caching) | Medium | Fast | Budget-friendly production |
| 6 | Google Vertex AI — Native PDF | Vertex AI on GCP | Private GCS bucket | Native vision — same engine as #4, enterprise infra | Perfect | Lowest | Highest (SOC/HIPAA/FedRAMP) | Fast — enterprise SLAs | Production enterprise |
| 7 | Gemini API — File Search (Corpora RAG) | Gemini API File Search | Google Corpora | Managed RAG — chops PDF into chunks → vector DB | Terrible | Low | Low/Medium | Medium | General Q&A — NOT financial extraction |
| 8 | OpenAI Responses API — File Search | OpenAI Responses API | OpenAI Vector Store | 800-token chunked RAG | Terrible | Variable | Medium (enterprise tiers) | Slow — multi-turn loops | Conversational Q&A — NOT batch validation |
| 9 | Claude — Direct PDF Upload | Claude API / Bedrock / Vertex AI | Ephemeral (per-call) | Native vision — ~100 pages / 30MB limit | High | High ($3/M input) | Medium–Highest (Bedrock/Vertex) | Medium | Single-doc analysis, under 100 pages |
| 10 | LlamaParse + LLM | LlamaParse | LlamaIndex Cloud (48h cache) | Cloud OCR + layout analysis (~6s/doc) | Medium (struggles with currency/footnotes) | Low ($0.003–0.09/page) | Low — PDF uploaded | Fast parsing | Quick cloud OCR |
| 11 | Amazon Textract + Bedrock | Textract + Bedrock | Private S3 | AnalyzeTables API — cell relationship mapping | High | Medium ($15/1K pages) | Highest (GovCloud) | Medium | AWS-native orgs |
| 12 | Azure Doc Intelligence + Azure OpenAI | Azure AI + Azure OpenAI | Private Azure Blob | Layout model ($1.50/1K pages); custom training | High | Medium | Highest (Azure Gov) | Medium | Microsoft-native orgs |
| 13 | Unstructured.io + LLM | Unstructured (OSS/SaaS) | Local or Cloud | Partitioning + chunking strategies | Medium | Low ($0.03/page SaaS) | High (self-hosted) | Medium | ETL pipelines |
| 14 | Self-hosted RAG with SurrealDB | SurrealDB + HNSW vector search + LangChain / LlamaIndex | Self-hosted or Cloud | Custom RAG — you control chunking, graph relationships | Depends on chunking | Low (self-hosted free; embedding costs) | Highest (self-hosted) | Medium — setup required | Full control, graph relationships |
Let me save you some time on the two most important rows:
Option 6 (Vertex AI with native PDF) is highlighted for a reason. Perfect table integrity. Lowest cost. Highest compliance posture. If you're building for enterprise and you're on GCP, stop reading and go implement this.
Options 7 and 8 are highlighted for a very different reason. They are terrible for financial extraction. Not mediocre. Not "fine for most cases." Terrible. And I say this as someone who burned real time and money discovering it the hard way.
Why Vector-Store RAG Fails for Financial Tables
This deserves its own section because it's the most counterintuitive finding — and the one most likely to waste your time if you're coming from a general RAG background.
Vector-store RAG works brilliantly for Q&A over prose documents. It is the wrong architecture for structured financial data.
Here's what happens: the chunking step — the very foundation of the RAG pipeline — tears apart the spatial relationships that give financial tables their meaning. A cell that says "$1,247,893" is meaningless without knowing it belongs to "Net Premium Written" for "Q3 2024" under "Subsidiary A." That context is spatial, not semantic. It lives in the row headers, column headers, and page layout — all of which get destroyed when you chunk text into 512-token windows and embed them into a vector space.
The model then retrieves "relevant" chunks, confidently assembles an answer, and gives you a number that looks right but is from the wrong entity on a different page. If you're lucky, the number is obviously wrong. If you're unlucky (and in financial data, you usually are), it's close enough to pass a cursory check.
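To make the failure concrete, here's a toy chunker (not any vendor's actual implementation) run over a four-line markdown table:

```python
# Toy illustration: fixed-size chunking severs a value from its headers.
# Real chunkers are smarter about boundaries, but the failure mode is the same.

table = (
    "| Line Item            | Q2 2024    | Q3 2024    |\n"
    "|----------------------|------------|------------|\n"
    "| Net Premium Written  | $1,198,402 | $1,247,893 |\n"
    "| Net Premium Earned   | $1,150,330 | $1,201,558 |\n"
)

CHUNK_SIZE = 40  # characters, standing in for a small token window

chunks = [table[i : i + CHUNK_SIZE] for i in range(0, len(table), CHUNK_SIZE)]
for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---\n{chunk}")

# No single chunk holds both "Net Premium Written" and "$1,247,893", and the
# "Q3 2024" column header sits in a different chunk entirely. Once these
# fragments are embedded separately, the number is spatially orphaned.
```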
This is why Options 4, 6, and 9 work so much better — they process the rendered page as an image. The model "sees" the table as a human would. No chunking, no spatial destruction.
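That loop is simple to reproduce. Here's a minimal sketch of the render-then-look approach, assuming pdf2image (which requires Poppler installed) and OpenRouter's OpenAI-compatible endpoint; the model slug, page number, and prompt are placeholders:

```python
import base64
import io

from openai import OpenAI  # OpenRouter speaks the OpenAI wire protocol
from pdf2image import convert_from_path  # requires Poppler on the system

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

# Render one page of the statement as an image: no text extraction, no chunking.
pages = convert_from_path("statement.pdf", dpi=200, first_page=5, last_page=5)
buf = io.BytesIO()
pages[0].save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # any vision-capable model on OpenRouter
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the table on this page as JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```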
The One I'd Actually Recommend Building With: Option 2
If you're doing what I'm doing — multi-entity consolidated statements, NDA-sensitive documents, cross-validation against source spreadsheets — Option 2 is the move.
How It Works
- Docling (IBM Research's document parser) handles OCR locally. No PDF ever leaves your machine. It converts complex layouts into structured markdown with table preservation.
- PageIndex builds a tree-structured index — not a vector store. This preserves the page-level and section-level hierarchy that financial documents rely on. The LLM sees only the relevant ~12K characters, not the whole 170-page PDF.
- Pydantic AI orchestrates the LLM calls through OpenRouter, enforcing strict output schemas. You define exactly what the extracted data should look like, and the model fills it in (a sketch of the full loop follows this list).
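A minimal sketch of that loop, under a few stated assumptions: Docling's DocumentConverter and export_to_markdown are the real API; the section-selection line is a stand-in for PageIndex's tree navigation; and the call goes through OpenRouter's OpenAI-compatible endpoint with plain Pydantic validation, since Pydantic AI's Agent parameter names have shifted between versions:

```python
import json

from docling.document_converter import DocumentConverter
from openai import OpenAI
from pydantic import BaseModel

class PremiumLine(BaseModel):
    entity: str
    line_item: str
    period: str
    amount_usd: float

class Extraction(BaseModel):
    lines: list[PremiumLine]

# 1. Local OCR: the PDF never leaves the machine.
doc = DocumentConverter().convert("consolidated_q3.pdf").document
markdown = doc.export_to_markdown()  # tables preserved as markdown

# 2. Stand-in for PageIndex tree navigation: send only the relevant
#    ~12K-character section, not the whole 170-page document.
section = markdown[:12_000]  # a real build walks the tree index here

# 3. Structured extraction via OpenRouter, validated against the schema.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[
        {"role": "system",
         "content": "Return only JSON matching this schema: "
                    + json.dumps(Extraction.model_json_schema())},
        {"role": "user", "content": section},
    ],
    response_format={"type": "json_object"},  # support varies by model
)
result = Extraction.model_validate_json(resp.choices[0].message.content)
print(result.lines[0])
```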
The Numbers
| Dimension | Rating | Detail |
|---|---|---|
| Table Integrity | STRONG | 97.9% on internal benchmark suite — deterministic OCR preserves tables before text conversion |
| Navigation Accuracy | STRONG | 98.7% on FinanceBench page-level retrieval — multi-signal institution name resolver (vision + text + page continuity) |
| Cost | MODERATE | ~$2–4 tree building (GPT-4o) + ~$5–8 extraction (Claude Sonnet 4 × N institutions) + ~$0.50 resolution (Gemini 2.5 Flash). Total ~$10–15/full run |
| Privacy | HIGHEST | PDFs never leave local — only extracted text hits LLM APIs |
| Latency (first run) | SLOW | 30–45 min (Docling + PageIndex ETL in parallel + full extraction) |
| Latency (cached re-run) | FAST | 5–10 min (index cached, only LLM calls — re-runs skip tree cost entirely) |
That first-run latency is rough, I won't sugarcoat it. 30–45 minutes for a full run is a lot of staring at progress bars. But the cache changes everything — subsequent runs with the same documents drop to 5–10 minutes, which is entirely workable for iterative development.
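The cache itself doesn't need to be clever: key on a hash of the PDF bytes, store the tree JSON, rebuild only on change. A sketch, with a hypothetical build_tree() standing in for the PageIndex build step:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".pageindex_cache")
CACHE_DIR.mkdir(exist_ok=True)

def tree_for(pdf_path: str) -> dict:
    """Return the tree index, rebuilding only if the PDF bytes changed."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cached = CACHE_DIR / f"{digest}.json"
    if cached.exists():
        return json.loads(cached.read_text())  # 5-10 min path: skip the build
    tree = build_tree(pdf_path)  # hypothetical: the 30-45 min ETL step
    cached.write_text(json.dumps(tree))
    return tree
```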
The key insight: tree-structured indexing preserves the spatial hierarchy that vector stores destroy. It's slower, it's more complex to set up, but for financial data — it's the difference between 97.9% accuracy and garbage.
What Other Options Would Add
| Comparison | Benefit | Trade-off |
|---|---|---|
| Option 4/6 instead | Eliminates ETL entirely. Native vision = zero OCR errors. 90% caching discount. | Lose local PDF privacy (uploaded to Google). Lose fine-grained section navigation — whole PDF in context; LLM may pick wrong section for similarly-named entities. |
| Option 14 instead | Graph relationships between institutions, PDFs, and extractions. Persistent queryable index. Custom chunking control. | Significant setup overhead. Overkill for quarterly batch runs unless scaling to many clients. |
Platform Access: Who's Available Where
This table matters more than you'd think — especially when your compliance team tells you "it has to run on [specific cloud]."
| Model Family | Direct API | OpenRouter | AWS Bedrock | Google Vertex AI | Azure |
|---|---|---|---|---|---|
| Gemini (Google) | AI Studio | ✓ | — | ✓ | — |
| GPT-4o / GPT-4.1 (OpenAI) | OpenAI API | ✓ | — | — | Azure OpenAI |
| Claude (Anthropic) | Anthropic API | ✓ | ✓ | ✓ | — |
| Llama (Meta) | Meta | ✓ | ✓ | ✓ | Azure AI |
| Mistral | La Plateforme | ✓ | ✓ | ✓ | Azure AI |
Notice that Claude is the most available across enterprise platforms (Bedrock + Vertex AI), while Gemini is locked to Google's ecosystem. This has real implications for multi-cloud strategies and vendor diversification.
Decision Framework (The Opinionated Version)
"I need this working yesterday"
Options 4 or 6 (Gemini native PDF). Upload the PDF, get structured data back. Perfect table integrity. Under a dollar per run. The catch: your PDFs go to Google. If that's fine, this is the fastest path to a working demo.
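If you want to see how little code that is, here's a minimal sketch assuming the google-genai Python SDK (upload keyword names have shifted between SDK versions); the prompt is a placeholder:

```python
from google import genai

client = genai.Client(api_key="...")  # Gemini Developer API key from AI Studio

# Upload the whole PDF: no chunking, Gemini reads rendered pages natively.
# Files uploaded this way are auto-deleted after 48 hours.
pdf = client.files.upload(file="consolidated_q3.pdf")

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[pdf, "Extract every table on pages 12-14 as JSON, "
                   "preserving row and column headers exactly."],
)
print(resp.text)
```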
"NDAs are involved and the data is sensitive"
Option 2 (Docling + PageIndex + OpenRouter). PDFs stay local. Only extracted text (not the original documents) hits LLM APIs. This is the architecture I'd trust with institutional audit data.
"I want full control over everything"
Option 14 (SurrealDB + custom pipeline). Self-hosted, graph-native data model, total flexibility. High build effort. Worth it if you're building a platform, not a feature.
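For orientation, a rough sketch of what the SurrealDB side looks like, assuming the official surrealdb Python client (connection and signin details vary across SDK versions) and SurrealDB 2.x index syntax; the record IDs and embedding are placeholders:

```python
from surrealdb import Surreal

# Placeholder inputs: in a real pipeline these come from your ingestion
# and embedding steps.
pdf_id = "pdf:consolidated_q3"
sec_id = "section:balance_sheet"
query_embedding = [0.0] * 768

with Surreal("ws://localhost:8000/rpc") as db:
    db.signin({"username": "root", "password": "root"})
    db.use("audit", "extractions")

    # HNSW vector index over section embeddings.
    db.query(
        "DEFINE INDEX section_vec ON section "
        "FIELDS embedding HNSW DIMENSION 768 DIST COSINE;"
    )

    # Graph edge linking a section back to its source PDF: the kind of
    # relationship a plain vector store can't express.
    db.query("RELATE $pdf->contains->$sec;", {"pdf": pdf_id, "sec": sec_id})

    # K-nearest-neighbour retrieval via the <|k,ef|> operator.
    rows = db.query(
        "SELECT id, institution FROM section WHERE embedding <|5,40|> $q;",
        {"q": query_embedding},
    )
    print(rows)
```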
"Can I combine approaches?"
Yes, and you probably should.
- Option 2 + Option 9 (Claude second-pass): Run the Docling pipeline for primary extraction, then use Claude's direct PDF upload via Bedrock for targeted verification of flagged discrepancies. Claude is expensive ($1.50–3 per read) but excellent at single-document deep analysis. Use it surgically, not as your primary extraction engine.
- Option 2 + Option 5 (Gemini Flash first-pass): Use Gemini Flash via OpenRouter for cheap initial triage — classify documents, identify page ranges, flag anomalies — then feed the interesting ones into the full Docling pipeline. Cuts cost significantly on high-volume workflows (see the sketch below).
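The triage-then-escalate combo is ultimately just a dispatch loop. A sketch with hypothetical classify_with_flash() and run_docling_pipeline() helpers standing in for Options 5 and 2:

```python
def process_batch(pdf_paths: list[str]) -> list[dict]:
    """Cheap first pass on everything; expensive pipeline only where needed."""
    results = []
    for path in pdf_paths:
        # Hypothetical: Gemini Flash via OpenRouter returns doc type,
        # page ranges, and anomaly flags for a few cents.
        triage = classify_with_flash(path)
        if triage["anomalies"] or triage["doc_type"] == "consolidated":
            # Hypothetical: the full Docling + PageIndex extraction.
            results.append(run_docling_pipeline(path, pages=triage["page_ranges"]))
        else:
            results.append({"path": path, "status": "triage-only", **triage})
    return results
```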
What I Actually Learned
After weeks of testing, benchmarking, and occasionally cursing at my terminal, here's what stuck:
1. The "RAG everything" instinct is wrong for structured data. I came in thinking our vector-store approach just needed better chunking or a smarter embedding model. It didn't. The architecture itself was fundamentally mismatched for the problem. Sometimes the answer isn't "tune harder" — it's "choose a different approach."
2. Vision models changed the game. The moment I saw Gemini process a rendered PDF page and correctly extract a 47-row table with merged cells and hierarchical groupings, I realized how much time I'd wasted trying to make text-based extraction work on visually complex layouts.
3. Privacy constraints are actually a useful design forcing function. Being forced to keep PDFs local pushed me toward the Docling + PageIndex architecture, which turned out to be more accurate than the cloud alternatives for complex multi-entity documents. The constraint made the solution better.
4. Cost varies by 30x and it's not correlated with quality. Gemini native PDF at ~$0.50/run achieves perfect table integrity. Textract at $15/1K pages achieves "high." More expensive ≠ better. Do your own benchmarks (a minimal harness sketch follows this list).
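"Do your own benchmarks" can be as lightweight as a cell-level diff against hand-labeled ground truth. A sketch, assuming both sides are flat dicts keyed by (entity, line item, period):

```python
def cell_accuracy(extracted: dict, truth: dict, tol: float = 0.005) -> float:
    """Fraction of ground-truth cells the pipeline got right.

    Keys are (entity, line_item, period) tuples; values are floats.
    A cell counts as correct within `tol` relative error, since
    exact-match scoring unfairly penalizes rounding in the source PDF.
    """
    hits = 0
    for key, expected in truth.items():
        got = extracted.get(key)
        if got is not None and abs(got - expected) <= tol * abs(expected):
            hits += 1
    return hits / len(truth)

# One wrong-entity pull (the classic RAG failure) scores zero even though
# the number itself looks plausible.
truth = {("Subsidiary A", "Net Premium Written", "Q3 2024"): 1_247_893.0}
extracted = {("Subsidiary A", "Net Premium Written", "Q3 2024"): 1_198_402.0}
print(cell_accuracy(extracted, truth))  # 0.0
```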
The right architecture depends on your documents, your constraints, and your existing infrastructure. But if you're doing financial extraction with a vector-store RAG pipeline — please stop. I'm saying this with love.
Sources
- Docling — IBM Research document parsing library
- OpenRouter — unified LLM gateway API
- Pydantic AI — structured output framework for LLM agents
- Google Vertex AI Gemini API documentation
- Google AI Studio (Gemini Developer API) documentation
- Gemini 2.5 Flash model card and pricing
- OpenAI Responses API — File Search documentation
- Anthropic Claude PDF support documentation
- LlamaParse documentation and benchmarks
- Amazon Textract pricing and feature documentation
- Azure Document Intelligence documentation
- Unstructured.io documentation (OSS and SaaS)
- SurrealDB documentation — multi-model database
- PDF Data Extraction Benchmark 2025 (Procycons)
- PDF Table Extraction Showdown: Docling vs LlamaParse vs Unstructured (BoringBot)
- PageIndex: Beyond Vectors (TypeVar)
- Google Cloud compliance certifications
- Claude on Amazon Bedrock
- OpenRouter model directory
- PageIndex tree-structured document navigation
- Vertex AI FedRAMP High (Google Cloud Blog)
- OpenAI Assistants vs Responses API comparison (Ragwalla)