
Document Intelligence: Architecture Options Compared for PDF Financial Data Extraction

I tested 14 different approaches to extract structured financial data from PDFs. Most of them were bad. Here's what actually works — and what absolutely doesn't.

By Adi Gupta · December 22, 2025

AI · PDF · Architecture · Financial Data · LLM

I Spent Weeks Comparing PDF Extraction Architectures So You Don't Have To

Here's the thing about extracting financial data from PDFs: everyone thinks it's a solved problem until they actually try it.

I was building an AI-powered financial data extraction pipeline — the kind where you're pulling structured numbers from multi-entity consolidated statements, cross-referencing against Excel files, and doing it all under NDA. Real institutional audit workflows. The kind of thing where getting a number wrong isn't "oops", it's a compliance incident.

Our initial approach? Gemini API's File Search with Corpora RAG. Seemed reasonable. Vector store, semantic search, structured extraction. Should work, right?

It was terrible. Not "needs-some-tuning" terrible. Fundamentally, architecturally wrong terrible.

The vector store chunking was destroying table structure. Merged cells, multi-column layouts, hierarchical row groupings — all of it shredded into meaningless text fragments. We were getting numbers back that looked plausible but were pulled from the wrong rows, the wrong columns, sometimes the wrong page entirely.

So I did what any reasonable person would do: I tested 14 different approaches and built a comparison matrix. Because apparently that's who I am now.


The Big Table (a.k.a. The Reason This Article Exists)

| # | Option | Platform | Storage | How It Reads | Table Integrity | Cost | Privacy | Latency | Best For |
|---|--------|----------|---------|--------------|-----------------|------|---------|---------|----------|
| 1 | Local OCR + LLM | Docling + OpenRouter | Local only | Deterministic OCR → Markdown → LLM | High (97.9%) | ~$3/M tokens | Highest — PDF local | ~2 min/PDF | Quick local dev |
| 2 | Local OCR + Tree Index + LLM | Docling + PageIndex + Pydantic AI | Local; cached tree JSONs | Docling OCR + PageIndex targets ~12K chars to LLM | High | $10–15/run | Highest — PDF local | 30–45 min first; 5–10 min cached | Complex multi-institution PDFs |
| 3 | Local Tree Index + Vision LLM | PageIndex + Vision via OpenRouter | Local only | Agentic tree nav → page images to Vision LLM | Perfect | $0.005–0.05/page | Highest — PDF local | Slow (vision API) | Complex visual layouts, charts |
| 4 | Google AI Studio — Native PDF | Gemini Developer API | Google temp (48h auto-delete) | Native vision — unchunked PDF in 1M context window | Perfect | Lowest (~$0.15/M tokens, cacheable) | Low/Medium | Fast — single call | Rapid prototyping |
| 5 | Gemini Flash via OpenRouter | Gemini 2.5 Flash via OpenRouter | No storage | Local text extraction → Gemini Flash | High | Low ($0.15/M + caching) | Medium | Fast | Budget-friendly production |
| 6 | Google Vertex AI — Native PDF | Vertex AI on GCP | Private GCS bucket | Native vision — same engine as #4, enterprise infra | Perfect | Lowest | Highest (SOC/HIPAA/FedRAMP) | Fast — enterprise SLAs | Production enterprise |
| 7 | Gemini API — File Search (Corpora RAG) | Gemini API File Search | Google Corpora | Managed RAG — chops PDF into chunks → vector DB | Terrible | Low | Low/Medium | Medium | General Q&A — NOT financial extraction |
| 8 | OpenAI Responses API — File Search | OpenAI Responses API | OpenAI Vector Store | 800-token chunked RAG | Terrible | Variable | Medium (enterprise tiers) | Slow — multi-turn loops | Conversational Q&A — NOT batch validation |
| 9 | Claude — Direct PDF Upload | Claude API / Bedrock / Vertex AI | Ephemeral (per-call) | Native vision — ~100 pages / 30MB limit | High | High ($3/M input) | Medium–Highest (Bedrock/Vertex) | Medium | Single-doc analysis, under 100 pages |
| 10 | LlamaParse + LLM | LlamaParse | LlamaIndex Cloud (48h cache) | Cloud OCR + layout analysis (~6s/doc) | Medium (struggles with currency/footnotes) | Low ($0.003–0.09/page) | Low — PDF uploaded | Fast parsing | Quick cloud OCR |
| 11 | Amazon Textract + Bedrock | Textract + Bedrock | Private S3 | AnalyzeDocument (TABLES) — cell relationship mapping | High | Medium ($15/1K pages) | Highest (GovCloud) | Medium | AWS-native orgs |
| 12 | Azure Doc Intelligence + Azure OpenAI | Azure AI + Azure OpenAI | Private Azure Blob | Layout model ($1.50/1K pages); custom training | High | Medium | Highest (Azure Gov) | Medium | Microsoft-native orgs |
| 13 | Unstructured.io + LLM | Unstructured (OSS/SaaS) | Local or Cloud | Partitioning + chunking strategies | Medium | Low ($0.03/page SaaS) | High (self-hosted) | Medium | ETL pipelines |
| 14 | Self-hosted RAG with SurrealDB | SurrealDB + HNSW vector search + LangChain / LlamaIndex | Self-hosted or Cloud | Custom RAG — you control chunking, graph relationships | Depends on chunking | Low (self-hosted free; embedding costs) | Highest (self-hosted) | Medium — setup required | Full control, graph relationships |

Let me save you some time on the two most important rows:

Option 6 (Vertex AI with native PDF) is highlighted for a reason. Perfect table integrity. Lowest cost. Highest compliance posture. If you're building for enterprise and you're on GCP, stop reading and go implement this.

Options 7 and 8 are highlighted for a very different reason. They are terrible for financial extraction. Not mediocre. Not "fine for most cases." Terrible. And I say this as someone who burned real time and money discovering it the hard way.


Why Vector-Store RAG Fails for Financial Tables

This deserves its own section because it's the most counterintuitive finding — and the one most likely to waste your time if you're coming from a general RAG background.

Vector-store RAG works brilliantly for Q&A over prose documents. It is the wrong architecture for structured financial data.

Here's what happens: the chunking step — the very foundation of the RAG pipeline — tears apart the spatial relationships that give financial tables their meaning. A cell that says "$1,247,893" is meaningless without knowing it belongs to "Net Premium Written" for "Q3 2024" under "Subsidiary A." That context is spatial, not semantic. It lives in the row headers, column headers, and page layout — all of which get destroyed when you chunk text into 512-token windows and embed them into a vector space.

The model then retrieves "relevant" chunks, confidently assembles an answer, and gives you a number that looks right but is from the wrong entity on a different page. If you're lucky, the number is obviously wrong. If you're unlucky (and in financial data, you usually are), it's close enough to pass a cursory check.
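The failure is easy to demonstrate. Here's a self-contained toy (entity names and figures invented) that flattens a statement to text and cuts it into fixed windows, standing in for token-based RAG chunking: the retrieved chunk contains the number but not the header that tells you whose number it is.

```python
# Toy consolidated statement flattened to text, as a naive parser might emit it.
# All names and figures are invented for illustration.
statement = (
    "Subsidiary A: Consolidated Statement, Q3 2024\n"
    + "\n".join(f"Operating expense line {i} | ${100_000 + i:,}" for i in range(12))
    + "\nNet Premium Written | $1,247,893\n"
)

CHUNK_CHARS = 200  # stand-in for a ~512-token RAG chunk window
chunks = [statement[i:i + CHUNK_CHARS] for i in range(0, len(statement), CHUNK_CHARS)]

# "Retrieve" the chunk that matches a query about Net Premium Written.
hit = next(c for c in chunks if "$1,247,893" in c)

print("$1,247,893" in hit)    # True: the retrieved chunk has the number
print("Subsidiary A" in hit)  # False: the entity header lives in an earlier chunk
```

No embedding model fixes this, because the context was destroyed before anything was embedded.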

This is why Options 4, 6, and 9 work so much better — they process the rendered page as an image. The model "sees" the table as a human would. No chunking, no spatial destruction.


The One I'd Actually Recommend Building With: Option 2

If you're doing what I'm doing — multi-entity consolidated statements, NDA-sensitive documents, cross-validation against source spreadsheets — Option 2 is the move.

How It Works

  1. Docling (IBM Research's document parser) handles OCR locally. No PDF ever leaves your machine. It converts complex layouts into structured markdown with table preservation.
  2. PageIndex builds a tree-structured index — not a vector store. This preserves the page-level and section-level hierarchy that financial documents rely on. The LLM sees only the relevant ~12K characters, not the whole 170-page PDF.
  3. Pydantic AI orchestrates the LLM calls through OpenRouter, enforcing strict output schemas. You define exactly what the extracted data should look like, and the model fills it in.
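The step-3 contract can be sketched with a plain dataclass standing in for the Pydantic model you would hand to Pydantic AI as the agent's output type. Field names here are hypothetical, not the pipeline's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedLineItem:
    entity: str        # e.g. "Subsidiary A"
    metric: str        # e.g. "Net Premium Written"
    period: str        # e.g. "Q3 2024"
    value_usd: float
    source_page: int   # where the number was read, kept for the audit trail

def parse_line_item(raw: dict) -> ExtractedLineItem:
    """Coerce and validate one model-returned record; raise on anything off-schema."""
    item = ExtractedLineItem(
        entity=str(raw["entity"]),
        metric=str(raw["metric"]),
        period=str(raw["period"]),
        value_usd=float(raw["value_usd"]),   # rejects non-numeric strings
        source_page=int(raw["source_page"]),
    )
    if item.source_page < 1:
        raise ValueError("source_page must be 1-based")
    return item
```

The point of the strict schema is that a malformed extraction fails loudly at parse time instead of sliding into a spreadsheet unnoticed.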

The Numbers

| Dimension | Rating | Detail |
|-----------|--------|--------|
| Table Integrity | STRONG | 97.9% on internal benchmark suite — deterministic OCR preserves tables before text conversion |
| Navigation Accuracy | STRONG | 98.7% on FinanceBench page-level retrieval — multi-signal institution name resolver (vision + text + page continuity) |
| Cost | MODERATE | ~$2–4 tree building (GPT-4o) + ~$5–8 extraction (Claude Sonnet 4 × N institutions) + ~$0.50 resolution (Gemini 2.5 Flash). Total ~$10–15/full run |
| Privacy | HIGHEST | PDFs never leave local — only extracted text hits LLM APIs |
| Latency (first run) | SLOW | 30–45 min (Docling + PageIndex ETL in parallel + full extraction) |
| Latency (cached re-run) | FAST | 5–10 min (index cached, only LLM calls — re-runs skip tree cost entirely) |

That first-run latency is rough; I won't sugarcoat it. 30–45 minutes for a full run is a lot of staring at progress bars. But the cache changes everything: subsequent runs with the same documents drop to 5–10 minutes, which is entirely workable for iterative development.

The key insight: tree-structured indexing preserves the spatial hierarchy that vector stores destroy. It's slower, it's more complex to set up, but for financial data — it's the difference between 97.9% accuracy and garbage.
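The navigation idea fits in a few lines. A tree index is a hierarchy of sections with page ranges; you walk it to the node you need and send only that node's text to the model. This is a toy stand-in for what PageIndex does, not its actual API:

```python
# Toy tree index: nested sections with page ranges and raw text (values invented).
tree = {
    "title": "Annual Report",
    "children": [
        {"title": "Subsidiary A: Income Statement", "pages": (12, 18),
         "text": "Net Premium Written | $1,247,893 ..."},
        {"title": "Subsidiary B: Income Statement", "pages": (19, 25),
         "text": "Net Premium Written | $1,198,440 ..."},
    ],
}

def locate(node, entity):
    """Walk the tree and return the section whose title mentions the entity."""
    if entity in node.get("title", ""):
        return node
    for child in node.get("children", []):
        found = locate(child, entity)
        if found:
            return found
    return None

section = locate(tree, "Subsidiary B")
# Only this section's ~12K-char slice is sent to the LLM; the page
# range travels with it for the audit trail.
print(section["pages"])  # (19, 25)
```

Contrast this with a vector store, where "Subsidiary B's income statement" is not an addressable unit at all, just whatever chunks happen to score highest.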

What Other Options Would Add

| Comparison | Benefit | Trade-off |
|------------|---------|-----------|
| Option 4/6 instead | Eliminates ETL entirely. Native vision = zero OCR errors. 90% caching discount. | Lose local PDF privacy (uploaded to Google). Lose fine-grained section navigation — whole PDF in context; LLM may pick wrong section for similarly-named entities. |
| Option 14 instead | Graph relationships between institutions, PDFs, and extractions. Persistent queryable index. Custom chunking control. | Significant setup overhead. Overkill for quarterly batch runs unless scaling to many clients. |

Platform Access: Who's Available Where

This table matters more than you'd think — especially when your compliance team tells you "it has to run on [specific cloud]."

| Model Family | Direct API | OpenRouter | AWS Bedrock | Google Vertex AI | Azure |
|--------------|------------|------------|-------------|------------------|-------|
| Gemini (Google) | AI Studio | ✓ | | ✓ | |
| GPT-4o / GPT-4.1 (OpenAI) | OpenAI API | ✓ | | | Azure OpenAI |
| Claude (Anthropic) | Anthropic API | ✓ | ✓ | ✓ | |
| Llama (Meta) | Meta | ✓ | ✓ | ✓ | Azure AI |
| Mistral | La Plateforme | ✓ | ✓ | ✓ | ✓ |

Notice that Claude is the most available across enterprise platforms (Bedrock + Vertex AI), while Gemini is locked to Google's ecosystem. This has real implications for multi-cloud strategies and vendor diversification.


Decision Framework (The Opinionated Version)

"I need this working yesterday"

Options 4 or 6 (Gemini native PDF). Upload the PDF, get structured data back. Perfect table integrity. Under a dollar per run. The catch: your PDFs go to Google. If that's fine, this is the fastest path to a working demo.
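Part of why this path is so fast: the whole, unsplit PDF rides along in one generateContent request. A sketch of the request body, following the Gemini REST API's inline-data shape (the PDF bytes here are a placeholder, and the prompt is hypothetical):

```python
import base64
import json

# Placeholder bytes; in practice: open("statement.pdf", "rb").read()
pdf_bytes = b"%PDF-1.7 ..."

# One request, no chunking: the PDF travels as inline base64 next to the prompt.
payload = {
    "contents": [{
        "parts": [
            {"inline_data": {
                "mime_type": "application/pdf",
                "data": base64.b64encode(pdf_bytes).decode("ascii"),
            }},
            {"text": "Extract every line item from the consolidated statement "
                     "as JSON: entity, metric, period, value, source page."},
        ],
    }],
}

# POST this to the model's :generateContent endpoint with your API key.
print(json.dumps(payload)[:80])
```

The model sees rendered pages, so there is no chunking step to shred the tables in the first place.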

"NDAs are involved and the data is sensitive"

Option 2 (Docling + PageIndex + OpenRouter). PDFs stay local. Only extracted text (not the original documents) hits LLM APIs. This is the architecture I'd trust with institutional audit data.

"I want full control over everything"

Option 14 (SurrealDB + custom pipeline). Self-hosted, graph-native data model, total flexibility. High build effort. Worth it if you're building a platform, not a feature.

"Can I combine approaches?"

Yes, and you probably should.

  • Option 2 + Option 9 (Claude second-pass): Run the Docling pipeline for primary extraction, then use Claude's direct PDF upload via Bedrock for targeted verification of flagged discrepancies. Claude is expensive ($1.50–3 per read) but excellent at single-document deep analysis. Use it surgically, not as your primary extraction engine.
  • Option 2 + Option 5 (Gemini Flash first-pass): Use Gemini Flash via OpenRouter for cheap initial triage — classify documents, identify page ranges, flag anomalies — then feed the interesting ones into the full Docling pipeline. Cuts cost significantly on high-volume workflows.
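That second combination reduces to a cheap routing function. Here's a hypothetical stand-in where a keyword check plays the role of the Gemini Flash triage call:

```python
def triage(first_pass_summary: str) -> str:
    """Route a document: cheap fast path, or the full Docling + PageIndex run.

    In the real pipeline this decision comes from a Gemini Flash call via
    OpenRouter; the keyword check below is a deterministic stand-in.
    """
    escalation_signals = ("consolidated", "restated", "multi-entity", "discrepancy")
    summary = first_pass_summary.lower()
    if any(signal in summary for signal in escalation_signals):
        return "full_pipeline"  # expensive, high-accuracy extraction
    return "fast_path"          # cheap Flash extraction is good enough

print(triage("Single-entity quarterly statement, standard layout"))        # fast_path
print(triage("Restated consolidated statement across three subsidiaries")) # full_pipeline
```

The economics work because most documents in a batch are boring; you pay the expensive pipeline's cost only for the ones that aren't.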

What I Actually Learned

After weeks of testing, benchmarking, and occasionally cursing at my terminal, here's what stuck:

1. The "RAG everything" instinct is wrong for structured data. I came in thinking our vector-store approach just needed better chunking or a smarter embedding model. It didn't. The architecture itself was fundamentally mismatched for the problem. Sometimes the answer isn't "tune harder" — it's "choose a different approach."

2. Vision models changed the game. The moment I saw Gemini process a rendered PDF page and correctly extract a 47-row table with merged cells and hierarchical groupings, I realized how much time I'd wasted trying to make text-based extraction work on visually complex layouts.

3. Privacy constraints are actually a useful design forcing function. Being forced to keep PDFs local pushed me toward the Docling + PageIndex architecture, which turned out to be more accurate than the cloud alternatives for complex multi-entity documents. The constraint made the solution better.

4. Cost varies by 30x and it's not correlated with quality. Gemini native PDF at ~$0.50/run achieves perfect table integrity. Textract at $15/1K pages achieves "high." More expensive ≠ better. Do your own benchmarks.

The right architecture depends on your documents, your constraints, and your existing infrastructure. But if you're doing financial extraction with a vector-store RAG pipeline — please stop. I'm saying this with love.


Sources

  1. Docling — IBM Research document parsing library
  2. OpenRouter — unified LLM gateway API
  3. Pydantic AI — structured output framework for LLM agents
  4. Google Vertex AI Gemini API documentation
  5. Google AI Studio (Gemini Developer API) documentation
  6. Gemini 2.5 Flash model card and pricing
  7. OpenAI Responses API — File Search documentation
  8. Anthropic Claude PDF support documentation
  9. LlamaParse documentation and benchmarks
  10. Amazon Textract pricing and feature documentation
  11. Azure Document Intelligence documentation
  12. Unstructured.io documentation (OSS and SaaS)
  13. SurrealDB documentation — multi-model database
  14. PDF Data Extraction Benchmark 2025 (Procycons)
  15. PDF Table Extraction Showdown: Docling vs LlamaParse vs Unstructured (BoringBot)
  16. PageIndex: Beyond Vectors (TypeVar)
  17. Google Cloud compliance certifications
  18. Claude on Amazon Bedrock
  19. OpenRouter model directory
  20. PageIndex tree-structured document navigation
  21. Vertex AI FedRAMP High (Google Cloud Blog)
  22. OpenAI Assistants vs Responses API comparison (Ragwalla)