
Document Processing in Production: Why Every Platform Breaks at 60%

By Tammo Mueller


Quantiva has been building document processing systems for over a decade. Financial services, healthcare, insurance, regulatory compliance. We've shipped pipelines that extract structured data from documents that were never designed to be machine-readable.

We can't share client IP, but we can share the patterns. This series covers the hard parts: table extraction, LLM tradeoffs, layout models, document search, and building review loops that actually improve accuracy over time. This first post frames the problem and maps out what follows.

Why Document Processing Is Harder Than It Looks

Every vendor demo extracts data from a clean PDF and hits 95% accuracy. Then you deploy it on real documents and accuracy drops to 60%.

Take fund annual shareholder reports. The SEC mandates what information must be disclosed, but not how to format it. Every fund manager structures their Statement of Operations, Financial Highlights, and Notes differently. One fund puts the Statement of Operations on page 14 with twelve line items. Another buries it on page 43, combined with Changes in Net Assets into a single table that spans two pages. A third uses a completely different taxonomy for expense categories.

Multiply that by 8,000 registered investment companies, each with their own layout, and you have the real problem: extracting the same logical data points from documents that represent the same information in wildly different ways.

This isn't unique to fund documents. We see the same pattern in insurance claims, medical records, legal contracts, and regulatory filings. The document looks simple to a human. The extraction is brutal to automate.

The Demo-to-Production Gap

Independent benchmarks tell the story. Here's invoice extraction[1], one of the simpler document processing problems:

| Service | Field Accuracy | Line-Item Accuracy | Cost/1K Pages |
| --- | --- | --- | --- |
| AWS Textract | 78% | 82% | $101 |
| Google Document AI | 82% | 40% | $10 |
| Azure Document Intelligence | 93% | 87% | $10 |
| GPT-4o (image input) | 90.5% | 63% | $8.80 |

Those numbers are on invoices, and they reflect the models available at the time of the benchmark. Newer model versions will shift specific numbers, but the structural tradeoffs between speed, accuracy, and cost hold. On complex financial tables with merged cells, footnotes, and cross-page continuations, accuracy drops further. We've measured a 33-point gap between vendor benchmarks and production accuracy on fund annual reports.

| Document Quality | Vendor Benchmark | Production Accuracy |
| --- | --- | --- |
| Clean, printed text | 95-98% | 95-98% |
| Faxed / degraded scans | 90%+ | 76-81% |
| Complex multi-column layouts | 95%+ | 40-60% |
| Mixed-quality real-world intake | 95%+ | 15-25% degradation |

Source: compiled from Vellum[2] and AIMultiple[3] benchmark data.

A 2% character error rate per processing step compounds across multiple stages into 15-20% information extraction errors[2]. One in five documents requires human intervention. That's the production reality.
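The compounding claim is quick to verify: if each stage independently corrupts a character with 2% probability, the chance that a character survives all stages shrinks multiplicatively. A sketch, with stage counts chosen for illustration:

```python
# Per-step character error rate compounds multiplicatively across
# pipeline stages (assuming roughly independent errors per stage).
def compounded_error(per_step_error: float, stages: int) -> float:
    """Probability that a character is corrupted by at least one stage."""
    return 1 - (1 - per_step_error) ** stages

for n in (5, 8, 10):
    print(f"{n} stages at 2% CER -> {compounded_error(0.02, n):.1%} total error")
# 8-10 stages at 2% per step lands in the 15-18% range
```

A typical pipeline (preprocess, OCR, layout analysis, table detection, cell extraction, normalization, validation...) easily hits eight or more stages, which is how a "small" per-step error rate becomes one broken document in five.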

What Each Generation of Technology Solved (and Broke)

We've built systems using every major approach. Each solved something and hit its own wall.

Template matching works at 99.5% accuracy when you have one document layout that never changes. The moment the layout shifts, the system extracts garbage and nobody notices because confidence scores stay high. We counted 47 distinct table structures for the Statement of Operations across one client's fund intake. Template matching doesn't scale to layout variation.

Cloud OCR APIs added real ML behind extraction. But the benchmark numbers above tell the story: accuracy drops significantly on complex layouts, and line-item extraction (the data you actually need from financial tables) is where every service struggles.

Deep learning layout models like LayoutLMv3 understand documents more holistically, processing text, position, and visual features together. We've fine-tuned layout models to 94% field-level F1 on held-out test data. Real progress. But these models process pages independently. A table that continues on the next page, a footnote reference six pages later: single-page models lose that context.

Multimodal LLMs handle section detection and normalization better than any prior approach. They can find the Statement of Operations even when the header text varies between funds. But they hallucinate numbers. On financial documents where a misplaced decimal changes meaning by orders of magnitude, you cannot trust LLM-extracted data without a verification layer. And the economics span roughly two orders of magnitude: self-hosted OCR on a GPU runs at roughly $0.09 per thousand pages[4], while LLM-based extraction costs 50-100x more depending on the model.
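The cost gap is easier to feel at volume. A back-of-envelope sketch using the figures above (the monthly volumes and the 75x midpoint of the 50-100x range are illustrative assumptions):

```python
# Back-of-envelope cost comparison. $0.09/1K pages for self-hosted OCR
# comes from the post; the 75x LLM multiplier is the midpoint of the
# cited 50-100x range, and the volumes are hypothetical.
OCR_COST_PER_1K = 0.09
LLM_MULTIPLIER = 75

def monthly_cost(pages_per_month: int, cost_per_1k: float) -> float:
    """Dollar cost of processing a monthly page volume at a per-1K rate."""
    return pages_per_month / 1000 * cost_per_1k

for volume in (100_000, 1_000_000, 10_000_000):
    ocr = monthly_cost(volume, OCR_COST_PER_1K)
    llm = monthly_cost(volume, OCR_COST_PER_1K * LLM_MULTIPLIER)
    print(f"{volume:>11,} pages/mo: OCR ${ocr:,.2f} vs LLM ${llm:,.2f}")
```

At ten million pages a month that's roughly $900 versus $67,500. This is why "just send every page to an LLM" stops being an option at production volume, and why routing matters.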

The Hybrid Pipeline

No single technology handles the full problem. The production systems that work combine approaches:

[Figure: hybrid document processing pipeline — classification, extraction routing, confidence calibration, and validation stages]

  1. Classify first. Route documents to the right extraction path. A lightweight classifier determines whether a page contains a Statement of Operations, Financial Highlights, or narrative prose. Get this wrong and everything downstream fails.

  2. Rules where the problem is deterministic. Template-based extraction for standardized forms where the layout is mandated by regulation. Rules are not legacy technology.

  3. ML where layouts vary. Fine-tuned layout models for high-volume document types. LLM-based extraction for the long tail of formats you haven't seen before.

  4. Calibrated confidence routing. Neural networks are systematically overconfident. A model reporting 90% confidence might be correct only 70% of the time. Temperature scaling calibrates the scores. Low-confidence extractions go to human review.

  5. Validation against structured data. For SEC filings, XBRL provides ground truth. Extract from the PDF, cross-reference against XBRL tags, flag discrepancies.
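Step 4 is the one teams most often skip. A minimal sketch of temperature scaling plus threshold routing — the temperature value and threshold here are illustrative; in practice T is fit on a held-out validation set by minimizing negative log-likelihood:

```python
import numpy as np

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax over logits / T. T > 1 softens overconfident scores;
    T is normally fit on held-out validation data."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(logits: np.ndarray, T: float, threshold: float = 0.9) -> str:
    """Auto-accept only if calibrated confidence clears the threshold;
    everything else goes to the human review queue."""
    conf = temperature_scale(logits, T).max()
    return "auto-accept" if conf >= threshold else "human-review"

# Raw softmax on these logits reports ~94% confidence; after scaling
# with an illustrative T=2 it drops to ~74% and gets routed to review.
logits = np.array([3.0, 0.0, -1.0])
print(route(logits, T=1.0))  # prints "auto-accept"
print(route(logits, T=2.0))  # prints "human-review"
```

The same extraction, the same model output — only the calibration changed, and with it the routing decision. That is exactly the overconfidence gap described above: an uncalibrated 90%+ score is not a 90%+ probability of being correct.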

A platform that extracts 95% of fields correctly but provides no tooling for the remaining 5% creates more work than one that extracts 90% and routes exceptions to reviewers with full context. The review loop is not a failure of the system. It is the system.
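Step 5's cross-referencing can be sketched in a few lines. The field names and the 0.5% tolerance below are illustrative assumptions, not a real filing schema:

```python
# Minimal sketch of validating PDF-extracted values against XBRL facts.
# Field names and the relative tolerance are hypothetical.
def validate_against_xbrl(extracted: dict, xbrl_facts: dict,
                          rel_tol: float = 0.005) -> list:
    """Return (field, extracted_value, reference) tuples for every
    discrepancy that exceeds the relative tolerance."""
    flags = []
    for field, value in extracted.items():
        reference = xbrl_facts.get(field)
        if reference is None:
            flags.append((field, value, "missing-in-xbrl"))
        elif abs(value - reference) > rel_tol * abs(reference):
            flags.append((field, value, reference))
    return flags

extracted = {"TotalExpenses": 1_234_500.0, "NetAssets": 98_700_000.0}
xbrl = {"TotalExpenses": 1_234_500.0, "NetAssets": 9_870_000.0}
print(validate_against_xbrl(extracted, xbrl))
# NetAssets is off by 10x -> flagged as a likely misplaced decimal
```

Flagged discrepancies are exactly the exceptions that should land in the review queue with full context: the extracted value, the XBRL reference, and the source page.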

What This Series Covers

Seven posts, each on a specific hard problem:

  1. This post. The landscape, the demo-to-production gap, and why hybrid pipelines win.
  2. Table extraction. TATR, CascadeTabNet, merged cells, borderless tables, cross-page continuations.
  3. LLMs vs. traditional pipelines. Head-to-head benchmarks with real cost, latency, and accuracy numbers.
  4. SEC EDGAR and XBRL parsing. 30 years of format evolution, working Python code for extracting standardized financials.
  5. Human-in-the-loop and active learning. Confidence calibration, acquisition functions, building review loops that improve the model.
  6. LayoutLMv3, OCR preprocessing, and document understanding. Architecture, fine-tuning, preprocessing that moved OCR accuracy from 89% to 97%.
  7. Document search and RAG. Chunking strategies, hybrid retrieval, and why naive embeddings fail on document collections.

Every post includes architecture diagrams, benchmark tables, and code where it proves a point.

Frequently Asked Questions

What is intelligent document processing and how does it differ from OCR?

OCR converts images of text into machine-readable characters. Intelligent document processing classifies documents, understands layout structure, extracts specific data fields, validates results against business rules, and routes exceptions for human review. OCR is one component in an IDP pipeline, not a synonym for it.

Can LLMs replace traditional document processing pipelines?

For variable-layout documents at moderate volumes, yes. For high-volume processing or strict latency requirements, traditional OCR is orders of magnitude cheaper per page[4]. Most production systems use both: LLMs for the long tail of formats, fine-tuned models for high-volume document types, and rules for anything with a fixed layout.

What accuracy should I expect from off-the-shelf document extraction?

On clean, single-layout documents: 95%+. On complex financial tables with merged cells, footnotes, and cross-page continuations: 40-60% from off-the-shelf services. Fine-tuned models and hybrid pipelines push this to 90%+ for known document types, with human review handling the rest.

Why do neural networks need confidence calibration for document processing?

Neural networks are systematically overconfident. A model reporting 90% confidence on an extraction might be correct only 70% of the time. Calibration techniques like temperature scaling and Platt scaling correct the probability estimates so you can set meaningful thresholds for routing low-confidence extractions to human review.

References

  1. Businessware: AI Services for Automatic Invoice Processing Benchmark

  2. Vellum: Document Data Extraction, LLMs vs OCRs

  3. AIMultiple: OCR Accuracy Benchmark

  4. Berkeley: Templatized Document Extraction Benchmark

Document AI · Intelligent Document Processing · OCR · PDF Extraction · Machine Learning