Why Single Accuracy Scores Mislead in Document Extraction
At Hirize, we’ve spent years building and stress-testing document extraction systems across hundreds of millions of pages. One pattern is consistent across every benchmark we run: document extraction is not a single-metric problem. It’s a multi-dimensional engineering challenge where simple averages hide catastrophic failure modes.
This post is a field guide for engineers who have lived through the grind of parsing messy PDFs, spreadsheets, scanned faxes, and multi-column filings. We’ll break down a rigorous evaluation methodology that goes beyond surface-level accuracy scores.
The Problem With Single Accuracy Scores
Most document AI discussions rely on a single number—usually accuracy or F1 on a small corpus of clean PDFs. That’s comfortable for comparison, but it ignores four independent axes of failure that each make extraction outputs unusable in production:
- Character recognition errors at the glyph level
- Layout interpretation errors including reading order and section boundaries
- Structure loss in tables, lists, outlines, and nested hierarchies
- Semantic corruption from model normalization, substitution, or hallucination
If any one of these fails, the entire document pipeline fails.
A page can have perfect OCR but broken reading order that renders text incoherent to any LLM in a retrieval system. A table can have correct wire detection but incorrect structure that destroys header-to-data relationships. A contract can have accurate paragraphs but orphaned cross-references that make it impossible to answer basic questions.
This is why Hirize evaluates document extraction along multiple orthogonal axes—not just a single accuracy score.
How Hirize Approaches Document Extraction Evaluation
Our document intelligence evaluation framework scores extraction models along orthogonal axes, with dataset slices and decision rules per axis. The goal is to measure not only average performance, but variance under real operating conditions.
We evaluate extraction at multiple nodes in the pipeline. For example, Hirize computes text similarity per region immediately after OCR and again after structure mapping to catch ordering or structure regressions.
Reading Order: The Silent Failure Mode
Reading order breaks whenever models process pages in raster order. In enterprise documents, page breaks are physical artifacts, not logical boundaries: sections and tables flow straight across them. The result is semantic fragmentation unless the pipeline reconstructs continuity explicitly.
Hirize uses a four-stage reading order subsystem:
1. Pre-segmentation identifies blocks, figures, tables, headers, and footers. We learn a coarse taxonomy of block types to feed different downstream logic.
2. Column graph construction converts spatial relations into a directed acyclic graph encoding containment and adjacency. This prevents naive left-to-right ordering from crossing columns or mixing headers/footers with main content.
3. Cross-page linking carries context—section titles, table headers, figure references—across page boundaries. We maintain state for open sections and tables, similar to a streaming parser.
4. Semantic stitching merges block sequences into paragraphs and sections, preserving hierarchy and reference links.
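The column-graph stage above can be sketched with the standard library's `graphlib`. This is a deliberately simplified illustration, not Hirize's implementation: the `Block` record (an ID, a column index from pre-segmentation, and a vertical position) is a hypothetical schema, and real column graphs encode richer containment and adjacency relations.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Hypothetical block record; field names are illustrative only.
@dataclass(frozen=True)
class Block:
    id: str
    column: int   # column index assigned by pre-segmentation
    y: float      # top coordinate of the block

def reading_order(blocks):
    """Order blocks via a small precedence DAG: within a column,
    higher blocks precede lower ones; columns are read left to right.
    A topological sort of the DAG yields the reading order."""
    ts = TopologicalSorter()
    by_col = {}
    for b in blocks:
        by_col.setdefault(b.column, []).append(b)
    cols = sorted(by_col)
    for ci in cols:
        col = sorted(by_col[ci], key=lambda b: b.y)
        ts.add(col[0].id)                 # make sure the node exists
        for a, b in zip(col, col[1:]):    # top-to-bottom edges
            ts.add(b.id, a.id)            # b comes after a
    for c1, c2 in zip(cols, cols[1:]):    # column-to-column edges
        last = max(by_col[c1], key=lambda b: b.y)
        first = min(by_col[c2], key=lambda b: b.y)
        ts.add(first.id, last.id)
    return list(ts.static_order())
```

Because the precedence constraints live in a DAG rather than in a left-to-right scan, a sidebar or footer block can be pinned after the main flow without corrupting column order.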
Reading Order Accuracy Metric
We evaluate reading order as a minimum edit distance over block sequences. Let π* be the ground-truth ordering and π̂ the predicted ordering:
ROAcc = 1 - EditDistance(π̂, π*) / max(|π̂|, |π*|)
We also report Kendall tau rank correlation to capture pairwise inversions.
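Both metrics can be sketched directly from the definitions above, assuming orderings arrive as lists of block IDs (the function names are illustrative, not Hirize's API):

```python
def edit_distance(a, b):
    """Levenshtein distance over block-ID sequences (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ro_acc(pred, truth):
    """ROAcc = 1 - EditDistance / max(|pred|, |truth|)."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))

def kendall_tau(pred, truth):
    """Kendall tau over blocks present in both orderings,
    counting pairwise inversions against the ground truth."""
    rank = {b: i for i, b in enumerate(truth)}
    seq = [rank[b] for b in pred if b in rank]
    n = len(seq)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i] < seq[j]:
                concordant += 1
            elif seq[i] > seq[j]:
                discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / pairs if pairs else 1.0
```

A single swapped pair of adjacent blocks costs two edits under ROAcc but only one inversion under tau, which is why reporting both is informative.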
Hirize slices ROAcc by layout pattern: single column, two column, multi-column with sidebars, and newspaper-style with interstitial artifacts. We report separate scores for cross-page continuity by computing the metric only on edges that cross page boundaries.
Continuity Checks
Hirize runs explicit tests for continuity failure modes:
- Header carry-forward: Verify that table headers detected on page k associate with rows on pages k+1..k+m until a terminating header appears
- Definition resolution: Verify that references like “Section 2.1” link to the correct section token across page boundaries
- Footnote binding: Check that superscript markers link to footnote bodies
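The header carry-forward check, for example, reduces to a small state machine over page-ordered table fragments. The fragment dict below is a hypothetical shape chosen for illustration, not Hirize's internal format:

```python
def check_header_carry_forward(fragments):
    """Each fragment is a hypothetical dict:
    {"page": int, "header": list[str] | None, "rows": list[list[str]]}.
    A header seen on page k governs rows on pages k+1..k+m until a new
    header terminates it. Returns (bound, orphans): rows bound to a
    carried-forward header vs. rows seen before any header at all."""
    active_header = None
    bound, orphans = [], []
    for frag in sorted(fragments, key=lambda f: f["page"]):
        if frag["header"] is not None:      # new header terminates the old one
            active_header = tuple(frag["header"])
        for row in frag["rows"]:
            if active_header is None:
                orphans.append((frag["page"], row))
            else:
                bound.append((active_header, frag["page"], row))
    return bound, orphans
```

In evaluation, the `orphans` list becomes a direct count of continuity failures for the header carry-forward test.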
Table Extraction: The Hardest Problem in Document AI
Tables fail in ways that don’t show up in simple text metrics like CER or WER. Correctly detected grid lines (“wires”) are useless without the cell structure they imply. Hirize treats table evaluation as both a tree-structure comparison and a set of alignment constraints.
Structure Metric: TEDS
We compute Tree Edit Distance-based Similarity (TEDS) between predicted and ground truth table trees. Trees are derived from HTML or Markdown with explicit rowspan and colspan markers.
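A compact sketch of the idea: represent each table as an ordered tree of nested tuples and compute TEDS = 1 − TED / max(|T₁|, |T₂|) with a memoized forest edit distance (the classic Zhang–Shasha decomposition). For simplicity this sketch uses unit insert/delete/relabel costs; the published TEDS metric additionally uses a normalized Levenshtein cost on cell text for relabeling, and a real pipeline would build the trees from HTML with rowspan/colspan markers rather than by hand.

```python
from functools import lru_cache

# Ordered trees as nested tuples: (tag, (child, child, ...)).
def size(tree):
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def forest_dist(F, G):
    """Ordered forest edit distance, unit costs, memoized.
    F and G are tuples of trees; we recurse on rightmost roots."""
    if not F and not G:
        return 0
    if not F:
        return sum(size(t) for t in G)
    if not G:
        return sum(size(t) for t in F)
    t1, t2 = F[-1], G[-1]
    return min(
        forest_dist(F[:-1] + t1[1], G) + 1,   # delete rightmost root of F
        forest_dist(F, G[:-1] + t2[1]) + 1,   # insert rightmost root of G
        forest_dist(F[:-1], G[:-1])           # match the two roots
        + forest_dist(t1[1], t2[1])
        + (t1[0] != t2[0]),                   # relabel cost
    )

def teds(pred, truth):
    """Tree Edit Distance-based Similarity between two table trees."""
    d = forest_dist((pred,), (truth,))
    return 1.0 - d / max(size(pred), size(truth))
```

On a 2×2 table, mislabeling a single `td` as `th` costs one relabel out of seven nodes, so TEDS drops to 6/7 rather than collapsing to zero.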
Alignment Metrics
Hirize adds alignment tests reflecting how downstream systems actually use tables:
- Header-to-column association accuracy across multi-row headers
- Column alignment accuracy for numeric vs. textual columns
- Merged cell preservation rate for rowspan/colspan recovery
- Row grouping consistency for hierarchical row headers
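The first of these is simple to state precisely. A minimal sketch, assuming header associations are represented as a mapping from column index to the full header path (one entry per header row, to cover multi-row headers); the representation is illustrative, not Hirize's schema:

```python
def header_association_accuracy(pred, truth):
    """pred and truth map column index -> header path, where a path
    is a tuple of header cells from top header row to bottom.
    Accuracy is the fraction of ground-truth columns whose complete
    path is recovered exactly; partial path matches score zero."""
    if not truth:
        return 1.0
    hits = sum(1 for col, path in truth.items() if pred.get(col) == path)
    return hits / len(truth)
```

Requiring the full path to match is deliberate: under a multi-row header like ("Revenue", "2023") vs. ("Revenue", "2024"), matching only the top row would hide exactly the column-shift errors this metric exists to catch.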
Table structure brittleness is the top source of silent data corruption in document extraction. Numeric columns that shift by one position pass text similarity checks yet break analytics pipelines. Alignment and header association metrics catch this failure class.
Text and Layout Metrics
Hirize computes ANLS (Average Normalized Levenshtein Similarity) at the region level rather than page level. Regions are blocks produced by layout segmentation. This prevents high ANLS from masking incorrect ordering or boundary errors.
We also compute Section Boundary Detection F1 from labeled boundaries—headings, list starts, table starts, figure captions, and appendix markers.
Per-region ANLS prevents layout errors from hiding under high page averages. The region view correlates with human review speed and reduces approval time in QA loops.
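Both region-level metrics can be sketched from their definitions, assuming each region arrives as a (predicted, ground-truth) text pair and section boundaries as sets of block indices. The τ = 0.5 cutoff follows the common ANLS convention of zeroing out low-similarity matches; the exact threshold Hirize uses is not specified here.

```python
def lev(a, b):
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def region_anls(pairs, tau=0.5):
    """ANLS averaged over layout regions rather than whole pages.
    Similarities below tau are zeroed, per the standard convention."""
    if not pairs:
        return 1.0
    scores = []
    for p, t in pairs:
        denom = max(len(p), len(t)) or 1
        s = 1.0 - lev(p, t) / denom
        scores.append(s if s >= tau else 0.0)
    return sum(scores) / len(scores)

def boundary_f1(pred, truth):
    """F1 over predicted vs. labeled boundary positions."""
    if not pred or not truth:
        return 1.0 if pred == truth else 0.0
    tp = len(pred & truth)
    if not tp:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Note what the region view buys: a page where one sidebar region is garbage scores poorly on that region, instead of being diluted into a comfortable page-level average.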
Statistical Rigor in Document Extraction Evaluation
Hirize reports uncertainty and significance explicitly:
- Each metric includes a 95% confidence interval computed with non-parametric bootstrap over documents
- For comparing systems A and B, we report the bootstrap distribution of paired metric delta and percentile interval
- We track variance per slice to surface brittleness
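A percentile-bootstrap sketch of the first two points, using only the standard library; a fixed seed is included for reproducibility, and a production version would vectorize the resampling:

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document scores:
    resample documents with replacement, collect resampled means,
    and read off the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def paired_delta_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap the mean per-document delta (A - B), resampling
    documents so each document's pair stays intact."""
    deltas = [x - y for x, y in zip(a, b)]
    return bootstrap_ci(deltas, n_boot, alpha, seed)
```

If the paired-delta interval excludes zero, system A's advantage over B is unlikely to be resampling noise; comparing two unpaired intervals for overlap is a much weaker test.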
A model that wins on average but collapses on multi-column scans is not production-ready. Hirize’s evaluation framework surfaces these failure modes before they reach production.
Performance and Robustness Testing
Latency and throughput matter because enterprise document processing runs at scale. Hirize measures:
- Average and tail latency per page at controlled concurrency levels
- Error rates for corrupted or unsupported formats
- Robustness under controlled noise: resolution drop, JPEG artifacts, mild skew, missing grid lines
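Tail latency, in particular, is cheap to summarize from raw per-page timings with the standard library; this is a generic sketch, not Hirize's harness:

```python
import statistics

def latency_summary(samples_ms):
    """Average and tail latency from per-page timings in milliseconds.
    quantiles(n=100) returns the 1st..99th percentile cut points;
    method='inclusive' interpolates between observed values."""
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": pct[49],
        "p95": pct[94],
        "p99": pct[98],
    }
```

Reporting p95/p99 alongside the average matters because a pipeline with a fine mean but a heavy tail will still miss SLAs at controlled concurrency.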
Robustness curves are more actionable than single-point scores. Noise sensitivity tests often reveal steep drops at realistic perturbation sizes—engineering targets for preprocessing or model fine-tuning.
Key Insights From Evaluating Document Extraction at Scale
After processing 500M+ pages, these patterns repeat across corpora and inform engineering priorities:
Reading order dominates perceived quality. Users tolerate light OCR noise but reject outputs that read out of order.
Table structure brittleness causes silent data corruption. Numeric columns shifting by one position pass text checks but break analytics. Alignment and header association metrics catch this.
Per-region metrics beat per-page metrics. They correlate better with human review speed and QA effort.
Continuity checks reduce error reports in long documents. Persisting table headers and section state across pages reduces hallucinated headers and orphaned rows.
Robustness curves reveal engineering targets. Sensitivity tests surface where preprocessing or fine-tuning will have the highest impact.
Practical Guidance for Document Extraction Teams
Based on Hirize’s evaluation methodology, here’s what we recommend:
- Score along orthogonal axes: Text similarity, reading order, structure, continuity, and robustness
- Slice by layout archetype: Single column, multi-column, tables with/without lines, scans vs. digital
- Track uncertainty: Report confidence intervals and paired bootstrap deltas
- Prefer per-region metrics: They correlate better with review effort than per-page scores
- Promote continuity to first-class status: Treat cross-page links as ground truth objects
- Evaluate under perturbation: Measure sensitivity to realistic noise and format changes
The Future of Document Extraction Evaluation
Open problems requiring new evaluation methods:
- Cross-modal consistency across text, charts, stamps, and annotations
- Evolving templates that change section order, table shapes, and boilerplate
- Ground truth creation at scale without label drift
- Input integrity for extractors that include language models
- Cost-efficient evaluation that runs continuously on production samples
Building Production-Ready Document Intelligence
Evaluating document extraction isn’t about chasing a single leaderboard score. It’s about proving your system will keep working on the next million documents you haven’t seen yet.
The only way to do that is to measure what actually breaks pipelines in production. Reading order and table structure are the twin pillars of that effort. Treat them as first-class citizens in your evaluation, and your document extraction models will improve where it matters.
At Hirize, this evaluation methodology is built into our document intelligence infrastructure. Every page we process is held to these standards—which is why our customers trust us with their most critical document workflows.
Want to see how Hirize handles complex document extraction? Request a demo or explore our API documentation.


