Why Single Accuracy Scores Mislead in Document Extraction
At Hirize, we’ve spent years building and stress-testing document extraction systems across hundreds of millions of pages. One pattern is consistent across every benchmark we run: document extraction is not a single-metric problem. It’s a multi-dimensional engineering challenge where simple averages hide catastrophic failure modes.
This post is a field guide for engineers who have lived through the grind of parsing messy PDFs, spreadsheets, scanned faxes, and multi-column filings. We’ll break down a rigorous evaluation methodology that goes beyond surface-level accuracy scores.
The Problem With Single Accuracy Scores
Most document AI discussions rely on a single number—usually accuracy or F1 on a small corpus of clean PDFs. That’s comfortable for comparison, but it ignores four independent axes of failure that each make extraction outputs unusable in production:
- Character recognition errors at the glyph level
- Layout interpretation errors including reading order and section boundaries
- Structure loss in tables, lists, outlines, and nested hierarchies
- Semantic corruption from model normalization, substitution, or hallucination
If any one of these fails, the entire document pipeline fails.
A page can have perfect OCR but broken reading order that renders text incoherent to any LLM in a retrieval system. A table can have correct wire detection but incorrect structure that destroys header-to-data relationships. A contract can have accurate paragraphs but orphaned cross-references that make it impossible to answer basic questions.
This is why Hirize evaluates document extraction along multiple orthogonal axes—not just a single accuracy score.
How Hirize Approaches Document Extraction Evaluation
Our document intelligence evaluation framework scores extraction models along orthogonal axes, with dataset slices and decision rules per axis. The goal is to measure not only average performance, but variance under real operating conditions.
We evaluate extraction at multiple nodes in the pipeline. For example, Hirize computes text similarity per region immediately after OCR and again after structure mapping to catch ordering or structure regressions.
Reading Order: The Silent Failure Mode
Reading order breaks whenever models process pages in raster order. In enterprise documents, page breaks are physical artifacts, not logical boundaries: sections and tables flow straight across them. The result is semantic fragmentation unless the pipeline reconstructs continuity explicitly.
Hirize uses a four-stage reading order subsystem:
1. Pre-segmentation identifies blocks, figures, tables, headers, and footers. We learn a coarse taxonomy of block types to feed different downstream logic.
2. Column graph construction converts spatial relations into a directed acyclic graph encoding containment and adjacency. This prevents naive left-to-right ordering from crossing columns or mixing headers/footers with main content.
3. Cross-page linking carries context—section titles, table headers, figure references—across page boundaries. We maintain state for open sections and tables, similar to a streaming parser.
4. Semantic stitching merges block sequences into paragraphs and sections, preserving hierarchy and reference links.
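The column-graph stage above can be sketched with the standard library's `graphlib`. This is a deliberately simplified illustration, not Hirize's implementation: the `Block` record (an ID, a column index from pre-segmentation, and a vertical position) is a hypothetical schema, and real column graphs encode richer containment and adjacency relations.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Hypothetical block record; field names are illustrative only.
@dataclass(frozen=True)
class Block:
    id: str
    column: int   # column index assigned by pre-segmentation
    y: float      # top coordinate of the block

def reading_order(blocks):
    """Order blocks via a small precedence DAG: within a column,
    higher blocks precede lower ones; columns are read left to right.
    A topological sort of the DAG yields the reading order."""
    ts = TopologicalSorter()
    by_col = {}
    for b in blocks:
        by_col.setdefault(b.column, []).append(b)
    cols = sorted(by_col)
    for ci in cols:
        col = sorted(by_col[ci], key=lambda b: b.y)
        ts.add(col[0].id)                 # make sure the node exists
        for a, b in zip(col, col[1:]):    # top-to-bottom edges
            ts.add(b.id, a.id)            # b comes after a
    for c1, c2 in zip(cols, cols[1:]):    # column-to-column edges
        last = max(by_col[c1], key=lambda b: b.y)
        first = min(by_col[c2], key=lambda b: b.y)
        ts.add(first.id, last.id)
    return list(ts.static_order())
```

Because the precedence constraints live in a DAG rather than in a left-to-right scan, a sidebar or footer block can be pinned after the main flow without corrupting column order.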
Reading Order Accuracy Metric
We evaluate reading order as a minimum edit distance over block sequences. Let π* be the ground-truth ordering and π̂ the predicted ordering:
ROAcc = 1 - EditDistance(π̂, π*) / max(|π̂|, |π*|)
We also report Kendall tau rank correlation to capture pairwise inversions.
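Both metrics can be sketched directly from the definitions above, assuming orderings arrive as lists of block IDs (the function names are illustrative, not Hirize's API):

```python
def edit_distance(a, b):
    """Levenshtein distance over block-ID sequences (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ro_acc(pred, truth):
    """ROAcc = 1 - EditDistance / max(|pred|, |truth|)."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))

def kendall_tau(pred, truth):
    """Kendall tau over blocks present in both orderings,
    counting pairwise inversions against the ground truth."""
    rank = {b: i for i, b in enumerate(truth)}
    seq = [rank[b] for b in pred if b in rank]
    n = len(seq)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i] < seq[j]:
                concordant += 1
            elif seq[i] > seq[j]:
                discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / pairs if pairs else 1.0
```

A single swapped pair of adjacent blocks costs two edits under ROAcc but only one inversion under tau, which is why reporting both is informative.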
Hirize slices ROAcc by layout pattern: single column, two column, multi-column with sidebars, and newspaper-style with interstitial artifacts. We report separate scores for cross-page continuity by computing the metric only on edges that cross page boundaries.
Continuity Checks
Hirize runs explicit tests for continuity failure modes:
- Header carry-forward: Verify that table headers detected on page k associate with rows on pages k+1..k+m until a terminating header appears
- Definition resolution: Verify that references like “Section 2.1” link to the correct section token across page boundaries
- Footnote binding: Check that superscript markers link to footnote bodies
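The header carry-forward check, for example, reduces to a small state machine over page-ordered table fragments. The fragment dict below is a hypothetical shape chosen for illustration, not Hirize's internal format:

```python
def check_header_carry_forward(fragments):
    """Each fragment is a hypothetical dict:
    {"page": int, "header": list[str] | None, "rows": list[list[str]]}.
    A header seen on page k governs rows on pages k+1..k+m until a new
    header terminates it. Returns (bound, orphans): rows bound to a
    carried-forward header vs. rows seen before any header at all."""
    active_header = None
    bound, orphans = [], []
    for frag in sorted(fragments, key=lambda f: f["page"]):
        if frag["header"] is not None:      # new header terminates the old one
            active_header = tuple(frag["header"])
        for row in frag["rows"]:
            if active_header is None:
                orphans.append((frag["page"], row))
            else:
                bound.append((active_header, frag["page"], row))
    return bound, orphans
```

In evaluation, the `orphans` list becomes a direct count of continuity failures for the header carry-forward test.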
Table Extraction: The Hardest Problem in Document AI
Tables fail in ways that don’t show up in simple text metrics like CER or WER. Correctly detected grid lines (“wires”) are useless without the cell structure they imply. Hirize treats table evaluation as both a tree-structure comparison and a set of alignment constraints.
Structure Metric: TEDS
We compute Tree Edit Distance-based Similarity (TEDS) between predicted and ground truth table trees. Trees are derived from HTML or Markdown with explicit rowspan and colspan markers.
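A compact sketch of the idea: represent each table as an ordered tree of nested tuples and compute TEDS = 1 − TED / max(|T₁|, |T₂|) with a memoized forest edit distance (the classic Zhang–Shasha decomposition). For simplicity this sketch uses unit insert/delete/relabel costs; the published TEDS metric additionally uses a normalized Levenshtein cost on cell text for relabeling, and a real pipeline would build the trees from HTML with rowspan/colspan markers rather than by hand.

```python
from functools import lru_cache

# Ordered trees as nested tuples: (tag, (child, child, ...)).
def size(tree):
    return 1 + sum(size(c) for c in tree[1])

@lru_cache(maxsize=None)
def forest_dist(F, G):
    """Ordered forest edit distance, unit costs, memoized.
    F and G are tuples of trees; we recurse on rightmost roots."""
    if not F and not G:
        return 0
    if not F:
        return sum(size(t) for t in G)
    if not G:
        return sum(size(t) for t in F)
    t1, t2 = F[-1], G[-1]
    return min(
        forest_dist(F[:-1] + t1[1], G) + 1,   # delete rightmost root of F
        forest_dist(F, G[:-1] + t2[1]) + 1,   # insert rightmost root of G
        forest_dist(F[:-1], G[:-1])           # match the two roots
        + forest_dist(t1[1], t2[1])
        + (t1[0] != t2[0]),                   # relabel cost
    )

def teds(pred, truth):
    """Tree Edit Distance-based Similarity between two table trees."""
    d = forest_dist((pred,), (truth,))
    return 1.0 - d / max(size(pred), size(truth))
```

On a 2×2 table, mislabeling a single `td` as `th` costs one relabel out of seven nodes, so TEDS drops to 6/7 rather than collapsing to zero.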
Alignment Metrics
Hirize adds alignment tests reflecting how downstream systems actually use tables:
- Header-to-column association accuracy across multi-row headers
- Column alignment accuracy for numeric vs. textual columns
- Merged cell preservation rate for rowspan/colspan recovery
- Row grouping consistency for hierarchical row headers
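The first of these is simple to state precisely. A minimal sketch, assuming header associations are represented as a mapping from column index to the full header path (one entry per header row, to cover multi-row headers); the representation is illustrative, not Hirize's schema:

```python
def header_association_accuracy(pred, truth):
    """pred and truth map column index -> header path, where a path
    is a tuple of header cells from top header row to bottom.
    Accuracy is the fraction of ground-truth columns whose complete
    path is recovered exactly; partial path matches score zero."""
    if not truth:
        return 1.0
    hits = sum(1 for col, path in truth.items() if pred.get(col) == path)
    return hits / len(truth)
```

Requiring the full path to match is deliberate: under a multi-row header like ("Revenue", "2023") vs. ("Revenue", "2024"), matching only the top row would hide exactly the column-shift errors this metric exists to catch.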
Table structure brittleness is the top source of silent data corruption in document extraction. Numeric columns that shift by one position pass text similarity checks yet break analytics pipelines. Alignment and header association metrics catch this failure class.
Text and Layout Metrics
Hirize computes ANLS (Average Normalized Levenshtein Similarity) at the region level rather than page level. Regions are blocks produced by layout segmentation. This prevents high ANLS from masking incorrect ordering or boundary errors.
We also compute Section Boundary Detection F1 from labeled boundaries—headings, list starts, table starts, figure captions, and appendix markers.
Per-region ANLS prevents layout errors from hiding under high page averages. The region view correlates with human review speed and reduces approval time in QA loops.
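Both region-level metrics can be sketched from their definitions, assuming each region arrives as a (predicted, ground-truth) text pair and section boundaries as sets of block indices. The τ = 0.5 cutoff follows the common ANLS convention of zeroing out low-similarity matches; the exact threshold Hirize uses is not specified here.

```python
def lev(a, b):
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def region_anls(pairs, tau=0.5):
    """ANLS averaged over layout regions rather than whole pages.
    Similarities below tau are zeroed, per the standard convention."""
    if not pairs:
        return 1.0
    scores = []
    for p, t in pairs:
        denom = max(len(p), len(t)) or 1
        s = 1.0 - lev(p, t) / denom
        scores.append(s if s >= tau else 0.0)
    return sum(scores) / len(scores)

def boundary_f1(pred, truth):
    """F1 over predicted vs. labeled boundary positions."""
    if not pred or not truth:
        return 1.0 if pred == truth else 0.0
    tp = len(pred & truth)
    if not tp:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Note what the region view buys: a page where one sidebar region is garbage scores poorly on that region, instead of being diluted into a comfortable page-level average.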
Statistical Rigor in Document Extraction Evaluation
Hirize reports uncertainty and significance explicitly:
- Each metric includes a 95% confidence interval computed with non-parametric bootstrap over documents
- For comparing systems A and B, we report the bootstrap distribution of paired metric delta and percentile interval
- We track variance per slice to surface brittleness
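A percentile-bootstrap sketch of the first two points, using only the standard library; a fixed seed is included for reproducibility, and a production version would vectorize the resampling:

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document scores:
    resample documents with replacement, collect resampled means,
    and read off the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def paired_delta_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap the mean per-document delta (A - B), resampling
    documents so each document's pair stays intact."""
    deltas = [x - y for x, y in zip(a, b)]
    return bootstrap_ci(deltas, n_boot, alpha, seed)
```

If the paired-delta interval excludes zero, system A's advantage over B is unlikely to be resampling noise; comparing two unpaired intervals for overlap is a much weaker test.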
A model that wins on average but collapses on multi-column scans is not production-ready. Hirize’s evaluation framework surfaces these failure modes before they reach production.
Performance and Robustness Testing
Latency and throughput matter because enterprise document processing runs at scale. Hirize measures:
- Average and tail latency per page at controlled concurrency levels
- Error rates for corrupted or unsupported formats
- Robustness under controlled noise: resolution drop, JPEG artifacts, mild skew, missing grid lines
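Tail latency, in particular, is cheap to summarize from raw per-page timings with the standard library; this is a generic sketch, not Hirize's harness:

```python
import statistics

def latency_summary(samples_ms):
    """Average and tail latency from per-page timings in milliseconds.
    quantiles(n=100) returns the 1st..99th percentile cut points;
    method='inclusive' interpolates between observed values."""
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": pct[49],
        "p95": pct[94],
        "p99": pct[98],
    }
```

Reporting p95/p99 alongside the average matters because a pipeline with a fine mean but a heavy tail will still miss SLAs at controlled concurrency.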
Robustness curves are more actionable than single-point scores. Noise sensitivity tests often reveal steep drops at realistic perturbation sizes—engineering targets for preprocessing or model fine-tuning.
Key Insights From Evaluating Document Extraction at Scale
After processing 500M+ pages, these patterns repeat across corpora and inform engineering priorities:
Reading order dominates perceived quality. Users tolerate light OCR noise but reject outputs that read out of order.
Table structure brittleness causes silent data corruption. Numeric columns shifting by one position pass text checks but break analytics. Alignment and header association metrics catch this.
Per-region metrics beat per-page metrics. They correlate better with human review speed and QA effort.
Continuity checks reduce error reports in long documents. Persisting table headers and section state across pages reduces hallucinated headers and orphaned rows.
Robustness curves reveal engineering targets. Sensitivity tests surface where preprocessing or fine-tuning will have the highest impact.
Practical Guidance for Document Extraction Teams
Based on Hirize’s evaluation methodology, here’s what we recommend:
- Score along orthogonal axes: Text similarity, reading order, structure, continuity, and robustness
- Slice by layout archetype: Single column, multi-column, tables with/without lines, scans vs. digital
- Track uncertainty: Report confidence intervals and paired bootstrap deltas
- Prefer per-region metrics: They correlate better with review effort than per-page scores
- Promote continuity to first-class status: Treat cross-page links as ground truth objects
- Evaluate under perturbation: Measure sensitivity to realistic noise and format changes
The Future of Document Extraction Evaluation
Open problems requiring new evaluation methods:
- Cross-modal consistency across text, charts, stamps, and annotations
- Evolving templates that change section order, table shapes, and boilerplate
- Ground truth creation at scale without label drift
- Input integrity for extractors that include language models
- Cost-efficient evaluation that runs continuously on production samples
Building Production-Ready Document Intelligence
Evaluating document extraction isn’t about chasing a single leaderboard score. It’s about proving your system will keep working on the next million documents you haven’t seen yet.
The only way to do that is to measure what actually breaks pipelines in production. Reading order and table structure are the twin pillars of that effort. Treat them as first-class citizens in your evaluation, and your document extraction models will improve where it matters.
At Hirize, this evaluation methodology is built into our document intelligence infrastructure. Every page we process is held to these standards—which is why our customers trust us with their most critical document workflows.
Want to see how Hirize handles complex document extraction? Request a demo or explore our API documentation.


