{"id":1409,"date":"2026-02-04T17:45:37","date_gmt":"2026-02-04T17:45:37","guid":{"rendered":"https:\/\/blog.hirize.ai\/?p=1409"},"modified":"2026-02-04T17:45:37","modified_gmt":"2026-02-04T17:45:37","slug":"why-accuracy-scores-mislead-document-extraction","status":"publish","type":"post","link":"https:\/\/blog.hirize.ai\/index.php\/2026\/02\/04\/why-accuracy-scores-mislead-document-extraction\/","title":{"rendered":"Why Single Accuracy Scores Mislead in Document Extraction"},"content":{"rendered":"<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">At Hirize, we&#8217;ve spent years building and stress-testing document extraction systems across hundreds of millions of pages. One pattern is consistent across every benchmark we run: document extraction is not a single-metric problem. It&#8217;s a multi-dimensional engineering challenge where simple averages hide catastrophic failure modes.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">This post is a field guide for engineers who have lived through the grind of parsing messy PDFs, spreadsheets, scanned faxes, and multi-column filings. We&#8217;ll break down a rigorous evaluation methodology that goes beyond surface-level accuracy scores.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">The Problem With Single Accuracy Scores<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Most document AI discussions rely on a single number\u2014usually accuracy or F1 on a small corpus of clean PDFs. 
That&#8217;s comfortable for comparison, but it ignores four independent axes of failure that each make extraction outputs unusable in production:<\/p>\n<ol class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\"><strong>Character recognition errors<\/strong> at the glyph level<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Layout interpretation errors<\/strong> including reading order and section boundaries<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Structure loss<\/strong> in tables, lists, outlines, and nested hierarchies<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Semantic corruption<\/strong> from model normalization, substitution, or hallucination<\/li>\n<\/ol>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">If any one of these fails, the entire document pipeline fails.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">A page can have perfect OCR but broken reading order that renders text incoherent to any LLM in a retrieval system. A table can have correct wire detection but incorrect structure that destroys header-to-data relationships. 
A contract can have accurate paragraphs but orphaned cross-references that make it impossible to answer basic questions.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">This is why Hirize evaluates document extraction along multiple orthogonal axes\u2014not just a single accuracy score.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">How Hirize Approaches Document Extraction Evaluation<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Our document intelligence evaluation framework scores extraction models along orthogonal axes, with dataset slices and decision rules per axis. The goal is to measure not only average performance, but variance under real operating conditions.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">We evaluate extraction at multiple nodes in the pipeline. For example, Hirize computes text similarity per region immediately after OCR and again after structure mapping to catch ordering or structure regressions.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Reading Order: The Silent Failure Mode<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Reading order breaks whenever models process pages in raster order. Enterprise documents treat pages as physical constraints, not logical boundaries. The result is semantic fragmentation unless the pipeline reconstructs continuity explicitly.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize uses a four-stage reading order subsystem:<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>1. Pre-segmentation<\/strong> identifies blocks, figures, tables, headers, and footers. 
We learn a coarse taxonomy of block types to feed different downstream logic.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>2. Column graph construction<\/strong> converts spatial relations into a directed acyclic graph encoding containment and adjacency. This prevents naive left-to-right ordering from crossing columns or mixing headers\/footers with main content.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>3. Cross-page linking<\/strong> carries context (section titles, table headers, figure references) across page boundaries. We maintain state for open sections and tables, similar to a streaming parser.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>4. Semantic stitching<\/strong> merges block sequences into paragraphs and sections, preserving hierarchy and reference links.<\/p>\n<h3 class=\"text-text-100 mt-2 -mb-1 text-base font-bold\">Reading Order Accuracy Metric<\/h3>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">We score reading order with a normalized edit distance over block sequences. Let \u03c0* be the ground truth ordering and \u03c0\u0302 the predicted ordering:<\/p>\n<div class=\"relative group\/copy bg-bg-000\/50 border-0.5 border-border-400 rounded-lg\">\n<div>\n<pre class=\"code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed\"><code>ROAcc = 1 - EditDistance(\u03c0\u0302, \u03c0*) \/ max(|\u03c0\u0302|, |\u03c0*|)<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">We also report Kendall&#8217;s tau rank correlation to capture pairwise inversions.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize slices ROAcc by layout pattern: single column, two column, multi-column with sidebars, and newspaper-style with interstitial artifacts. 
We report separate scores for cross-page continuity by computing the metric only on edges that cross page boundaries.<\/p>\n<h3 class=\"text-text-100 mt-2 -mb-1 text-base font-bold\">Continuity Checks<\/h3>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize runs explicit tests for continuity failure modes:<\/p>\n<ul class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\"><strong>Header carry-forward:<\/strong> Verify that table headers detected on page k associate with rows on pages k+1..k+m until a terminating header appears<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Definition resolution:<\/strong> Verify that references like &#8220;Section 2.1&#8221; link to the correct section token across page boundaries<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Footnote binding:<\/strong> Check that superscript markers link to footnote bodies<\/li>\n<\/ul>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Table Extraction: The Hardest Problem in Document AI<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Tables fail in ways that don&#8217;t show up in simple text metrics like CER or WER. Without structure, correct wire boundaries are useless. Hirize treats table evaluation as both a tree structure and a set of alignment constraints.<\/p>\n<h3 class=\"text-text-100 mt-2 -mb-1 text-base font-bold\">Structure Metric: TEDS<\/h3>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">We compute Tree Edit Distance-based Similarity (TEDS) between predicted and ground truth table trees. 
Trees are derived from HTML or Markdown with explicit rowspan and colspan markers.<\/p>\n<h3 class=\"text-text-100 mt-2 -mb-1 text-base font-bold\">Alignment Metrics<\/h3>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize adds alignment tests reflecting how downstream systems actually use tables:<\/p>\n<ul class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\">Header-to-column association accuracy across multi-row headers<\/li>\n<li class=\"whitespace-normal break-words pl-2\">Column alignment accuracy for numeric vs. textual columns<\/li>\n<li class=\"whitespace-normal break-words pl-2\">Merged cell preservation rate for rowspan\/colspan recovery<\/li>\n<li class=\"whitespace-normal break-words pl-2\">Row grouping consistency for hierarchical row headers<\/li>\n<\/ul>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Table structure brittleness is the top source of silent data corruption in document extraction. Numeric columns that shift by one position pass text similarity checks yet break analytics pipelines. Alignment and header association metrics catch this failure class.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Text and Layout Metrics<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize computes ANLS (Average Normalized Levenshtein Similarity) at the region level rather than page level. Regions are blocks produced by layout segmentation. 
This prevents high ANLS from masking incorrect ordering or boundary errors.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">We also compute Section Boundary Detection F1 from labeled boundaries\u2014headings, list starts, table starts, figure captions, and appendix markers.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Per-region ANLS prevents layout errors from hiding under high page averages. The region view correlates with human review speed and reduces approval time in QA loops.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Statistical Rigor in Document Extraction Evaluation<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Hirize reports uncertainty and significance explicitly:<\/p>\n<ul class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\">Each metric includes a 95% confidence interval computed with non-parametric bootstrap over documents<\/li>\n<li class=\"whitespace-normal break-words pl-2\">For comparing systems A and B, we report the bootstrap distribution of paired metric delta and percentile interval<\/li>\n<li class=\"whitespace-normal break-words pl-2\">We track variance per slice to surface brittleness<\/li>\n<\/ul>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">A model that wins on average but collapses on multi-column scans is not production-ready. 
Hirize&#8217;s evaluation framework surfaces these failure modes before they reach production.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Performance and Robustness Testing<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Latency and throughput matter because enterprise document processing runs at scale. Hirize measures:<\/p>\n<ul class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\">Average and tail latency per page at controlled concurrency levels<\/li>\n<li class=\"whitespace-normal break-words pl-2\">Error rates for corrupted or unsupported formats<\/li>\n<li class=\"whitespace-normal break-words pl-2\">Robustness under controlled noise: resolution drop, JPEG artifacts, mild skew, missing grid lines<\/li>\n<\/ul>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Robustness curves are more actionable than single-point scores. 
Noise sensitivity tests often reveal steep drops at realistic perturbation sizes\u2014engineering targets for preprocessing or model fine-tuning.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Key Insights From Evaluating Document Extraction at Scale<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">After processing 500M+ pages, these patterns repeat across corpora and inform engineering priorities:<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>Reading order dominates perceived quality.<\/strong> Users tolerate light OCR noise but reject outputs that read out of order.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>Table structure brittleness causes silent data corruption.<\/strong> Numeric columns shifting by one position pass text checks but break analytics. Alignment and header association metrics catch this.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>Per-region metrics beat per-page metrics.<\/strong> They correlate better with human review speed and QA effort.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>Continuity checks reduce error reports in long documents.<\/strong> Persisting table headers and section state across pages reduces hallucinated headers and orphaned rows.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><strong>Robustness curves reveal engineering targets.<\/strong> Sensitivity tests surface where preprocessing or fine-tuning will have the highest impact.<\/p>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Practical Guidance for Document Extraction Teams<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Based on Hirize&#8217;s evaluation methodology, here&#8217;s what we 
recommend:<\/p>\n<ol class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\"><strong>Score along orthogonal axes:<\/strong> Text similarity, reading order, structure, continuity, and robustness<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Slice by layout archetype:<\/strong> Single column, multi-column, tables with\/without lines, scans vs. digital<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Track uncertainty:<\/strong> Report confidence intervals and paired bootstrap deltas<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Prefer per-region metrics:<\/strong> They correlate better with review effort than per-page scores<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Promote continuity to first-class status:<\/strong> Treat cross-page links as ground truth objects<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Evaluate under perturbation:<\/strong> Measure sensitivity to realistic noise and format changes<\/li>\n<\/ol>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">The Future of Document Extraction Evaluation<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Open problems requiring new evaluation methods:<\/p>\n<ul class=\"[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3\">\n<li class=\"whitespace-normal break-words pl-2\"><strong>Cross-modal consistency<\/strong> across text, charts, stamps, and annotations<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Evolving templates<\/strong> that change section order, table shapes, and boilerplate<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Ground truth creation at scale<\/strong> 
without label drift<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Input integrity<\/strong> for extractors that include language models<\/li>\n<li class=\"whitespace-normal break-words pl-2\"><strong>Cost-efficient evaluation<\/strong> that runs continuously on production samples<\/li>\n<\/ul>\n<h2 class=\"text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold\">Building Production-Ready Document Intelligence<\/h2>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">Evaluating document extraction isn&#8217;t about chasing a single leaderboard score. It&#8217;s about proving your system will keep working on the next million documents you haven&#8217;t seen yet.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">The only way to do that is to measure what actually breaks pipelines in production. Reading order and table structure are the twin pillars of that effort. Treat them as first-class citizens in your evaluation, and your document extraction models will improve where it matters.<\/p>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\">At Hirize, this evaluation methodology is built into our document intelligence infrastructure. Every page we process is held to these standards\u2014which is why our customers trust us with their most critical document workflows.<\/p>\n<hr class=\"border-border-200 border-t-0.5 my-3 mx-1.5\" \/>\n<p class=\"font-claude-response-body break-words whitespace-normal leading-[1.7]\"><em>Want to see how Hirize handles complex document extraction? <a class=\"underline underline underline-offset-2 decoration-1 decoration-current\/40 hover:decoration-current focus:decoration-current\" href=\"https:\/\/www.hirize.ai\">Request a demo<\/a> or explore our API documentation.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A 98% accurate document extraction system can still produce 4,000 errors per 1,000 pages. 
Here&#8217;s why single metrics mislead and how Hirize evaluates what actually breaks pipelines.<\/p>\n","protected":false},"author":1,"featured_media":1410,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[15,16,17],"class_list":["post-1409","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-insights","tag-document-intelligence","tag-document-intelligence-api","tag-document-processing-ai"],"_links":{"self":[{"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/posts\/1409","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/comments?post=1409"}],"version-history":[{"count":1,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/posts\/1409\/revisions"}],"predecessor-version":[{"id":1411,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/posts\/1409\/revisions\/1411"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/media\/1410"}],"wp:attachment":[{"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/media?parent=1409"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/categories?post=1409"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.hirize.ai\/index.php\/wp-json\/wp\/v2\/tags?post=1409"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}