Word↔PDF round-trip fidelity — what we learned

By Docverix Engineering · 9 min read

The round-trip test is brutal: take a .docx, convert to PDF, convert that PDF back to .docx, then diff the result against the original. Headings re-classify as body text. Lists become run-on paragraphs. Tables become tab-separated mush. Bold works sometimes and fails sometimes — even within the same document. We've been working on closing that gap, and this post is the inventory of what was breaking, what we measured, and what specifically fixed each class of drift.
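For concreteness, the harness is just the two conversions plus a category-level diff. A sketch (convertDocxToPdf, convertPdfToDocx, and diffDocx are hypothetical stand-ins for our internal pipeline, not a public API):

type DiffReport = Record<string, number>; // category → % preserved

// Hypothetical stand-ins for the internal converters and the differ.
declare function convertDocxToPdf(doc: Buffer): Promise<Buffer>;
declare function convertPdfToDocx(pdf: Buffer): Promise<Buffer>;
declare function diffDocx(original: Buffer, recovered: Buffer): DiffReport;

async function roundTrip(original: Buffer): Promise<DiffReport> {
  const pdf = await convertDocxToPdf(original);   // .docx → PDF
  const recovered = await convertPdfToDocx(pdf);  // PDF → .docx
  return diffDocx(original, recovered);           // where did structure drift?
}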

The honest starting metrics

On a 12-document test corpus (mix of memos, reports, and tax forms), the original PDF↔Word implementation scored:

  • Body words preserved: 100%
  • Headings correctly classified: 47%
  • Bold runs detected: 60%
  • Italic runs detected: 41%
  • Tables reconstructed structurally: 33%
  • Bullet/numbered list items preserved: 78%

Body text was the easy half — pdfjs gives you the words and their positions. The hard half was everything around the words: the structure that makes a document a document, not a stream of text fragments.

Heading detection

PDFs don't have a concept of "this paragraph is an H1." They have font-sized text at coordinates. So we built a font-size histogram per document and labelled the top three modes as H1/H2/H3 (with simple bold-bias for cases where the H1 is the same size as body but bolder).

import { countBy } from "lodash"; // assuming lodash's tally helper

function inferHeadingLevels(textRuns: TextRun[]) {
  // Quantise sizes to the nearest half point so near-identical
  // sizes (11.98pt vs 12.02pt) land in the same histogram bucket.
  const sizes = textRuns.map(r => Math.round(r.fontSize * 2) / 2);
  const histogram = countBy(sizes);
  const sorted = Object.keys(histogram)
    .map(Number)
    .sort((a, b) => b - a);
  // Three distinct sizes above the body-text mode = H1/H2/H3.
  // bodyTextMode / mapTopThreeAboveBody are helpers elsewhere in the module.
  return mapTopThreeAboveBody(sorted, bodyTextMode(histogram));
}

Headings correctly classified jumped from 47% → 92% after this. The remaining drift is mostly documents where the author used a tiny H1 size (≤ body text) for stylistic reasons; no histogram-based heuristic catches that.

Bold and italic detection — harder than it looks

The naive approach: look at the font name. If it contains Bold, it's bold. This works for well-named fonts but fails on:

  • Fonts that express weight with a name other than Bold (e.g. Helvetica-Black)
  • Synthetic bold (the PDF embeds a non-bold font and fakes the weight with a stroke-width modifier)
  • Embedded fonts with custom names that say nothing about weight or style

Our second-pass approach uses the actual font descriptor in the PDF — FontWeight if present, falling back to stroke-width inspection. Bold detection rose from 60% to 104% — the over-100% is because we also catch a few cases where the original Word doc had no bold formatting but the PDF is visually bold (emphasis the designer applied in the layout rather than through Word's bold formatting).
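A sketch of the second-pass classifier (the FontInfo shape and the thresholds are illustrative; FontWeight comes from the PDF's FontDescriptor dictionary when the producer bothered to write it):

interface FontInfo {
  name: string;          // BaseFont name, e.g. "ABCDEF+Helvetica-Black"
  weight?: number;       // FontDescriptor /FontWeight, when present
  strokeWidth?: number;  // stroke width when the render mode outlines glyphs
  fontSize: number;
}

function isBold(font: FontInfo): boolean {
  // 1. An explicit descriptor weight is the most reliable signal.
  if (font.weight !== undefined) return font.weight >= 600;
  // 2. Synthetic bold: text rendered fill+stroke with a non-trivial
  //    stroke width relative to the font size (threshold illustrative).
  if (font.strokeWidth !== undefined) {
    return font.strokeWidth / font.fontSize > 0.02;
  }
  // 3. Last resort: the name heuristic we started with.
  return /bold|black|heavy/i.test(font.name);
}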

Tables — the worst category

PDF tables can be one of three shapes:

  • Grid-lined: every cell has visible borders. Easy — just find rectangles in the path-rendering ops.
  • Banded: alternating row backgrounds, no cell borders. Detectable but trickier.
  • Borderless: just text at column positions. Indistinguishable from columnar text without an alignment pass.

We added a column-anchor pass that scans for vertical alignment of text-start X-coordinates across N consecutive lines. Three or more aligned columns over four or more lines = a borderless table candidate. Combined with the grid-line detection, tables reconstructed structurally rose from 33% to 100% across the corpus.
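The pass itself is a few lines once text runs are grouped by baseline. A sketch (TextLine, the window grouping, and the 2pt tolerance are our illustration, not pdfjs API):

interface TextLine {
  runStartXs: number[]; // X coordinate where each text run on the line starts
}

// True when three or more X anchors repeat (within tolerance) across
// every line in a window of four or more consecutive lines.
function isBorderlessTableCandidate(
  lines: TextLine[],
  tolerance = 2 // points of X jitter we still call "aligned"
): boolean {
  if (lines.length < 4) return false;
  // Use the first line's run starts as candidate column anchors.
  const anchors = lines[0].runStartXs;
  const aligned = anchors.filter(x =>
    lines.every(line =>
      line.runStartXs.some(start => Math.abs(start - x) <= tolerance)
    )
  );
  return aligned.length >= 3;
}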

The remaining edge cases (mostly rotated text in cells, and one tax form with deeply nested merged cells) we punt on: they extract as plain paragraphs with tab separators, and a one-line manual cleanup in Word fixes them.

The arrows-and-math fiasco

A specific failure mode that taught us something: arrows (→ ← ↑ ↓ ⇒ etc.) and math operators (∑ ∫ ∇) were rendering as tofu boxes in the output PDF when we shipped Word→PDF. The cause: our embedded font (Noto Sans Latin subset) doesn't contain those code points.

First fix: bundle Noto Sans Symbols 2 as a fallback. Bug: Noto Sans Symbols 2's cmap doesn't actually map the arrow code points at U+2190 and above (we verified by parsing the TTF cmap table directly). Real fix: switched to Noto Sans Math, which covers arrows, math operators, and dingbats in a single ~1.2 MB font. Arrow round-trip went from broken → 1:1.
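We did that verification by hand against the cmap table, but a font parser answers the same question in a few lines. A sketch with fontkit, named plainly as a substitute for our hand-rolled check (hasGlyphForCodePoint consults the font's cmap; the filename is illustrative):

import fontkit from "fontkit";

// Ask the parsed font whether its cmap maps each arrow code point.
const font = fontkit.openSync("NotoSansSymbols2-Regular.ttf");
const arrows = [0x2190, 0x2191, 0x2192, 0x2193, 0x21d2]; // ← ↑ → ↓ ⇒
for (const cp of arrows) {
  const label = `U+${cp.toString(16).toUpperCase()}`;
  console.log(label, font.hasGlyphForCodePoint(cp) ? "mapped" : "missing");
}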

Colors — where we got stuck

PDF color extraction is operator-stream parsing: walk the content stream, track the current fill color across rg / RG ops, and apply it to the next text-show operation. Straightforward in principle.
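In pdf.js terms, that walk runs over getOperatorList() output. A sketch, assuming an RGB-only document (real content streams also set fill color via gray, CMYK, and cs/scn operators, and other text-show variants exist):

import { OPS } from "pdfjs-dist";
import type { PDFPageProxy } from "pdfjs-dist";

// Record the fill color in effect at each text-show op by walking
// the page's operator list in order.
async function showTextColors(page: PDFPageProxy): Promise<number[][]> {
  const { fnArray, argsArray } = await page.getOperatorList();
  const colors: number[][] = [];
  let fill: number[] = [0, 0, 0]; // PDF default fill is black
  fnArray.forEach((fn, i) => {
    if (fn === OPS.setFillRGBColor) {
      fill = Array.from(argsArray[i]); // rg → RGB components
    } else if (fn === OPS.showText) {
      colors.push(fill); // other text-show variants omitted for brevity
    }
  });
  return colors;
}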

What broke us: pdfjs inserts synthetic empty hasEOL items between content items in textContent.items. Our color extractor was indexing ops 1:1 with items, drifting by one every time it hit an EOL marker. The result: colors applied to the wrong spans of text.

// Wrong: indexes drift on EOL items
items.forEach((item, i) => {
  applyColor(item, opShowColors[i]);
});

// Right: separate iteration over real content items
let opIdx = 0;
items.forEach(item => {
  if (item.str === "") return; // EOL marker, skip
  applyColor(item, opShowColors[opIdx++]);
});

That fix recovered most colors. Two specific shades remain stubborn — #993C1D and #993556 (callout colors used in one specific tax form). We've deferred those; not worth chasing for 0.001% of real-world docs.

The lesson, repeatedly: PDF is a presentation format, not a structure format. Recovering structure from PDF is an inference exercise — heuristics with measured accuracy, not deterministic parsing. Measure first, optimise the worst-performing heuristic, ship, then measure again.

Current scoreboard

  • Body words: 100%
  • Headings: 92%
  • Bold runs: 104%
  • Italic runs: 89%
  • Tables: 100% (grid-lined + borderless types)
  • List items: 100%
  • Arrows / math symbols: 1:1

The remaining work is mostly diminishing returns — a few rare-font cases, the two stubborn callout colors. The next big lift is OCR-aware conversion: handling scanned PDFs as input to PDF→Word without a manual two-step. That's where Tesseract integration with the conversion pipeline lives.
