
Inside the 24-language OCR pipeline

By Docverix Engineering · 8 min read

The Docverix OCR tool runs entirely in your browser. Drop a scan or photo, pick a language, click extract. Tesseract.js runs as a Web Worker on your machine; the recognised text comes back with a confidence score per word so you can see at a glance which parts need a manual re-read. No upload, no server.

This post is the architecture under that: how the worker is spawned, how language packs lazy-load, how multi-page PDFs get rendered to canvases, how the confidence display works, and the decision logic for when to escalate to the (opt-in) server-side vision model.

The 24-language list — and why grouped

The picker covers five script families:

  • Latin / European — English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Vietnamese
  • Cyrillic — Russian, Ukrainian
  • CJK — Chinese Simplified, Chinese Traditional, Japanese, Korean
  • Indic — Hindi, Bengali, Tamil, Telugu
  • Right-to-left — Arabic, Hebrew, Persian (Farsi), Urdu

Two reasons for grouping. The first is UX: 24 languages in a flat dropdown is one long scrolling list. Grouped, you scan to your script family and then to your specific language in two saccades. The second is honesty: Tesseract performs differently across scripts. Latin and Cyrillic are excellent at clean 300 DPI; CJK and Indic drop noticeably; RTL languages need preprocessing for proper baseline detection. The groups tell the user implicitly what category their language falls into.
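
The grouping above can be expressed as a plain data structure that drives the picker. A minimal sketch — the group labels are the UI grouping from this post, and the keys match Tesseract's standard traineddata codes for each language:

```javascript
// Group label → { traineddata code: display name }
const LANGUAGE_GROUPS = {
  "Latin / European": {
    eng: "English", spa: "Spanish", fra: "French", deu: "German",
    ita: "Italian", por: "Portuguese", nld: "Dutch", pol: "Polish",
    tur: "Turkish", vie: "Vietnamese",
  },
  "Cyrillic": { rus: "Russian", ukr: "Ukrainian" },
  "CJK": {
    chi_sim: "Chinese Simplified", chi_tra: "Chinese Traditional",
    jpn: "Japanese", kor: "Korean",
  },
  "Indic": { hin: "Hindi", ben: "Bengali", tam: "Tamil", tel: "Telugu" },
  "Right-to-left": { ara: "Arabic", heb: "Hebrew", fas: "Persian (Farsi)", urd: "Urdu" },
};

// Look up which script family a traineddata code belongs to.
function groupFor(code) {
  for (const [group, langs] of Object.entries(LANGUAGE_GROUPS)) {
    if (code in langs) return group;
  }
  return null; // not one of the 24 supported languages
}
```

Because the map carries the script-family information, the same lookup can also drive the "expect lower accuracy for this script" messaging.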

Lazy language packs

A Tesseract language pack is 5–15 MB. Loading all 24 upfront would mean a ~250 MB payload, which is absurd. We lazy-load each pack on first use:

import { createWorker } from "tesseract.js";

const worker = await createWorker(language, undefined, {
  logger: (m) => { /* progress events */ },
});
// First call: tesseract.js fetches <lang>.traineddata from CDN.
// Subsequent calls: served from the browser's HTTP cache.

Tesseract.js handles the actual fetch — it pulls from tessdata.projectnaptha.com, which serves the standard tessdata-fast variants. A first-use OCR on English shows a ~3-second download progress bar; subsequent runs are instant.

We surface this in the UI: a small line under the language picker says "First use downloads a ~5–15 MB language pack (cached after)." Stops people from thinking the tool hung when their first OCR run takes a moment.
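
The logger callback above is what feeds that UI. Each message tesseract.js emits has a status string and a 0–1 progress value; a small mapper turns them into the progress-bar label. The status strings below are ones tesseract.js actually emits, but the label copy is illustrative:

```javascript
// Map a tesseract.js logger message ({ status, progress }) to UI copy.
function progressLabel(m) {
  const pct = Math.round((m.progress ?? 0) * 100);
  if (m.status === "loading language traineddata") {
    return { label: `Downloading language pack… ${pct}%`, pct };
  }
  if (m.status === "recognizing text") {
    return { label: `Recognizing… ${pct}%`, pct };
  }
  return { label: m.status, pct }; // other phases: show the raw status
}
```

On a first run the user sees the download phase tick up before recognition starts; on cached runs the download phase never appears, which is exactly the behaviour the helper text promises.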

Multi-page PDF input

The OCR tool accepts PDFs, not just images. A PDF gets rendered page-by-page to off-screen canvases, then each canvas is fed to Tesseract:

async function renderPdfPageToCanvas(file, pageIndex) {
  const pdfjs = await getPdfjs();
  const doc = await pdfjs.getDocument({ data: await file.arrayBuffer() }).promise;
  const page = await doc.getPage(pageIndex + 1);
  const viewport = page.getViewport({ scale: 2 }); // 2x = ~200 DPI
  const canvas = document.createElement("canvas");
  canvas.width = Math.ceil(viewport.width);
  canvas.height = Math.ceil(viewport.height);
  const ctx = canvas.getContext("2d");
  ctx.fillStyle = "#FFFFFF";       // white background — Tesseract
  ctx.fillRect(0, 0, canvas.width, canvas.height); // expects light bg
  await page.render({ canvasContext: ctx, viewport }).promise;
  return canvas;
}

Two non-obvious details. The scale: a factor of 2 bumps the render resolution to ~200 effective DPI even for documents authored at 96 DPI, which is empirically where Tesseract accuracy stops improving with resolution. And the white fill before render is necessary because PDFs without explicit page backgrounds render transparent, and Tesseract mis-thresholds transparent regions as "all foreground".
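
Gluing the pieces together, a sketch of the per-page loop — renderPage and recognize are injected so the loop is plain logic; in the app they would be renderPdfPageToCanvas above and the worker's recognize call (worker.recognize resolves to { data } in tesseract.js):

```javascript
// OCR a multi-page PDF: render each page, recognize it, join with page markers.
async function ocrPdf(file, pageCount, renderPage, recognize) {
  const parts = [];
  for (let i = 0; i < pageCount; i++) {
    const canvas = await renderPage(file, i);   // e.g. renderPdfPageToCanvas
    const { data } = await recognize(canvas);   // e.g. (c) => worker.recognize(c)
    parts.push(`--- Page ${i + 1} ---\n${data.text}`);
  }
  return parts.join("\n\n");
}
```

Running pages sequentially rather than in parallel keeps peak memory down — each 2x-scaled canvas is large, and a 50-page PDF rendered all at once would be a quick way to crash a mobile tab.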

Per-word confidence scoring

Tesseract returns more than text — every recognised word comes with a confidence score (0–100), a bounding box, and alternative readings. We use the confidence score directly for UI highlighting:

function confidenceClass(c: number): string {
  if (c >= 90) return "text-success-700";       // green: trust
  if (c >= 70) return "text-warning-700";       // amber: check
  return "text-danger-700 underline decoration-warning-500/40 decoration-dotted underline-offset-2"; // red + dotted underline
}

The dotted underline on low-confidence words is borrowed from spell-check UI patterns. A user scanning recognised text immediately sees where Tesseract was uncertain and where to double-check against the source image.
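
Wiring that up is a straightforward map over Tesseract's word list. The word shape here ({ text, confidence } inside result.data.words) is tesseract.js's actual result shape; the confidenceClass copy below is a simplified mirror of the one above, with the underline utility classes omitted for brevity:

```javascript
// Simplified mirror of confidenceClass from above.
function confidenceClass(c) {
  if (c >= 90) return "text-success-700";  // green: trust
  if (c >= 70) return "text-warning-700";  // amber: check
  return "text-danger-700";                // red: verify against the source
}

// result.data.words → renderable spans with a confidence class each.
function classifyWords(words) {
  return words.map((w) => ({
    text: w.text,
    confidence: w.confidence,
    cls: confidenceClass(w.confidence),
  }));
}
```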

When to escalate to the AI vision model

Tesseract has hard failure modes:

  • Cursive handwriting — Tesseract is trained on print
  • Mixed-script documents (English paragraphs + Arabic in one image)
  • Very low DPI photos (under ~150 DPI effective resolution)
  • Heavily stylised fonts (logos, distressed lettering)

For these, we surface a "Use AI" toggle in the OCR UI. Flipping it routes the file to a server-side endpoint that calls a vision model. The endpoint is rate-limited per IP (small daily quota), and the file is deleted within 30 minutes — Tesseract is the default; AI is the explicit opt-in.

The decision tree we suggest in the UI is roughly:

  • Clean printed text, supported language → default Tesseract
  • Handwriting, faded scan, mixed-script → flip the toggle
  • Unsupported language (Thai, Greek, more Indic scripts) → flip the toggle
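
That decision tree reduces to a small predicate. A sketch — the threshold, the flag names, and the abridged language set are illustrative assumptions, not the shipped heuristic:

```javascript
// Abridged: the real set contains all 24 supported traineddata codes.
const TESSERACT_LANGS = new Set(["eng", "spa", "fra", "deu", "rus", "jpn", "hin", "ara"]);

// Should the UI suggest flipping the "Use AI" toggle?
function suggestAiToggle({ lang, effectiveDpi, handwriting, mixedScript }) {
  if (!TESSERACT_LANGS.has(lang)) return true; // unsupported language (e.g. Thai, Greek)
  if (handwriting || mixedScript) return true; // Tesseract's hard failure modes
  if (effectiveDpi < 150) return true;         // below Tesseract's useful resolution
  return false;                                // default: local Tesseract
}
```

Note the asymmetry: the function only ever suggests; nothing auto-routes a file to the server, because the whole point is that leaving the device is an explicit user choice.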

Tab keeps running — but only if it's open

A non-obvious failure mode worth flagging: Tesseract's recognition happens in a Web Worker, but the worker dies if the parent tab is closed or even backgrounded for too long on mobile (modern mobile browsers aggressively suspend background tabs). For a 50-page PDF that takes 3–5 minutes to OCR, this matters: closing the tab loses the work.

We surface this in the help article — "Plug into power on a laptop if it's a 50+ page doc" — and on mobile we throw a beforeunload warning during long jobs. A more durable fix would be Service Worker–backed background processing; on the bench but not shipped.
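
The beforeunload guard is a few lines. A sketch — the job-state shape and the "only warn for jobs over ten seconds" threshold are assumptions, not the shipped values:

```javascript
// Assumed app state: the OCR pipeline updates this as jobs start and finish.
const currentJob = { running: false, elapsedMs: 0 };

// Only warn when losing the work would actually hurt (threshold is illustrative).
function shouldWarnBeforeUnload(job) {
  return job.running && job.elapsedMs > 10_000;
}

// Wiring (browser only):
if (typeof window !== "undefined") {
  window.addEventListener("beforeunload", (e) => {
    if (!shouldWarnBeforeUnload(currentJob)) return;
    e.preventDefault();  // modern browsers then show a generic "leave site?" prompt
    e.returnValue = "";  // legacy Chrome requires returnValue to be set
  });
}
```

Browsers deliberately ignore any custom message here and show their own generic prompt, so the help-article copy has to do the real explaining.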

The best part of putting OCR in the browser is the same as everything else: the user can verify it's really local. Open DevTools, watch the Network panel, run a recognition. You'll see the language-pack request once (and then never again, because it caches). You will not see your image.

Accuracy numbers, since you'll ask

On a 100-document calibration set:

  • Clean printed text @ 300 DPI: 95–98% character accuracy
  • Receipt photos (phone-shot, indoor lighting): 88–93%
  • Faded fax / 150 DPI scans: 65–80% (AI toggle helps a lot)
  • Block-letter handwriting: 50–70% (AI toggle helps; cursive much worse)
  • Multi-page PDFs: output from each page is concatenated, with --- Page N --- markers between pages

The confidence-highlighting UI is doing real work here: even at 70% accuracy, a human can quickly scan the red-highlighted words and correct them. That's much faster than proofreading 100% of recognised text against the source image.

Docverix Platform

Need workflow + audit on every doc your team handles?

Docverix Platform turns these tools into a routed, audited pipeline — validator → supervisor → approver, with a complete audit trail.