The Docverix OCR tool runs entirely in your browser. Drop a scan or photo, pick a language, click extract. Tesseract.js runs as a Web Worker on your machine; the recognised text comes back with a confidence score per word so you can see at a glance which parts need a manual re-read. No upload, no server.
This post is the architecture under that: how the worker is spawned, how language packs lazy-load, how multi-page PDFs get rendered to canvases, how the confidence display works, and the decision logic for when to escalate to the (opt-in) server-side vision model.
The 24-language list, and why it's grouped
The picker covers five script families:
- Latin / European — English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Turkish, Vietnamese
- Cyrillic — Russian, Ukrainian
- CJK — Chinese Simplified, Chinese Traditional, Japanese, Korean
- Indic — Hindi, Bengali, Tamil, Telugu
- Right-to-left — Arabic, Hebrew, Persian (Farsi), Urdu
Two reasons for grouping. The first is UX: 24 languages in a flat dropdown is a scrolling list. Grouped, you scan to your script family and then your specific language in two saccades. The second is honesty: Tesseract performs differently across scripts. Latin and Cyrillic are excellent at clean 300 DPI; CJK and Indic drop noticeably; RTL languages need preprocessing for proper baseline detection. The groups tell the user implicitly what category their language falls into.
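Concretely, the grouped picker can be driven by one small data structure. A minimal sketch (the group labels mirror the list above; the ISO 639 codes are Tesseract's standard traineddata identifiers, and `scriptFamily` is an illustrative helper, not a library API):

```ts
// Script-family groups for the language picker. Keys are the group
// labels shown in the UI; values are Tesseract traineddata codes.
const LANGUAGE_GROUPS: Record<string, string[]> = {
  "Latin / European": ["eng", "spa", "fra", "deu", "ita", "por", "nld", "pol", "tur", "vie"],
  "Cyrillic": ["rus", "ukr"],
  "CJK": ["chi_sim", "chi_tra", "jpn", "kor"],
  "Indic": ["hin", "ben", "tam", "tel"],
  "Right-to-left": ["ara", "heb", "fas", "urd"],
};

// Reverse lookup: which script family does a language code belong to?
function scriptFamily(code: string): string | undefined {
  return Object.keys(LANGUAGE_GROUPS).find((group) =>
    LANGUAGE_GROUPS[group].includes(code)
  );
}
```

Rendering each key as an `<optgroup>` gives the two-saccade scan for free.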
Lazy language packs
A Tesseract language pack is 5–15 MB. Loading all 24 upfront would mean a ~250 MB payload, which is absurd. We lazy-load each pack on first use:
```js
const worker = await Tesseract.createWorker(language, undefined, {
  logger: (m) => { /* progress events */ },
});
// First call: tesseract.js fetches <lang>.traineddata from the CDN.
// Subsequent calls: served from the browser's HTTP cache.
```

Tesseract.js handles the actual fetch — it pulls from tessdata.projectnaptha.com, which serves the standard tessdata-fast variants. A first-use OCR on English shows a ~3-second download progress bar; subsequent runs are instant.
We surface this in the UI: a small line under the language picker says "First use downloads a ~5–15 MB language pack (cached after)." Stops people from thinking the tool hung when their first OCR run takes a moment.
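The HTTP cache only covers the download; creating a fresh worker per run still re-initialises the engine. A sketch of memoising one worker per language for the session (the factory is injected, so the cache itself is library-agnostic; `workerCache` is our own name, not tesseract.js API):

```ts
// Cache one worker promise per language so repeat runs in the same
// session skip worker spin-up. The factory is only invoked on the
// first request for a given language.
function workerCache<W>(create: (lang: string) => Promise<W>) {
  const cache = new Map<string, Promise<W>>();
  return (lang: string): Promise<W> => {
    let worker = cache.get(lang);
    if (!worker) {
      worker = create(lang);
      cache.set(lang, worker);
    }
    return worker;
  };
}

// Usage sketch: const getWorker = workerCache((lang) => Tesseract.createWorker(lang));
```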
Multi-page PDF input
The OCR tool accepts PDFs, not just images. A PDF gets rendered page-by-page to off-screen canvases, then each canvas is fed to Tesseract:
```js
async function renderPdfPageToCanvas(file, pageIndex) {
  const pdfjs = await getPdfjs();
  const doc = await pdfjs.getDocument({ data: await file.arrayBuffer() }).promise;
  const page = await doc.getPage(pageIndex + 1);
  const viewport = page.getViewport({ scale: 2 }); // 2x = ~200 DPI
  const canvas = document.createElement("canvas");
  canvas.width = Math.ceil(viewport.width);
  canvas.height = Math.ceil(viewport.height);
  const ctx = canvas.getContext("2d");
  ctx.fillStyle = "#FFFFFF"; // white background — Tesseract
  ctx.fillRect(0, 0, canvas.width, canvas.height); // expects a light bg
  await page.render({ canvasContext: ctx, viewport }).promise;
  return canvas;
}
```

Two non-obvious details. The `scale: 2` bumps the render resolution to ~200 effective DPI even for documents authored at 96 DPI; this is empirically where Tesseract accuracy stops scaling with resolution. And the white fill before rendering is necessary because PDFs without explicit page backgrounds render transparent, and Tesseract mis-thresholds transparent regions as "all foreground".
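Stitching the pages together is then a straightforward loop. A sketch with the renderer and recogniser passed in as parameters (`ocrPdf` and `joinPages` are illustrative names, not part of either library):

```ts
// Join per-page text with the page markers used in the final output.
function joinPages(pageTexts: string[]): string {
  return pageTexts
    .map((text, i) => `--- Page ${i + 1} ---\n${text.trim()}`)
    .join("\n\n");
}

// Render and recognise each page in sequence. Generic over the file and
// canvas types so the orchestration stays testable off-browser.
async function ocrPdf<F, C>(
  file: F,
  pageCount: number,
  renderPage: (file: F, pageIndex: number) => Promise<C>,
  recognize: (canvas: C) => Promise<string>
): Promise<string> {
  const texts: string[] = [];
  for (let i = 0; i < pageCount; i++) {
    texts.push(await recognize(await renderPage(file, i)));
  }
  return joinPages(texts);
}
```

Sequential rather than parallel on purpose: one Tesseract worker keeps memory bounded on a 50-page document.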
Per-word confidence scoring
Tesseract returns more than text — every recognised word comes with a confidence score (0–100), a bounding box, and alternative readings. We use the confidence score directly for UI highlighting:
```ts
function confidenceClass(c: number): string {
  if (c >= 90) return "text-success-700"; // green: trust
  if (c >= 70) return "text-warning-700"; // amber: check
  // red + dotted underline
  return "text-danger-700 underline decoration-warning-500/40 decoration-dotted underline-offset-2";
}
```

The dotted underline on low-confidence words is borrowed from spell-check UI patterns. A user scanning recognised text immediately sees where Tesseract was uncertain and where to double-check against the source image.
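tesseract.js surfaces the per-word data on the recognition result (`data.words`, each with `text` and `confidence`). A sketch of building the highlighted output from it; `renderWords` is our own illustrative helper, and real code would HTML-escape the word text:

```ts
type OcrWord = { text: string; confidence: number };

// confidenceClass as defined above, repeated here for self-containment.
function confidenceClass(c: number): string {
  if (c >= 90) return "text-success-700";
  if (c >= 70) return "text-warning-700";
  return "text-danger-700 underline decoration-warning-500/40 decoration-dotted underline-offset-2";
}

// One span per word, classed by the word's own confidence score.
function renderWords(words: OcrWord[]): string {
  return words
    .map((w) => `<span class="${confidenceClass(w.confidence)}">${w.text}</span>`)
    .join(" ");
}
```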
When to escalate to the AI vision model
Tesseract has hard failure modes:
- Cursive handwriting — Tesseract is trained on print
- Mixed-script documents (English paragraphs + Arabic in one image)
- Very low DPI photos (under ~150 DPI effective resolution)
- Heavily stylised fonts (logos, distressed lettering)
For these, we surface a "Use AI" toggle in the OCR UI. Flipping it routes the file to a server-side endpoint that calls a vision model. The endpoint is rate-limited per IP (small daily quota), and the file is deleted within 30 minutes — Tesseract is the default; AI is the explicit opt-in.
The decision tree we suggest in the UI is roughly:
- Clean printed text, supported language → default Tesseract
- Handwriting, faded scan, mixed-script → flip the toggle
- Unsupported language (Thai, Greek, more Indic scripts) → flip the toggle
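The toggle is manual, but the same per-word confidence data could drive the suggestion. A hypothetical nudge heuristic (the threshold of 70 matches the amber band above; `suggestAiEscalation` is not a real API, just a sketch of the idea):

```ts
type ScoredWord = { confidence: number };

// Mean word confidence across a recognition result; 0 for empty results.
function meanConfidence(words: ScoredWord[]): number {
  if (words.length === 0) return 0;
  return words.reduce((sum, w) => sum + w.confidence, 0) / words.length;
}

// Suggest the AI toggle when Tesseract itself reports low confidence:
// a cheap proxy for handwriting, faded scans, or a mismatched script.
function suggestAiEscalation(words: ScoredWord[], threshold = 70): boolean {
  return meanConfidence(words) < threshold;
}
```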
Tab keeps running — but only if it's open
A non-obvious failure mode worth flagging: Tesseract's recognition happens in a Web Worker, but the worker dies if the parent tab is closed or even backgrounded for too long on mobile (modern mobile browsers aggressively suspend background tabs). For a 50-page PDF that takes 3–5 minutes to OCR, this matters: closing the tab loses the work.
We surface this in the help article ("Plug into power on a laptop if it's a 50+ page doc"), and on mobile we show a beforeunload warning during long jobs. A more durable fix would be Service Worker–backed background processing; that's on the bench but not shipped.
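The warning itself is a few lines. A sketch, assuming the 50-page threshold from the help-article guidance (`guardAgainstTabClose` is our own name; it no-ops for short jobs and outside a browser):

```ts
const LONG_JOB_PAGES = 50; // threshold echoing the help-article guidance (assumed)

function isLongJob(pageCount: number): boolean {
  return pageCount >= LONG_JOB_PAGES;
}

// Register a beforeunload warning for long jobs; returns an unsubscribe
// to call when the job completes, so short visits are never nagged.
function guardAgainstTabClose(pageCount: number): () => void {
  const g = globalThis as any;
  if (!isLongJob(pageCount) || typeof g.addEventListener !== "function") {
    return () => {}; // short job, or not running in a browser
  }
  const warn = (e: any) => {
    e.preventDefault(); // triggers the browser's "leave site?" prompt
  };
  g.addEventListener("beforeunload", warn);
  return () => g.removeEventListener("beforeunload", warn);
}
```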
The best part of putting OCR in the browser is the same as everything else: the user can verify it's really local. Open DevTools, watch the Network panel, run a recognition. You'll see the language-pack request once (and then never again, because it caches). You will not see your image.
Accuracy numbers, since you'll ask
On a 100-document calibration set:
- Clean printed text @ 300 DPI: 95–98% character accuracy
- Receipt photos (phone-shot, indoor lighting): 88–93%
- Faded fax / 150 DPI scans: 65–80% (AI toggle helps a lot)
- Block-letter handwriting: 50–70% (AI toggle helps; cursive much worse)
- Multi-page PDFs concatenate page text, with each page's output separated by `--- Page N ---` markers
The confidence-highlighting UI is doing real work here: even at 70% accuracy, a human can quickly scan the red-highlighted words and correct them. That's much faster than proofreading 100% of recognised text against the source image.