What is OCR and why does accuracy matter?
OCR (Optical Character Recognition) converts printed or handwritten text into machine-readable characters. The technology is indispensable in modern work: digitizing invoices, making contracts searchable, and automatically extracting form data are all OCR applications. But accuracy varies dramatically between available solutions. A single recognition error can turn a €1,000 invoice into €10,000, customer account 12345 into 12346, or the date 01/03/2026 into 01/08/2026.
In this article, we compare the four most important OCR engines using our own test data: Tesseract (open source), Google Cloud Vision, AWS Textract, and Azure Read. We measure not just recognition rate, but also error types, processing time, cost, and data privacy aspects.
Test methodology
We tested 200 documents of different types and quality:
- 50 printed documents (invoices, contracts, forms) – high quality, clear text
- 50 photos of documents – taken with smartphone camera, varying lighting and perspective
- 50 handwritten notes – different handwriting styles, pencil and ballpoint pen
- 50 scanned historical documents – faded ink, stains, old typefaces (Fraktur with official Tesseract model; Sütterlin only with custom models)
Each OCR engine was tested with default settings. For Tesseract, we used version 5.3 with the German language package (deu.traineddata). The cloud APIs were called through their official SDKs.
Recognition rate: The raw numbers
| Engine | Printed | Photos | Handwritten | Historical | Average |
|---|---|---|---|---|---|
| Tesseract 5.3 | 97.8% | 91.2% | 34.5% | 68.7% | 73.1% |
| Google Vision | 99.1% | 96.8% | 78.2% | 87.4% | 90.4% |
| AWS Textract | 98.7% | 95.1% | 45.6% | 79.3% | 79.7% |
| Azure Read | 99.3% | 97.4% | 81.7% | 88.9% | 91.8% |
The results show a clear trend: cloud APIs are significantly ahead on photos and handwritten text, while Tesseract almost matches them on well-printed documents. The biggest difference is in handwriting: Google Vision and Azure Read are 44-47 percentage points ahead of Tesseract, while AWS Textract leads by only about 11 points.
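The article does not spell out how the recognition rate is computed. One common definition, assumed here, is character accuracy: 1 minus the Levenshtein edit distance divided by the length of the reference text. The function names below are illustrative, not part of any OCR library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, recognized: str) -> float:
    """Assumed metric: 1 - edit distance / reference length, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not recognized else 0.0
    distance = levenshtein(reference, recognized)
    return max(0.0, 1.0 - distance / len(reference))


# One wrong digit in a five-digit account number and in a ten-character date.
print(char_accuracy("12345", "12346"))
print(char_accuracy("01/03/2026", "01/08/2026"))
```

Under this definition, a single misread digit still leaves a short field at 80-90% accuracy, so a high rate does not rule out a costly error.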
Error types: Not every error is equal
Recognition rate alone doesn't tell the whole story. We categorized errors into four types:
- Critical errors (numbers, IBAN, amounts): "1,000 €" becomes "10,000 €" – can lead to incorrect payments
- Semantic errors (proper names, technical terms): "Müller-Lüdenscheidt" becomes "Müller-Lüdenscheide" – affects searchability
- Formatting errors (columns, tables): Structure is lost – tables recognized as flowing text
- Cosmetic errors (punctuation, spacing): Missing commas or double spaces – affect readability but not content
For printed documents, critical errors account for only 5-8% of total errors. For handwritten notes, the proportion of critical errors rises to 15-22%. Tesseract has a known problem distinguishing similar characters (0/O, 1/l/I, 5/S), which is particularly critical for IBANs and invoice numbers.
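Some critical errors can be caught after recognition, because fields such as IBANs carry a checksum. The sketch below applies the standard mod-97 IBAN check; it is an illustration of post-OCR validation, not something the tested engines do themselves, and it does not check country-specific lengths. The function name is hypothetical.

```python
def is_valid_iban(iban: str) -> bool:
    """Check the mod-97 checksum of an IBAN (country-specific length rules are not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


# Widely used example IBAN, then the same IBAN with the last 0 misread as 8.
print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # True
print(is_valid_iban("DE89 3704 0044 0532 0130 08"))  # False
```

The checksum detects any single misread digit. A failed check is a signal to re-run recognition or flag the document for manual review; it cannot tell which character is wrong.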
Cost comparison
| Engine | Cost per 1,000 pages | Setup cost | Minimum purchase |
|---|---|---|---|
| Tesseract | $0 | High (infrastructure) | None |
| Google Vision | $1.50 | Low | None |
| AWS Textract | $1.50 | Low | None |
| Azure Read | $1.50 (Read) / $10 (Layout) | Low | None |
Tesseract is free, but infrastructure costs (servers, maintenance, updates) are not negligible. For 10,000 pages per month, cloud costs are about $15 at the $1.50 rate (about $100 with Azure Layout), clearly less than the cost of half a developer day spent on Tesseract infrastructure.
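The monthly cost follows directly from the per-1,000-page prices in the table. The sketch below also models a two-tier price, using the AWS Textract figures quoted in this article ($1.50 up to 1M pages, $0.60 above); it assumes the discount applies only to pages beyond the limit. Function names are illustrative, and real bills depend on the provider's current price list.

```python
def monthly_cost(pages: int, price_per_1000: float) -> float:
    """Flat-rate cost in USD for a monthly page volume."""
    return round(pages / 1000 * price_per_1000, 2)


def tiered_cost(pages: int, base_price: float = 1.50,
                discounted_price: float = 0.60,
                tier_limit: int = 1_000_000) -> float:
    """Two-tier cost: base price up to tier_limit, discounted price above it."""
    first_tier = min(pages, tier_limit)
    second_tier = max(0, pages - tier_limit)
    total = first_tier / 1000 * base_price + second_tier / 1000 * discounted_price
    return round(total, 2)


print(monthly_cost(10_000, 1.50))   # standard rate
print(monthly_cost(10_000, 10.00))  # Azure Layout rate
print(tiered_cost(1_500_000))       # volume beyond the 1M-page tier
```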
Data privacy: The elephant in the room
Cloud OCR means your documents are sent to servers operated by Google, Amazon, or Microsoft. For personal data (invoices, contracts, applications), this is a GDPR issue. Cloud providers offer data processing agreements (DPAs), but the details differ:
- Google Vision: Online requests processed in-memory only (not persisted); EU endpoints available (eu-vision.googleapis.com). Google does not use submitted images for training
- AWS Textract: Offers EU regions (Frankfurt, Ireland), does not store data permanently
- Azure Read: EU regions available, comprehensive compliance certifications (ISO 27001, SOC 2, HIPAA, BSI C5) – comparable to AWS and Google Cloud
For GDPR-critical documents, Tesseract is the safest choice since all data is processed locally. Alternatively, all three cloud providers offer EU regions or endpoints that can support GDPR-compliant processing.
With PDF to Text or image conversion on wandlio.de, images are processed locally in the browser – no upload, no privacy risk.
Practical tips: Choosing the right engine
- Printed documents, GDPR-critical: Tesseract with German language package. 97-98% accuracy is sufficient for most applications
- Printed documents, cloud-ok: Azure Read – best accuracy for printed text, good EU support
- Photos of documents: Google Vision or Azure Read – both over 96% recognition rate
- Handwriting: Azure Read (81.7%) or Google Vision (78.2%) – Tesseract is not suitable here
- Historical documents: Azure Read with special training data – or Tesseract with Fraktur model
- High volume, cost-sensitive: AWS Textract – $1.50 per 1,000 pages (volume discount above 1M pages/month: $0.60)
Tesseract in detail: Strengths and weaknesses
Tesseract is the most widely used open-source OCR engine in production. Version 4.0 (2018) introduced LSTM-based neural networks, the biggest accuracy leap in the project's history. Version 5.x is a C++ modernization that adds performance improvements and bug fixes on the same LSTM engine. Strengths:
- Free and privacy-friendly (local processing)
- Supports over 100 languages
- Integrable in Python (pytesseract), Node.js (tesseract.js), and Docker
- PDF input via pdftoppm as a preprocessor (Tesseract itself reads images, not PDFs)
Weaknesses:
- Poor recognition of handwritten text (34.5%)
- No automatic layout recognition (tables, columns)
- Sensitive to perspective distortion and poor lighting
- No automatic post-processing (spell correction)
For simple documents with clear text, Tesseract is sufficient. For complex layouts, handwriting, or historical documents, cloud APIs are the better choice.
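A minimal sketch of the local pipeline mentioned above: pdftoppm renders each PDF page to a PNG, then the Tesseract command-line tool reads each image with the German model. It assumes both tools are installed and on the PATH; the helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Command that renders every PDF page to <prefix>-<n>.png at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path, lang: str = "deu") -> list[str]:
    """Command that prints recognized text to stdout instead of writing a file."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "deu") -> str:
    """Rasterize a PDF and run Tesseract on each page, in page order."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a lexicographic sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        result = subprocess.run(tesseract_command(image, lang),
                                check=True, capture_output=True, text=True)
        texts.append(result.stdout)
    return "\n".join(texts)


# Example call (requires pdftoppm, tesseract, and a real PDF):
# text = ocr_pdf(Path("invoice.pdf"), Path("pages"))
```

Rendering at 300 DPI matches the scaling recommendation in the preprocessing section; lower resolutions run faster but tend to cost accuracy.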
Preprocessing: Why good preparation matters more than the engine
An often underestimated factor for OCR accuracy is image preprocessing. Before the text even reaches the OCR algorithm, simple steps can improve recognition by up to 5-15 percentage points, depending on input quality:
- Binarization: Converting to black-and-white with adaptive thresholding removes color noise and improves contrast
- Orientation correction: Automatic detection and correction of document orientation (0°, 90°, 180°, 270°)
- Noise reduction: Median filters and morphological operations remove stray pixels and scan artifacts
- Scaling: Upscaling to 300 DPI significantly improves recognition for low-resolution originals
In our tests, simple preprocessing (binarization + orientation correction) improved Tesseract recognition on photos from 91.2% to 94.8% – a 3.6 percentage point improvement without changing the engine. For cloud APIs, the effect is smaller (1-2 percentage points) since they already preprocess internally.
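To illustrate what adaptive thresholding does, here is a dependency-free sketch on a grayscale image stored as a list of rows. Each pixel is compared with the mean of its local window, so uneven lighting does not shift the cut-off the way a single global threshold would. This is not the code used in the tests; the window size and offset are assumed values, and a real pipeline would typically use an image library such as OpenCV instead.

```python
def adaptive_threshold(gray: list[list[int]], window: int = 3,
                       offset: float = 5.0) -> list[list[int]]:
    """Binarize a grayscale image (values 0-255) against the local window mean.

    A pixel becomes black (0) if it is darker than the mean of its
    neighbourhood minus the offset, otherwise white (255).
    """
    height, width = len(gray), len(gray[0])
    radius = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# A dark stroke pixel on a bright background.
sample = [
    [200, 200, 200],
    [200,  50, 200],
    [200, 200, 200],
]
for line in adaptive_threshold(sample):
    print(line)
```

The pure-Python loops are slow on full-size scans and are meant only to show the principle.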
Conclusion
The choice of OCR engine depends on the use case: Tesseract is free and privacy-friendly, but significantly weaker than cloud APIs for photos, handwriting, and historical documents. Azure Read offers the best overall accuracy (91.8% average), followed by Google Vision (90.4%). AWS Textract is cost-effective at high volumes but weaker on handwriting. For GDPR-critical documents, Tesseract remains the safest choice – with acceptable accuracy for printed text. Cloud APIs offer the best recognition rate but require GDPR compliance review.
Processing time and resource usage
Processing speed is an often underestimated factor, especially for large document volumes. Our measurements on a standard server (4 CPU cores, 8 GB RAM):
| Engine | 1 page (sec) | 100 pages (sec) | GPU acceleration | RAM usage |
|---|---|---|---|---|
| Tesseract | 2.1 | 210 | No | 200-500 MB |
| Google Vision | 1.5 | 45 | Yes (Google-side) | 0 (API) |
| AWS Textract | 2.0 (sync) | 120 (async) | Yes (AWS-side) | 0 (API) |
| Azure Read | 1.8 | 90 | Yes (Azure-side) | 0 (API) |
For batch processing, Tesseract is significantly slower than the cloud APIs. Across 100 pages, the cloud APIs average 0.45-1.2 seconds per page against 2.1 seconds for Tesseract, which makes them roughly 1.8x to 4.7x faster. Parallel, GPU-backed processing on the provider side is the likely reason. For a single page, the difference is small (1.5-2.0 seconds versus 2.1 seconds).
For real-time applications (mobile apps, web scanners), latency is the critical factor: Google Vision typically responds in 1-2 seconds, while Tesseract on-device takes 2-5 seconds depending on image size and smartphone CPU.
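The per-page averages and the speedup relative to Tesseract follow directly from the 100-page column of the table. A short calculation, using only the measured values above:

```python
TESSERACT_100_PAGES = 210  # seconds, from the table

cloud_100_pages = {
    "Google Vision": 45,
    "AWS Textract": 120,
    "Azure Read": 90,
}


def seconds_per_page(total_seconds: float, pages: int = 100) -> float:
    """Average processing time per page in a batch."""
    return total_seconds / pages


def speedup_vs_tesseract(total_seconds: float) -> float:
    """How many times faster a batch finishes compared with Tesseract."""
    return TESSERACT_100_PAGES / total_seconds


for engine, total in cloud_100_pages.items():
    print(f"{engine}: {seconds_per_page(total):.2f} s/page, "
          f"{speedup_vs_tesseract(total):.2f}x faster than Tesseract")
```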
Combined solutions: Best of both worlds
In production use, a combination strategy has proven effective: Tesseract as the local first pass and as the only engine for GDPR-critical documents and offline scenarios, cloud APIs as the fallback for complex cases (handwriting, photos, historical documents). The decision can be automated:
- Step 1: Tesseract processes the document locally
- Step 2: If confidence is below 95% and the document is not GDPR-critical, it is sent to the cloud API
- Step 3: GDPR-critical documents (IBAN, tax returns) are processed with Tesseract only, regardless of confidence
This architecture combines the data privacy advantages of Tesseract with the accuracy of cloud APIs. Typically, Tesseract recognizes 80-90% of documents correctly; only the remaining 10-20% require cloud support.
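The routing decision can be sketched as follows. The article does not define how document confidence is measured; this example assumes the mean word confidence from Tesseract's TSV output (produced, for instance, with `tesseract image.png stdout -l deu tsv`). The function names and the `gdpr_critical` flag are hypothetical, and how a document gets classified as GDPR-critical is left open.

```python
import csv
import io


def mean_word_confidence(tsv: str) -> float:
    """Average the 'conf' column over rows that contain recognized text.

    Structural rows (page, block, line) carry conf -1 and no text; they are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confidences = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        if text and conf >= 0:
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0


def route(tsv: str, gdpr_critical: bool, threshold: float = 95.0) -> str:
    """Return 'tesseract' to keep the local result, or 'cloud' to escalate."""
    if gdpr_critical:
        return "tesseract"  # never leaves the local machine
    return "tesseract" if mean_word_confidence(tsv) >= threshold else "cloud"


HEADER = ("level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
          "left\ttop\twidth\theight\tconf\ttext")
sample_tsv = "\n".join([
    HEADER,
    "1\t1\t0\t0\t0\t0\t0\t0\t100\t50\t-1\t",
    "5\t1\t1\t1\t1\t1\t5\t5\t40\t12\t96.5\tRechnung",
    "5\t1\t1\t1\t1\t2\t50\t5\t40\t12\t91.5\t1.000",
])

print(route(sample_tsv, gdpr_critical=False))  # mean confidence 94.0 is below 95
print(route(sample_tsv, gdpr_critical=True))
```

Checking the GDPR flag before the confidence keeps sensitive documents local even when Tesseract's result is poor; such documents would need manual review instead of a cloud fallback.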
