Introducing Sarvam Akshar
A document intelligence workbench built on Sarvam Vision, enabling grounded reasoning, layout-aware extraction, and automated proofreading.

We recently released Sarvam Vision – a state-of-the-art 3B vision-language model for document intelligence in English and 22 Indian languages. The model achieves best-in-class scores on global benchmarks like olmOCR-Bench and OmniDocBench for English, and leads the Sarvam Indic OCR Bench for Indian languages, outperforming frontier models such as Gemini 3 Pro, Opus 4.5, and GPT-5.2.
Today we release Akshar – a workbench tailored to solve last-mile problems in knowledge extraction. Akshar functions as the intelligence layer atop Sarvam Vision, moving beyond passive text extraction to active reasoning. The platform equips agents with critical information – visual grounding, semantic layout details, block-level extractions – to enable automated proofreading and error correction.
The Limits of Existing Solutions for Document Digitization
Legacy OCR Stack
Classical OCR engines – ranging from open-source staples like Tesseract and EasyOCR to enterprise APIs like Google Cloud Vision – face significant architectural constraints when processing complex, unstructured documents. These systems typically rely on a bottom-up approach, identifying characters and words in isolation without understanding the semantic or spatial context of the page. This limitation manifests as a failure in layout analysis, where multi-column content is read linearly across the page, resulting in discontinuous text.
Furthermore, these models struggle profoundly with Indic scripts, frequently misinterpreting complex conjuncts and diacritics (matras), or failing to distinguish between body text and marginalia. This lack of semantic and contextual understanding of the page renders the output unusable for downstream use cases that require full accuracy.
Multimodal Large Language Models
While modern VLMs have made major strides in document understanding, they are no silver bullet. Frontier models continue to yield low-accuracy outputs as document complexity increases (e.g., historical newspapers in Indian languages, publications with embedded charts, among others). This class of models has solved problems such as reading order prediction, key-value extraction, and multimodal understanding. However, it introduces new gaps: probabilistic outputs, lack of auditability, and a reliance on manual prompt tuning that leads to factual inconsistencies.
What remains unsolved?
Despite all recent progress, several last-mile complexities persist: (a) zero-shot parsing of complex layouts without training; (b) capturing semantic relationships between elements; (c) visual grounding; (d) automated error localization and correction with human-in-the-loop at scale.
Sarvam Vision x Akshar: A Knowledge Extraction Workbench
Built upon Sarvam Vision’s capabilities, Akshar takes a more fundamental approach to document processing. Consider the case of digitizing historical documents dating back to the 1800s. In 19th-century Gujarati or Tamil manuscripts, archaic fonts and complex conjuncts and matras are often hallucinated into modern spellings by baseline models. A raw API output requires a linguist to compare the text against the image line by line – a prohibitively slow process. Akshar solves this by coupling the VLM with an agentic loop. The workbench flags script uncertainties, allowing experts to validate hundreds of pages in the time it typically takes to transcribe one.
Visual grounding
Pinpoint exact coordinates of text and elements in documents.
(Interactive demo: Akshar detects 94 extracted blocks in the source document.)
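Grounded output of this kind can be modeled as blocks that pair extracted text with pixel coordinates on the page. The sketch below is purely illustrative – the field names and structure are assumptions, not Akshar's actual schema:

```python
from dataclasses import dataclass

@dataclass
class GroundedBlock:
    """One extracted block, anchored to its location on the page.

    Field names are illustrative, not the actual Akshar schema.
    """
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    page: int
    confidence: float  # model's confidence in the transcription

def blocks_in_region(blocks, region):
    """Return blocks whose boxes fall entirely inside a rectangular region."""
    rx0, ry0, rx1, ry1 = region
    return [
        b for b in blocks
        if b.bbox[0] >= rx0 and b.bbox[1] >= ry0
        and b.bbox[2] <= rx1 and b.bbox[3] <= ry1
    ]

blocks = [
    GroundedBlock("Annual Report", (40, 30, 320, 70), page=1, confidence=0.99),
    GroundedBlock("Revenue grew 12%", (40, 400, 500, 430), page=1, confidence=0.91),
]
# Query the top band of the page, e.g. to isolate a header.
header = blocks_in_region(blocks, (0, 0, 600, 100))
print([b.text for b in header])  # ['Annual Report']
```

Because every block carries coordinates, a reviewer (or an agent) can jump from any extracted string straight back to the exact spot on the source image.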
Layout understanding
Build a comprehensive representation of each page by understanding the semantic structure of the document.
(Interactive demo: Akshar detects 109 extracted blocks in the source document.)
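A layout-aware representation goes beyond a flat text stream: each block carries a semantic role and a reading-order index, so multi-column pages reassemble correctly and marginalia stays out of the body text. A hedged sketch of such a structure (the block-type vocabulary and fields here are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class LayoutBlock:
    kind: str           # e.g. "heading", "paragraph", "table", "marginalia" (illustrative)
    text: str
    reading_order: int  # position in the page's logical reading sequence
    column: int         # column index for multi-column layouts

def linearize(blocks):
    """Emit text in logical reading order, skipping marginalia --
    the opposite of a naive left-to-right raster scan."""
    body = [b for b in blocks if b.kind != "marginalia"]
    return "\n".join(b.text for b in sorted(body, key=lambda b: b.reading_order))

page = [
    LayoutBlock("heading", "Section 2: Results", 0, column=0),
    LayoutBlock("paragraph", "Column one continues here...", 1, column=0),
    LayoutBlock("marginalia", "p. 14", 99, column=1),
    LayoutBlock("paragraph", "...and resumes in column two.", 2, column=1),
]
print(linearize(page))
```

A classical bottom-up OCR engine would interleave the two columns line by line; with explicit reading order, the page reads as the author intended.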
Enrich visual components with text
Caption charts, images, and other visual elements.
(Interactive demo: Akshar detects 20 extracted blocks in the source document.)
Agent Orchestration: Plan → Reason → Act
The best document intelligence models work well on the majority of documents, achieving up to 90-95% accuracy. But in domains such as healthcare, finance, and government, 100% accuracy is critical to aid important decision-making, and real-world complexities in these documents can break even the most sophisticated systems.
Akshar represents a paradigm shift – from passive extraction to semantic understanding and contextual reasoning. By harnessing multimodal LLMs, the platform functions as a cohesive intelligence layer that reads, interprets, and executes complex flows.
How Agents Work
- Baseline Extraction: Documents are processed through Sarvam Vision’s harness modules and VLM for extraction; agents then identify and fix common errors.
- Agent Capabilities:
  - Digitization: At upload time, any user-provided instructions for a document are executed autonomously across each page.
  - Proofreading: Human-in-the-loop review is now easier. Using the chat interface, interact with the agent to fix issues across pages or in selected blocks.
- Actions and Memory: The agent performs tasks and updates its memory for future recall.
All agent actions are proposed as reviewable suggestions for users to accept or reject, putting human trust and auditability front and center.
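The plan → reason → act loop with reviewable suggestions can be sketched as follows. Everything here – the function names, the suggestion shape, the toy fix-up rule standing in for the model call – is a hypothetical illustration of the flow, not Akshar's implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Suggestion:
    """A proposed edit, held for human review (illustrative shape)."""
    block_id: int
    original: str
    proposed: str
    rationale: str
    accepted: Optional[bool] = None  # None until a human accepts or rejects

def plan(instruction, blocks):
    """Break a user instruction into per-block tasks (trivially here)."""
    return [(i, instruction) for i in range(len(blocks))]

def reason_and_act(block_text, task):
    """Stand-in for the model call: a toy rule that fixes one common
    letter/digit OCR confusion ('A.4' read where '4.4' was printed)."""
    fixed = block_text.replace("A.4", "4.4")
    if fixed != block_text:
        return fixed, "Non-numeric token in a numeric column; likely OCR confusion."
    return block_text, None

def run_agent(instruction, blocks):
    suggestions = []
    for block_id, task in plan(instruction, blocks):
        proposed, why = reason_and_act(blocks[block_id], task)
        if why:  # only surface actual proposed changes
            suggestions.append(Suggestion(block_id, blocks[block_id], proposed, why))
    return suggestions  # a human accepts or rejects each one

blocks = ["POTASSIUM | 4.0 | A.4 | 3.9", "CHLORIDE | 100"]
for s in run_agent("fix numeric OCR errors", blocks):
    print(f"block {s.block_id}: {s.original!r} -> {s.proposed!r}")
```

The key design point is the return type: the loop emits suggestions rather than silently mutating the document, which is what keeps the pipeline auditable.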
Industry Extensions
Akshar’s capabilities extend across various document types and domains:
Example: a pathology lab report digitized by Akshar (9 blocks detected):
| HAEMATOLOGY - Hb | 13.9 | |||
| PCV | 45.6 | 44.0 | ||
| TOTAL WBC COUNT | 8570 | 6660 | 11900 | 9620 |
| PLATELET COUNT | 2.85 | 2.80 | 2.18 | |
| DIFFERENTIAL COUNT - N | 78.9 | |||
| M | 5.2 | |||
| L | 14.1 | |||
| E | 1.6 | |||
| ESR - 1/2 Hour | 2 | |||
| 1 Hour | 4 | |||
| GLUCOSE PROFILE - RBS | 345 | |||
| FBS | 173 | |||
| PPBS | 260 | |||
| HBA1C | 9.5 | |||
| RFT - UREA | 25 | |||
| CREATININE | 1.0 | 26 | ||
| ELECTROLYTES - SODIUM | 139 | 140 | 143 | 1.0 |
| POTASSIUM | 4.0 | A.4 | 3.9 | |
| CHLORIDE | 100 | |||
| BICARBONATE | 23. | |||
| LIPID PROFILE | ||||
| TOTAL CHOLESTEROL | 225 | |||
| TGL | 78 | |||
| HDL | 57 | |||
| LDL | 152 | |||
| VLDL | 16 | |||
| RF | 3.9 | |||
| THYROID PROFILE - FREE T3 | 2.18 | (2.0-4.4) | ||
| FREE T4 | 1.38 | (0.93-1.7) | ||
| TSH | 1.137 | (0.35-5.50) | ||
| PANCREATIC PROFILE | ||||
| SERUM AMYLASE | ||||
| SERUM LIPASE | ||||
| LFT - BILIRUBIN - TOTAL | ||||
| DIRECT | ||||
| INDIRECT | ||||
| SGOT | ||||
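Errors like the ones visible in the table above ("A.4" in the potassium row, a trailing "23." for bicarbonate) are exactly what automated error localization targets. A minimal, hypothetical validation pass that flags cells failing a numeric check, so a human only reviews the suspect values:

```python
import re

def flag_suspect_cells(rows):
    """Flag value cells that are not clean numbers.

    rows: list of (test_name, [cell, ...]) tuples from an extracted table.
    Returns (test_name, column_index, cell) for each suspect cell.
    """
    numeric = re.compile(r"^\d+(\.\d+)?$")
    flagged = []
    for test, cells in rows:
        for i, cell in enumerate(cells):
            if cell and not numeric.match(cell):
                flagged.append((test, i, cell))
    return flagged

# A few rows lifted from the lab-report example above.
rows = [
    ("POTASSIUM", ["4.0", "A.4", "3.9"]),
    ("BICARBONATE", ["23."]),
    ("CHLORIDE", ["100"]),
]
for test, col, cell in flag_suspect_cells(rows):
    print(f"{test}, column {col}: suspect value {cell!r}")
```

Simple checks like this localize errors cheaply; the agent (or a reviewer) then consults the grounded image region to decide the correct value.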
Conclusion
By vertically integrating Sarvam Vision, Akshar aims to minimize the friction between model capabilities and real-world applications. It tackles isolated OCR problems with a comprehensive intelligence layer that orchestrates complex tasks. This approach addresses the reliability deficit inherent in modern AI systems due to their probabilistic nature, allowing us to unlock the value trapped in unstructured documents.
Try it!
Akshar is built for real-world document workflows—and we’re just getting started. To kick things off, we’re making Akshar available in Preview at akshar.sarvam.ai, completely free for the month of February.
This is your chance to try it with documents of any kind, explore a wide range of use-cases, and traverse the last mile of accuracy. Every user gets free credits to upload documents, try Agent Mode, and experience Akshar’s capabilities end-to-end.
If you’re an individual or organization interested in using Akshar in production, we encourage you to join our waitlist from within the platform. You can also try Akshar via our Document Intelligence APIs and experience Sarvam Vision on our playground.
Curious what else we're building? Explore our APIs and start creating.