For the past year, Sarvam has talked about sovereignty. This week, the Bengaluru-based startup began shipping at a pace more reminiscent of a frontier AI lab such as OpenAI or Anthropic. The latest launch is Sarvam Vision, a three-billion-parameter vision-language model designed to read, understand, and extract information from documents in English and 22 Indian languages. The company is also expected to announce a voice model named Bulbul soon.

On the surface, Sarvam Vision resembles an OCR engine. In practice, it aims to do more: the model captions images, parses tables, understands charts, and converts messy scans into structured data. Less scanner software, more document intelligence.

Sarvam's strongest claim is accuracy on Indic scripts, an area where global systems have historically struggled. On its in-house Sarvam Indic OCR Bench, which includes 20,267 samples across scheduled languages, the model reports word accuracy of 95.91% for Hindi, 92.61% for Bengali, 93.42% for Tamil, 93.13% for Marathi, and 91.60% for Malayalam. Lower-resource scripts such as Santhali and Dogri cross 80% accuracy in several cases.

Most global OCR and vision-language systems are trained primarily in English and later extended to other languages. That approach often breaks down on Indian documents, where fonts, layouts, and scripts vary widely.

On the English olmOCR benchmark, Sarvam says it is competitive across math-heavy pages, tiny fonts, headers, footers, multi-column layouts, and tables, with table recognition touching 88.3%. That detail matters: Indian workflows are dominated by forms, ledgers, invoices, and PDFs. When tables fail, automation fails.

The release also expands Sarvam beyond voice and text into multimodal AI. Last week, the company rolled out Sarvam Audio ASR, a multilingual, multi-speaker speech recognition system.
Most ASR models work well on clean, single-speaker audio but struggle with code-mixed language, interruptions, and overlapping speech, all common in Indian customer-support calls. Sarvam says that in internal tests on the IndicVoices benchmark, its systems reported lower word error rates than GPT-4o-Transcribe and Gemini 3 Flash across unnormalised, normalised, and code-mixed speech.
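Both sets of numbers rest on standard metrics: the OCR figures are word accuracy, and the ASR comparison uses word error rate (WER), where word accuracy is often reported as 1 minus WER. As a point of reference (this is a generic textbook sketch, not Sarvam's evaluation code), WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
```

The distinction the article draws between unnormalised and normalised speech matters here: whether numerals, punctuation, and casing are standardised before comparison can swing WER substantially, which is why benchmarks report both.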