IIIT Hyderabad helps make old debate records easy to search and listen to

Hyderabad: Public access to Punjab’s digitized legislative debates has improved with the development of a new search engine, which is now integrated with audiobooks for greater inclusivity. The next step in this initiative is to bridge the language divide with multilingual translation.
The search engine was developed in collaboration with the International Institute of Information Technology (IIITH), Hyderabad, under the guidance of Prof. Gurpreet Lehal, Consultant at Punjabi University, Patiala, and Prof. C.V. Jawahar, IIITH, along with support from C-DAC, Noida. This project is part of the National Language Translation Mission, Bhashini.
The digitization of debates dating back to 1947 was completed in 2023 as part of the Punjab Digital Library project, aimed at preserving Punjab’s cultural heritage. However, the resulting PDFs were scanned images and could not be searched. Prof. Lehal explains that each PDF often contained three languages, English, Hindi, and Punjabi, in three different scripts (Latin, Devanagari, and Gurmukhi). “The first challenge was to develop an OCR system that could recognize the correct script and convert it into text accurately. The next challenge was to make the content searchable, so users could easily find relevant historical debates,” he says. For example, typing “Punjabi Suba” in Hindi prompts the engine to search a database of over two lakh pages and retrieve all relevant references in all three languages.
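To illustrate the script problem, here is a minimal Python sketch that guesses which of the three scripts a piece of text (for instance, a typed query) is mostly written in, using the standard Unicode block ranges for Devanagari and Gurmukhi. It works on text that is already in Unicode, so it is not the image-level script recognition the OCR system performs; the function names are hypothetical and nothing here is taken from the project's code.

```python
from collections import Counter

# Unicode block ranges for the two Indic scripts named in the article.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # used for Hindi
    "Gurmukhi": (0x0A00, 0x0A7F),    # used for Punjabi
}


def script_of(ch):
    """Return the script name for one character, or None if it is not a letter we track."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"  # used for English
    return None  # spaces, digits, punctuation, other scripts


def dominant_script(text):
    """Return the script that most characters in `text` belong to, or None."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Punjabi Suba"))
print(dominant_script("पंजाबी सूबा"))
print(dominant_script("ਪੰਜਾਬੀ ਸੂਬਾ"))
```

A query router could use a check like this to decide which transliteration or index to consult first, though how the actual engine handles cross-script queries is not described in the article.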
The search engine also handles variations in spelling and pronunciation. For example, a search for “Prakash Singh Badal” also matches the spelling “Parkash,” and the engine corrects minor spelling errors. Prof. Lehal highlights that the search engine enhances transparency in governance by allowing users to track the topics debated by any MLA and how often that MLA took part.
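One simple way to tolerate such spelling variants is approximate string matching. The sketch below uses Python's standard `difflib` module to map a typed name to the closest entry in a list of known names. The article does not say which matching technique the engine uses, so this is only an illustration; the name list, the `find_speaker` helper, and the 0.8 similarity cutoff are all assumptions.

```python
from difflib import get_close_matches


def find_speaker(query, known_names, cutoff=0.8):
    """Return the known name closest to `query`, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the name in its original casing.
    by_key = {name.casefold(): name for name in known_names}
    hits = get_close_matches(query.casefold(), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None


# Illustrative list only; a real index would be built from the OCR output.
names = ["Parkash Singh Badal", "Partap Singh Kairon", "Giani Zail Singh"]

print(find_speaker("Prakash Singh Badal", names))
```

A production search engine would more likely rely on phonetic keys or edit-distance indexes that scale to lakhs of pages; `difflib` is used here only because it is in the standard library.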
Another significant aspect of this initiative is the inclusion of visually impaired individuals. Legislative archives are now accessible as audiobooks, a process facilitated by Krishna Tulsyan, a researcher at IIITH. Tulsyan explains, “We use consortium OCR to extract Unicode text from the PDFs and then convert the text to speech using Bhashini TTS, making it available in formats like MP3 or DAISY.”
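The two-stage flow Tulsyan describes, OCR followed by text-to-speech, can be sketched as a small pipeline. The article does not describe the interfaces of the consortium OCR or Bhashini TTS, so the `ocr` and `tts` arguments below are hypothetical placeholders for whatever client functions a real deployment would supply, and the stubs exist only to show the data flow.

```python
def build_audiobook(pages, ocr, tts):
    """Run OCR and then TTS on each page; return one audio clip per non-empty page.

    `ocr` maps a page image (bytes) to Unicode text.
    `tts` maps Unicode text to encoded audio (bytes), e.g. MP3 data.
    Both are placeholders, not real Bhashini or consortium OCR calls.
    """
    clips = []
    for page in pages:
        text = ocr(page).strip()
        if text:  # skip blank pages so no empty audio is produced
            clips.append(tts(text))
    return clips


# Stub implementations, used only to exercise the pipeline.
def fake_ocr(page):
    return page.decode("utf-8")


def fake_tts(text):
    return text.upper().encode("utf-8")


clips = build_audiobook([b"first page", b"   ", b"second page"], fake_ocr, fake_tts)
print(len(clips))
```

Passing the two stages in as functions keeps the pipeline independent of any particular OCR or TTS service, which also makes it easy to test with stubs like the ones above.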
The project’s next phase aims to make the legislative archives accessible in all Indian languages. “If a debate is in Punjabi, it could be made available in Marathi, for example, to a native Marathi speaker,” Prof. Lehal explains. The conversion of text to Unicode has laid the groundwork for further advancements, such as search integration with Bhashini’s machine translation system.
Additionally, converting legislative archives into audiobooks in various languages will further enhance accessibility. The integration of a Large Language Model could allow for intelligent, conversational search capabilities. Prof. Lehal envisions that users will be able to ask natural language questions, such as “What were the key discussions on agricultural reforms in the 1980s?” or “Compare political stances on Punjabi Suba across party lines,” and receive context-aware, summarized responses.
With Punjab taking the lead in transparent governance, it is expected that other state legislatures will follow suit.