
Decoding the Phenomenal Journey of Prof. Sunita Sarawagi

The adage "Follow your passion, and everything else falls into place" aptly suits Prof. Sunita Sarawagi, who has earned several laurels in the field of Computer Science and Engineering. She received her B.Tech. in Computer Science from the Indian Institute of Technology, Kharagpur and completed her M.S. and Ph.D. in Computer Science at the University of California at Berkeley. She is currently serving as Institute Chair Professor, Computer Science and Engineering, IIT Bombay. Read on as Prof. Sarawagi decodes her journey to inspire thousands of students to follow their passion.

❖ Congratulations on being awarded the Infosys Prize in Engineering and Computer Science for your research! Could you please share the story of the research work that earned you the Infosys Prize in 2019?

Prof. Sarawagi: When I moved to India in 1999 after a database Ph.D. in the U.S., I wanted to do some India-specific research. A technocrat in the ministry of I.T. proposed this problem of eliminating duplicates in large lists of unstructured addresses motivated by an application in the income tax department. Around the same time, a start-up contacted me for segmenting Indian addresses. I started appreciating the non-triviality of the two tasks and found opportunities for fascinating ML topics such as active learning, CRFs, graphical models, and applications in the Web domain. It seemed to provide the right blend of overlap between databases and machine learning while also being relevant to the Indian context.

❖ You have been pursuing research in the fields of databases, data mining, machine learning and natural language processing. Can you please explain your research in layman's terms and its implications, especially your work on natural language processing?

Prof. Sarawagi: A huge amount of the world's information is in unstructured text format and scattered across islands of data repositories. However, to harness that knowledge to solve a particular task, e.g., automatically route packets or answer some specific factual query, we need to organise it into a clean, structured format and aggregate it. Trying to achieve this task manually is daunting in scale. My work is in using ML to automate this. For example, ask the Web a simple quantitative query like "What is the escape velocity on Jupiter?". The answer to this query may be available in semi-structured sources like a Web table or a list. Part of my research is converting these tables and lists into a form from which the exact quantitative answer to such queries can be extracted.

❖ You are one of the earliest researchers to develop information extraction techniques that went beyond the world of structured databases to the kind of unstructured data one finds on the World Wide Web. Could you shed some light on the challenges you faced over the course of your career?

Prof. Sarawagi: When I started work in this area, it did not seem like I was one of the early researchers in the area of structure extraction. Early is relative, and with the dizzying pace of machine learning research now, any research from the 2000-2010 decade may be considered ancient.
One big challenge for us working from India in the early 2000s was internet bandwidth. A second challenge was finding qualified Ph.D. students and project staff. Surprisingly, getting funding was never much of a problem.

❖ Your research is based on the development of fundamental principles and has had practical impacts such as DATAMOLD and QuTree. Could you please elaborate on the same?

Prof. Sarawagi: DATAMOLD is software we developed in the early 2000s for converting Indian addresses into their structured forms. That tool was based on Hidden Markov Models, but since then, we have developed more accurate models based on Semi-Markov Conditional Random Fields (Semi-CRFs).
We designed efficient inference algorithms for Semi-CRFs, so their inference time is almost linear in the length of the sequence while allowing the use of entity-level features such as a match with an existing dictionary of entity names. QuTree is software for extracting mentions of numeric quantities and their units from unstructured text. QuTree makes use of probabilistic context-free grammars to combine diverse signals with a human-specified grammar.
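
To give a flavour of segment-level inference with a dictionary-match feature, here is a minimal sketch of semi-Markov Viterbi decoding in Python. It is not the DATAMOLD or Semi-CRF implementation: the dictionary, the labels, and the hand-set scores in `segment_score` are all hypothetical, and label-to-label transition scores are omitted for brevity. A real Semi-CRF would learn the segment scores from data.

```python
from typing import List, Optional, Tuple

# Hypothetical dictionary of known city names (illustrative only).
CITY_DICTIONARY = {"mumbai", "navi mumbai", "new delhi"}
LABELS = ["CITY", "OTHER"]


def segment_score(tokens: List[str], start: int, end: int, label: str) -> float:
    """Toy hand-set score for labelling tokens[start:end] as one segment.

    The key point is that the score looks at the whole segment, so it can
    use an entity-level feature: does the full segment match the dictionary?
    """
    length = end - start
    text = " ".join(tokens[start:end]).lower()
    if label == "CITY":
        # Reward dictionary matches in proportion to their length,
        # so "Navi Mumbai" beats "Mumbai" alone.
        return 2.0 * length if text in CITY_DICTIONARY else -2.0
    # OTHER: neutral for single tokens, penalised for longer segments.
    return 0.0 if length == 1 else -1.0


def semi_markov_viterbi(tokens: List[str], max_len: int = 3) -> List[Tuple[str, str]]:
    """Return the highest-scoring segmentation as (segment text, label) pairs.

    Runs in O(n * max_len * len(LABELS)) time, i.e. linear in the number
    of tokens n for a fixed maximum segment length.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for tokens[:i]
    best[0] = 0.0
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)

    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for label in LABELS:
                score = best[start] + segment_score(tokens, start, end, label)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)

    # Follow back-pointers from the end to recover the segments.
    segments: List[Tuple[str, str]] = []
    end = n
    while end > 0:
        start, label = back[end]
        segments.append((" ".join(tokens[start:end]), label))
        end = start
    segments.reverse()
    return segments


if __name__ == "__main__":
    address = ["12", "MG", "Road", "Navi", "Mumbai", "400703"]
    for segment, label in semi_markov_viterbi(address):
        print(f"{label:5s} {segment}")
```

With these toy scores, "Navi Mumbai" is kept together as a single CITY segment rather than being split into "Navi" and "Mumbai", which is the kind of decision a token-by-token model cannot express directly.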

❖ Can you please share some details on the patents granted to you for your research work? What innovations are you currently working on?

Prof. Sarawagi: Patents are not much of a measure of success for academic research. When I was at IBM, patents were easy to get because the company facilitated them, and most research staff members (including me) got a handful within a few years of working there. At IITB, like most faculty in academia, I did not bother to patent.
My current research is on domain generalisation and adaptation, neural architectures for structured prediction tasks in NLP, learning with high-level supervision, and forecasting models for temporal sequences.

❖ As the Institute Chair Professor in the Department of Computer Science and Engineering, could you share any insights on the Institute's research culture and how it has evolved since you joined?

Prof. Sarawagi: The pace of research at the Institute has certainly been on the upswing since I joined. It is gratifying to see every subsequent generation of faculty working even more fervently on their research than the previous generation.
Unfortunately, the involvement of MTech students in CSE and possibly other engineering disciplines has been declining steadily over the years.

❖ Intrinsically, Computer Science is a rapidly evolving discipline. You have been working in the field for around three decades now. Could you give us some insight into how the field has changed over the years, especially with the evolution of A.I. and ML in recent years?

Prof. Sarawagi: The explosion of research papers on arXiv is a unique aberration of modern times. When I started my Ph.D., a star researcher would graduate with 4 papers; now, even 10 do not seem enough. The rigour with which papers are written and the sanctity of printed matter have been steadily declining. Conferences are getting more heterogeneous, making it difficult to get good reviews. Many of these observations are especially true of the field of AI/ML.

❖ What advice would you give to students keen to pursue research, particularly in your field?

Prof. Sarawagi: In applied machine learning research, it is really important to get intimate with data. One should not just look at aggregate numbers and chase leaderboards. I often get insights by inspecting data at a micro-level.
Second, I cannot overstate the importance of code review and solid test cases to ensure that your numbers are reliable. Third, do not discard a dataset just because the algorithm you propose does not perform well on it. Rather, try to reason about why your method did not work on that dataset. That can lead to one of two outcomes: (1) either you sharpen the specification of when your method is applicable, or (2) you improve the robustness of your method even further. Both outcomes are desirable for you individually and for the health of the field as a whole.

❖ You have worked with globally acclaimed organisations such as IBM Almaden Research, Google Research, and CMU. How was your experience working with these organisations, and how has it helped you shape your research at IIT Bombay?

Prof. Sarawagi: At IBM, I worked in applied areas like data mining, databases, and machine learning, where I did my own coding.
I held on to that habit when I joined IIT Bombay as a faculty member. For the first 15 years of my faculty tenure at IIT Bombay, I did a lot of programming and engaged with projects at a level of detail that is not possible via top-level discussions alone. I enjoyed that a lot and felt more in control of the research and its outcome. But the downside is that I did not have the advantage of the multiplication factor of working with a large team and forging many parallel collaboration threads. At CMU, I got a deeper appreciation of statistical machine learning and started my research on Semi-CRFs. During my sabbatical at Google Research at its headquarters in Mountain View (2014-2016), I came face to face with the deep learning wave and decided that resistance was futile. But such places also made me aware of the limited impact of ML research outside of the big data-rich, talent-rich, and problem-rich corporations. Coming back to IIT Bombay, my enthusiastic students and younger colleagues inspired me to be resilient and keep striving despite that feeling.

❖ What would be your dream project to explore in the future?

Prof. Sarawagi: Currently, my dream project is whatever will produce a high-quality Ph.D. thesis for my students. The field of machine learning is moving at such a dizzying pace that it is difficult to set targets that remain relevant even one year down the road.