Gonzalo Navarro is featured in People of the ACM - Millennium Institute Foundational Research on Data

Gonzalo Navarro, an academic from the Department of Computer Science at the University of Chile and a researcher at the Millennium Institute of Computer Science at the University of Chile, has been working on a new research project in the field of computer science in Chile since August 2023. Millennium Institute Foundational Research on Datais named today as "People of the ACM", a distinction that highlights the unique scientific achievements and personal history of members of the Association for Computing Machinery (ACM) who are at the forefront and making a difference for the advancement of computing, both as a science and as a profession.

These newsletters feature and highlight People ACM members whose personal and professional stories are a source of inspiration for the computing community at large. Gonzalo Navarro is the third IMFD researcher to receive this recognition, which in May 2018 was given to Ricardo Baeza-Yates and in January 2021 to Marcelo Arenas.

The ACM is the world's largest educational and scientific computing society, providing resources that advance computing as a science and profession. ACM provides the premier digital library for the field of computing, and serves its members and the computing professions with cutting-edge publications, conferences, and professional resources.

In the interview, Gonzalo Navarro stresses the importance of teamwork for the development of his various research projects, as well as the need to choose the right research problems. Below is a translation of the original article in English.

People of the ACM

August 22, 2023
How did you initially become interested in algorithms and data structures?

It has been a long journey! I am Argentinean and I started my UNDERGRADUATE career at the National University of La Plata. That was in the 1980s - Computer Science careers were just taking off in Latin America and university resources were limited (we had a few computers for hundreds of students). Still, the program was solid and well thought out, and I am grateful for what it gave me. However, as is still the case in many places in Latin America, the curriculum in algorithms and data structures was very basic. After the third year I moved to ESLAI, a cutting-edge Computer Science educational project of UNDERGRADUATE in Argentina that selected a few dozen students per year and gave them full support with scholarships, equipment, housing and, most importantly, international professors. I had my first serious algorithms course with Giorgio Ausiello (still today working at the University of Rome La Sapienza), based on the book The Design and Analysis of Computer Algorithms by Aho, Hopcroft and Ullman, and it was like first love: you never forget it. A second course was given by Jorge Olivos, a brilliant professor from the University of Chile who was on sabbatical year in Argentina, and who finally defined where I live and work today.

After graduating and entering the job market, I realized how boring it was when you're not learning all the time. It was the 90's, and I guess working today in companies like Google and the like must be much more exciting. So I contacted Jorge Olivos for advice and he convinced me to go to Chile to do a master's degree. He was already semi-retired, so he put me in touch with Ricardo Baeza-Yates, who was returning from his PhD work at the University of Waterloo with a new and beautiful topic that became my second love: text search. I still remember my fascination with the elegance and beauty of the suffix array data structure. I did my MS and PhD with Ricardo Baeza-Yates in structured text retrieval and approximate string matching. Combining text search with text compression, my other main area of interest since then, came up with Nivio Ziviani, a colleague of Ricardo's from the Federal University of Minas Gerais, Brazil, and another prominent player in the development of web search engines in Latin America. My current interest generalizes those experiences and encompasses the whole area of compact data structures, a cross between data structures and information theory.

Your paper "On Compressing and Indexing Repetitive Sequences," (co-authored with Sebastian Kreft) was included in a compendium of the most cited papers in theoretical computer science. What challenge did you seek to address with this paper and how did the introduction of LZ-End solve it?

The versions of this paper at the conference date back to 2010, when the term "data avalanche" was being heard everywhere. I was beginning to realize that much of the fastest growing text data, had little information content because the text collections were so repetitive, and this gave us the opportunity to stem the data avalanche in those cases. Think of genome collections or same-species reads in bioinformatics, or versioned document collections like Wikipedia, GitHub or the Software Heritage project. Today, a decade later, with projects to sequence a million human genomes and the promise of personalized medicine on the horizon, the problem of indexing huge repetitive collections is getting a lot of attention, from practice in bioinformatics labs to theory in the study of repeatability and how it appears in text indexing data structures.

At that time, the concept of self-indexes, which are compressed representations of texts that can be efficiently accessed and searched without even decompressing them, had existed for a decade in the form of compressed suffix arrays. However, these auto-indexes focused on statistical compression, which is blind to repeatability. Other compression models, such as dictionary-based compression, captured repetition, but accessing the text in compressed form was much more challenging, and even more challenging was searching for it without decompressing it. We had already obtained good results with techniques such as grammar-based compression or suffix arrays with run-length compression, but not with the big prize: the original 1976 Ziv-Lempel parsing, the basis of compressors such as gzip and p7zip, provided the best compression on repetitive sequences, and was the most difficult model to use.

Sebastian Kreft was an outstanding master's student and took on this challenge. We figured out how to adapt old ideas of Juha Kärkkäinen and Esko Ukkonen (whom I knew from my postdoc at the University of Helsinki in 1999) to reduce text searches to geometric queries and permutations, and to efficiently access compressed text, inventing the LZ-End variant of Lempel-Ziv. Our compressed text index still offers the best space in repetitive collections, with competitive search times.

Your paper "Colored Range Queries and Document Retrieval," (co-authored with Travis Gagie, Juha Kärkkäinen and Simon J. Puglisi) is also recognized for making important contributions to text search. What are colored range queries and what were the key insights from this paper?

Coloring is just a "colorful" name for categorizing objects into classes. You have a set of objects, each with a "color" (its class) and you perform queries that group them by color. In color range queries, you have a variety of colors and you want to answer questions like "What colors appear in this range?" "Which are the most frequent?" and so on. Abstract problems of this type are studied in theoretical computing and their solutions can be applied to a wide range of situations.

In this paper, we gave new solutions to these problems and then showed how they could be applied to document retrieval. This area extends, in a sense, the classical discipline of information retrieval, which works on natural language text, to apply to collections of general sequences. It then answers queries such as "Which sequences contain this short pattern?" "Which are those in which it appears most frequently?" and so on, which make sense in areas such as bioinformatics (e.g., finding genes where some DNA marker appears often), data mining (e.g., finding hot topics in a social network over a period of time), etc.

Is there any recent research you would like to highlight?

I would like to refer to the paper "Worst-Case Optimal Graph Joins in Almost No Space", which appeared in ACM SIGMOD 2021, because it pertains to the area of graph databases where I recently started working in the context of my work with the IMFD in Chile. For me, it meant the discovery of a new area with incredibly rich algorithmic content where the avalanche of data puts pressure on the amount of space used by efficient indexes. The best indexes require multiplying the data space to provide different views of the same objects and are therefore impractical with today's data volumes.

This is the perfect scenario to apply compact data structures. Classical structures are highly redundant and compact data structures aim to provide the same functionality while using a space close to the entropy of the data - its true information content - thus eliminating redundancy. We managed to develop new indexes that use almost zero space in addition to the size of simple data, and even compress the data, while supporting database operations competitively and sometimes faster than current systems, which use an order of magnitude more space. I learned a lot in the process as I entered a new area of computing, and my colleagues learned a lot in terms of sophisticated data structures. As one colleague put it, "it's like you've brought in an alien technology." It has been an immensely satisfying adventure that has only just begun.

You have been prolific in terms of authorship. What advice would you give to a younger colleague in terms of work habits for a productive career?

The most important advice is: work with other People! At least for me, the difference between finding and developing ideas alone versus doing it while conversing and meeting with colleagues is huge. Even in cases where ideas are originally mine, they emerge and flow much better in inspiring conversations rather than being caged in my head. An innocent question, a link to an idea from some other area that your colleague knows better, a key observation about something that was right under your nose that you hadn't seen, makes all the difference. And at other times, by sharing an idea with others when you are limited in how to approach a problem, you can understand it better and come up with better solutions, etc. Go to conferences and meet people, do extended stays in other places and learn how other teams work. You will find this much more rewarding than working alone.

My second piece of advice is: choose your problems well. Sometimes you find interesting ideas lost in mediocre publications because the problems are not that relevant. A more experienced colleague, whom you can judge by the impact of his or her publications, can guide you to problems worth investing your talent in. Still, try to work on what you really enjoy, as it is much easier to be good at things that excite you. And your life will be much more colorful!

Original interview in People of the ACM