IMFD research to be presented at SIGMOD, the leading international database conference – Millennium Institute Foundational Research on Data

December, 2023. One of the many lines of work developed at the Millennium Institute Foundational Research on Data multimodal graphs (one of the developments in this area is Millennium DB). Multimodal graphs are ways of structuring data, representing relationships between different types of nodes and connections, which allows different types of information to be modeled and represented. One example of where they can be used is in the analysis of transport networks in a given area: the nodes could be bus stops and the connections could be the routes between different means of transport: bicycles, cars, buses, etc. Analyzing this type of information in this format provides a much more complete and complex understanding than analyzing the same information in database formats.

Multimodal graphs contain more semantics than mere connectivity between nodes. This type of information is very common in real life, and to be consulted effectively, it requires enriching the language of pattern graphs with expressions specific to data types. For example, some nodes may be geographic locations and have associated location information, and I would like to be able to search the previous query, but restricted to pharmacies within a certain radius. Or that the current time is between its opening and closing hours. Or that the price is below a certain limit," explains Gonzalo Navarro, an academic at DCC U. Chile and one of the researchers at Millennium Institute Foundational Research on Data "Worst-Case-Optimal Similarity Joins on Graph Databases."

This is the paper recently accepted at the conference SIGMOD'24conference, on which DCC U. Chile academics Benjamín Bustos, Aidan Hogan, and Gonzalo Navarro worked together with DCC UC academics Diego Arroyuelo and Juan , and Adrián Gómez-Brandón from the University of A Coruña,all researchers at Millennium Institute Foundational Research on Data.

"The IMFD allows us to discuss and combine our efforts across different centers and universities," says Juan , director of the IMFD and one of the authors of the paper. "This achievement is a clear example of what we can accomplish by working together at the Millennium Institute Foundational Research on Data, as it is here that we can enhance the research we conduct independently at universities, creating a space for collaboration that allows the country to position itself at the forefront of research in Latin America on this and other topics."

ACM SIGMOD is one of the leading conferences on databases—along with ACM PODS—focusing on the principles, techniques, and applications of database management systems and technologies. In 2024, it will be held in Santiago, Chile.

Adding more information

Regarding this research, Gonzalo Navarro explains: "There has been a lot of recent work on so-called graph databases, which represent information in the form of nodes (objects) and edges that connect nodes (and represent relationships between objects), because they are more flexible than relational databases. These graphs are mainly queried by providing a small pattern graph, which has nodes and edges but where the identities of the nodes and edges can be variable. Each way of assigning values to the variables so that this pattern appears in the graph is a response to the query. As a very simplified example, in a graph with information about pharmacies, a query such as (F, sells, X), (X, component, paracetamol), (X, price, P), (F, has-agreement, Fonasa), gives me the pharmacies F with an agreement with Fonasa that sell a medicine X with paracetamol, and the price P at which they sell it. Responding to these queries efficiently is not easy, and it is not uncommon for some to take seconds or even minutes. A quality standard for these algorithms is to be worst-case optimal. This type of algorithm is relatively recent, and there is still a lot of research being done on them.

This scientific article addresses the problem of adding metric information to objects and proximity conditions to pattern graphs. It explains that metrics are generalizations of Euclidean distances (such as those in a plane or space), covering not only the example of geographical distance but also others such as similarity between images, audio, video, texts, etc. (which are modeled as distances in very high-dimensional spaces). The objective, Navarro points out, was also to maintain the worst-case-optimality condition of search algorithms.

In this way, researchers show that it is possible to include this type of information and support a particular type of query, "which is that of the k nearest neighbors. A graph is formed among the objects that have the metric, where each node knows which are the k nodes closest/most similar to it. The query can then ask for an object to be among the k closest/most similar to another. This allows us to ask, for example, for churches in my city whose facade resembles (as an image) one that interests me. We show that the graph of the k neighbors can be joined to the database graph in order to elegantly integrate it into the same worst-case-optimal algorithm, reusing in a certain way all the already known development on this algorithm. We show that, in several cases, we can retain worst-case-optimality, although not in all cases. We implement the new algorithm and show that it is significantly better than a basic solution. This opens the door to supporting other types of queries (by distance, as in the example of pharmacies) and other types of data (topological, geographical regions, temporal, etc.)," says the academic.

—
Source: DCC Communications