IMFD researchers present advances in search systems for graphs at SIGMOD/PODS

"Worst-Case-Optimal Similarity Joins on Graph Databases" is the title of the research presented at SIGMOD/PODS 2024 by researchers from Millennium Institute Foundational Research on Data (IMFD) researchers Benjamín Bustos, Aidan Hogan (IMFD alternate director), and Gonzalo Navarro from the University of Chile's Computer Science Department; Diego Arroyuelo and IMFD director Juan from the Catholic University's Computer Science Department; and Adrián Gómez-Brandón (University of A Coruña).

Regarding this research, Gonzalo Navarro explains: "There has been a lot of recent work on so-called graph databases, which represent information in the form of nodes (objects) and edges that connect nodes (and represent relationships between objects), because they are more flexible than relational databases. These graphs are mainly queried by providing a small pattern graph, which has nodes and edges but where the identities of the nodes and edges can be variable. Each way of assigning values to the variables so that this pattern appears in the graph is a response to the query. As a very simplified example, in a graph with information about pharmacies, a query such as (F, sells, X), (X, component, paracetamol), (X, price, P), (F, has-agreement, Fonasa), gives me the pharmacies F with an agreement with Fonasa that sell a medicine X with paracetamol, and the price P at which they sell it. Responding to these queries efficiently is not easy, and it is not uncommon for some to take seconds or even minutes. A quality standard for these algorithms is to be worst-case optimal. This type of algorithm is relatively recent, and there is still a lot of research being done on them.

In this context, the academic explains that one line of work developed at the IMFD is called multimodal graphs, "which contain more semantics than mere connectivity between nodes. This type of information is very common in real life, and to be consulted effectively, it requires enriching the language of pattern graphs with specific expressions of data types. For example, some nodes may be geographic locations and have associated location information, and I would like to be able to search for the previous query, but restricted to pharmacies within a certain radius. Or that the current time is between their opening and closing hours. Or that the price is below a certain limit."

Professor Gonzalo Navarro points out that this scientific article addresses the problem of adding metric information to objects and conditions of proximity to pattern graphs. He explains that metrics are generalizations of Euclidean distances (such as those in a plane or space), which encompass not only geographical distance but also other types of distance, such as similarity between images, audio, video, texts, etc. (which are modeled as distances in very high-dimensional spaces). "We also wanted to maintain the worst-case-optimality condition of the search algorithms," he says.

In this way, researchers show that it is possible to include this type of information and support a particular type of query, "which is that of the k nearest neighbors. A graph is formed among the objects that have the metric, where each node knows which are the k nodes closest/most similar to it. The query can then ask for an object to be among the k closest/most similar to another. This allows us to ask, for example, for churches in my city whose facade resembles (as an image) one that interests me. We show that the graph of the k neighbors can be joined to the database graph in order to elegantly integrate it into the same worst-case-optimal algorithm, reusing in a certain way all the already known development on this algorithm. We show that, in several cases, we can retain worst-case-optimality, although not in all cases. We implement the new algorithm and show that it is significantly better than a basic solution. This opens the door to supporting other types of queries (by distance, as in the example of pharmacies) and other types of data (topological, geographical regions, temporal, etc.)," says the academic.

SOURCE: Communications DCC University of Chile