IMFD researchers' work awarded at SIGMOD/PODS - Millennium Institute Foundational Research on Data

Every year, the Association for Computing Machinery (ACM), founded in 1947 as the first scientific and educational society in the field of computing, organizes the SIGMOD/PODS Conference. Today, the event is considered one of the most important international forums in data management, where researchers gather to explore new ideas, results, techniques and experiences. The best papers presented are also recognized there: in the 2023 edition, which takes place between June 18 and 23 in Seattle (USA), one of these awards goes to a study co-authored by Domagoj Vrgoč, academic of the Institute of Mathematical and Computer Engineering of the Catholic University of Chile and researcher of the Millennium Institute Foundational Research on Data (IMFD), and Renzo Angles, academic of the Department of Computer Science of the University of Talca and IMFD researcher.

According to the organizers, the study entitled "PG-Schema: Schemas for Property Graphs" was chosen as the "Best Paper of the Industry Track" for its exceptional quality, originality and contribution to the area of graph databases. It is the result of joint work by researchers from several higher education institutions, such as the University of Warsaw (Poland), the University of Bayreuth (Germany) and the University of Edinburgh (Scotland), and from companies such as Amazon Web Services, TigerGraph, Neo4j and RelationalAI, among others.

"SIGMOD/PODS is one of the largest conferences in the world and one of the most prestigious in the database field. Each year it brings together close to 2,000 participants. It has one part, SIGMOD, which covers the more practical side, and another, PODS, which focuses on the more theoretical side," says Domagoj Vrgoč, PhD in computer science from the University of Edinburgh (Scotland). As for the paper itself, he points out that the award reflects the fact that the study is a "work that has quite an impact on the industry and is done in collaboration with people from the industry. In fact, it is a paper that has more than 20 authors, which represents a very large collaboration that took a long time and solves a concrete problem that exists in the area."

Renzo Angles comments that this article was developed by members of the "Property Graph Schema Working Group" within the Linked Data Benchmark Council (LDBC): "In mid-2019, we began to discuss the characteristics of graph-based data models, and the absence of a standard way to represent their structure or schema. In this regard, the paper proposes a formalism for specifying schemas for graphs with properties, i.e., a language that allows us to precisely describe the types of nodes, edges and properties existing in a graph-based database, in addition to specifying simple and complex constraints on those types and their relationships."

The potential of the study

In the database area, there is a very important branch known as graph databases, in which data is modeled conceptually. "Every entity you want to represent, like a person, a city or a workplace, is going to be a node in your graph. And when you want to link data you put edges that tell you what the connection is between the different entities. That means it's a model that doesn't have a fixed structure; when you want to add a new entity, you simply connect it through edges," Vrgoč points out.

This characteristic means that it is not necessary to have a fixed structure, as is the case in the more classical area of this field of research, which includes relational databases. "Graph databases do not have a schema, which is understood as a description that tells you 'everything looks like this' and which in the world of relational databases does exist in a very strong way," the academic comments. By contrast, he adds, in a graph database there may be "nodes that represent People, but some include only name and country, and others only show a name and age. Therefore, it is not necessary for everything to be structured. In relational databases, on the other hand, everything must have the same attributes."

According to Vrgoč, that feature gives graph databases a lot of flexibility, but it can also create a problem "when there is a very large knowledge graph, where you do need a schema that tells you what kind of data you have."

The study helps fill that gap. "The paper is called PG-Schema because it refers to a language that allows you to define the schema for a graph database format that is widely used in the industry and is called 'property graphs.' And the work is effectively that, a language that allows you to describe, in a compact way, what kind of data I have in my database without having to show all that data. It relies on a certain syntax, develops semantics and makes it easy to write that definition."
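To make the idea concrete, here is a minimal Python sketch of a property graph whose "Person" nodes carry different properties, together with a toy schema check. It does not use PG-Schema syntax or semantics; the names `Node`, `Edge`, `NODE_TYPES`, `EDGE_TYPES` and `violations` are hypothetical and exist only for this illustration. The assumption is that a schema lists, per node type, the properties every node of that type must have, while extra properties remain allowed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str


# Toy schema (an assumption for illustration, not PG-Schema):
# required properties and their Python types for each node label.
NODE_TYPES = {
    "Person": {"name": str},
    "City": {"name": str},
}
# Allowed edge types as (source label, edge label, target label).
EDGE_TYPES = {("Person", "LIVES_IN", "City")}


def violations(nodes, edges):
    """Return a list of human-readable schema violations."""
    by_id = {n.node_id: n for n in nodes}
    problems = []
    for n in nodes:
        required = NODE_TYPES.get(n.label)
        if required is None:
            problems.append(f"{n.node_id}: unknown node type {n.label}")
            continue
        for prop, typ in required.items():
            # Extra properties are fine; only required ones are checked.
            if not isinstance(n.properties.get(prop), typ):
                problems.append(f"{n.node_id}: missing or invalid '{prop}'")
    for e in edges:
        triple = (by_id[e.source].label, e.label, by_id[e.target].label)
        if triple not in EDGE_TYPES:
            problems.append(f"{e.source}->{e.target}: edge type {triple} not allowed")
    return problems


nodes = [
    Node("p1", "Person", {"name": "Ana", "country": "Chile"}),
    Node("p2", "Person", {"name": "Luis", "age": 41}),
    Node("p3", "Person", {"age": 30}),  # no name: violates the toy schema
    Node("c1", "City", {"name": "Talca"}),
]
edges = [
    Edge("p1", "c1", "LIVES_IN"),
    Edge("c1", "p2", "LIVES_IN"),  # City -> Person is not an allowed edge type
]

problems = violations(nodes, edges)
for line in problems:
    print(line)
```

The first two Person nodes differ in their optional properties yet both conform, which mirrors the flexibility described above; the schema only flags the node without a name and the edge that runs in a direction the schema does not list.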

Vrgoč's contribution to the paper was to establish a grammar for that language: "The work I did with a subgroup of that international team, mainly with Filip Murlak (U. of Warsaw, Poland) and Wim Martens (U. of Bayreuth, Germany), was to design a base language that allows me to describe what I have in a node, what I can have in an edge, how they are linked, how my graph looks in general. Then, with the rest of the team we developed several extensions that in the end lead to this language".

"The development of the article took quite a long time because there was a discussion between the real needs raised by the members of the group working in the industry, and the theoretical foundations raised by the members belonging to academia. The final result is a schema specification language that allows representing graph schemas with different types of constraints, while respecting important theoretical conditions," says Renzo Angles, who is confident that the article will have an impact on the development of schema specification languages for graph-based database systems.

The researchers hope that, because of its potential, this language will be incorporated into a new ISO standard for graph query languages. "Our work is a proposal that serves as input to the group defining that standard, but it is not yet something that is established in the industry. For now, it is research work," says Domagoj Vrgoč.

The authors of the study are: Renzo Angles (Universidad de Talca); Angela Bonifati (Univ. of Lyon); Stefania Dumbrava (ENSIIE); George Fletcher (Eindhoven University of Technology, the Netherlands); Alastair Green (Mr); Jan Hidders (Birkbeck, University of London); Bei Li (Google); Leonid Libkin (University of Edinburgh & RelationalAI); Victor Marsault (UPEM / CNRS); Wim Martens (University of Bayreuth); Filip Murlak (University of Warsaw, Poland); Stefan Plantikow (Neo4j); Ognjen Savkovic (Free University of Bozen-Bolzano); Michael Schmidt (Amazon Web Services); Juan Sequeda (data.world); Sławek Staworko (RelationalAI); Dominik Tomaszuk (University of Bialystok); Hannes Voigt (Neo4j); Domagoj Vrgoč (Pontificia Universidad Católica de Chile); Mingxi Wu (TigerGraph Inc.); Dušan Živković (Integral Data Solutions).

Source: IMC UC