Paper co-authored by Tomasz Steifer accepted at major artificial intelligence conference

Since 1974, experts in data science, machine learning and natural language processing, among other areas, have been meeting at the European Conference on Artificial Intelligence (ECAI). The 2023 edition of the event, considered one of the three largest gatherings in the world in its area, will be held in Krakow (Poland) and includes the presentation of several papers, among them one prepared by researchers from the Institute of Mathematical and Computational Engineering (IMC) of the Catholic University of Chile and researchers from the Millennium Institute for Foundational Research on Data.

The paper accepted for ECAI 2023, to be held between September 30 and October 5, is entitled "No Agreement Without Loss: Learning and Social Choice in Peer Review". Its authors are Pablo Barceló, director of IMC UC and researcher at the Millennium Institute for Foundational Research on Data (IMFD); Mauricio Duarte, PhD in mathematics at Universidad Andrés Bello; Cristóbal Rojas, academic at IMC UC and researcher at the National Center for Artificial Intelligence (Cenia); and Tomasz Steifer, international postdoctoral fellow at IMC UC and IMFD.

As the researchers state in the presentation of the paper, in the peer review systems through which scientific studies pass before being published, reviewers are often "asked to evaluate various characteristics of the papers, such as technical quality or novelty. A score is assigned to each of the predefined characteristics and based on these, the reviewer has to provide an overall quantitative assessment." Tomasz Steifer, who holds a PhD in computer science from the Institute of Computer Science of the Polish Academy of Sciences, explains that this process is influenced by different factors, which in some cases are known as "biases" and introduce an element of arbitrariness.

For example, he notes, the so-called commensuration bias arises when reviewers "are asked to evaluate different and incomparable characteristics, such as novelty and technical correctness, and then to aggregate these incomparable scores into an overall score. This final score is often, at the same time, a recommendation to accept or reject." In this sense, the arbitrariness stems from the fact that reviewers "differ in how important different features are to them. If you submit a paper that gets excellent scores on feature A and a mediocre evaluation on feature B, you might be lucky and get a reviewer who thinks feature A is more important than feature B. But you might get unlucky and receive a bad overall score, because your reviewers prefer feature B to feature A. Still, we would like to believe that the review process is not about being lucky, but only about the quality of an article. That's why frameworks like that of Noothigattu, Shah and Procaccia offer a very tantalizing promise to improve the review system, to make it less arbitrary, fairer to authors, and better at selecting good articles."
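The effect can be illustrated with a small, purely hypothetical example (the feature names, scores and weights below are invented for illustration and are not taken from the paper): two reviewers agree on a paper's per-feature scores, but because they weight the features differently, the same paper ends up with very different overall scores.

    # Hypothetical illustration of commensuration bias: identical feature
    # scores, different reviewer weightings, diverging overall scores.
    feature_scores = {"novelty": 9, "technical_correctness": 5}

    reviewer_weights = {
        "reviewer_1": {"novelty": 0.8, "technical_correctness": 0.2},
        "reviewer_2": {"novelty": 0.2, "technical_correctness": 0.8},
    }

    for reviewer, weights in reviewer_weights.items():
        overall = sum(w * feature_scores[f] for f, w in weights.items())
        print(reviewer, round(overall, 1))
    # reviewer_1 8.2  (reads like an accept)
    # reviewer_2 5.8  (reads like a reject)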

The method of these authors, from Carnegie Mellon and Harvard universities, was already applied last year at the 36th edition of the AAAI Conference on Artificial Intelligence. The organizers' purpose was precisely to identify reviews that could exhibit a significant commensuration bias.

As Steifer comments, the strategy proposed by that research team is to "look at how different reviewers map the feature scores into an overall score and try to arrive at an aggregate assignment, a kind of 'average' assignment. The question then is how to do this aggregation, i.e., what parameters to use in their method". Normally, the IMC postdoctoral researcher points out, when a method is proposed in the area of machine learning or artificial intelligence it is tested empirically: "Peer review is different because we don't have gold-standard data to compare with. We don't really know the 'real value' of a paper, and the best information we have is what the reviewers give us. So all these authors could do is present a theoretical/mathematical argument as to why their method is good."
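In broad strokes, the "average" assignment Steifer describes can be sketched as follows. The sketch below is only a simplified illustration of that general idea, not the method analyzed in the paper: the review data is invented, and the linear mapping and squared loss used here are convenient assumptions, since the choice of loss and parameters is exactly the question the axiomatic analysis is about.

    # Simplified sketch: pool every reviewer's (feature scores -> overall
    # score) examples and fit one shared mapping to all of them.
    import numpy as np

    # Hypothetical reviews: each row holds one reviewer's feature scores for
    # one paper (e.g. novelty, technical correctness); y is that reviewer's
    # overall score for the same paper.
    X = np.array([[9.0, 5.0],
                  [6.0, 8.0],
                  [7.0, 7.0],
                  [4.0, 9.0]])
    y = np.array([8.2, 6.4, 7.0, 5.0])

    # Append a column of ones so the fitted mapping can include an offset.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

    # Least-squares fit of a single linear mapping to all reviewers' data;
    # this plays the role of the aggregate, "average" assignment.
    weights, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

    # The aggregate mapping can then score any paper's feature vector.
    new_paper = np.array([8.0, 6.0, 1.0])
    print(weights, float(new_paper @ weights))

Whether such a fitted mapping behaves well, and under which choice of loss and parameters it does, is precisely the kind of question the theoretical arguments discussed next are meant to settle.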

In Noothigattu, Shah and Procaccia's method, the authors put forward three axioms. "These can be considered as the minimal properties that an aggregation method should have, and they showed that among a variety of different parameters only one choice can satisfy all the axioms at the same time," says the IMC and IMFD postdoc.

Steifer points out that the Carnegie Mellon and Harvard researchers presented some theoretical arguments explaining why their method is good if the parameters are well chosen. "Of course, in doing so, they made some assumptions that turned out to be oversimplified when checked against reality," he says. In the paper accepted at ECAI 2023, the IMC postdoc and his coauthors focused on "revisiting this theory and analyzing whether it really makes sense, once we change its assumptions to ones that are closer to what happens in real life. What we saw, then, is that their method really doesn't have all the good theoretical properties that its authors claimed it to have."

In the paper, Steifer and his co-authors comment that the "results obtained are not only of theoretical interest, since the method in question was put into practice by the organizers of a major conference on artificial intelligence". Thus, looking to the future, the IMC UC and IMFD postdoc points out that several groups at major universities are currently working on methods designed to deal with the specific problems of peer review.

"I think we are just beginning to understand how and to what extent mathematical and computational methods can help us improve peer review. I am optimistic that, over time and as our understanding grows, we will be able to make this process better, fairer and ensure that good research is published and spread," says Tomasz Steifer

Source: IMC UC Communications