Filtrar por
Capítulo de libro
Revista científica
Proyectos emblemáticos
Creación de nuevos lenguajes de consulta para redes de información
Datos para el estudio de problemas sociales complejos
Extracción eficiente de datos en escenarios de alta complejidad
Generación de estructuras de información robusta
Inteligencia artificial con explicación
Knowledge Graphs
Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann. (

In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graph

GENE: Graph generation conditioned on named entities for polarity and controversy detection in social media
Marcelo Mendoza, Denis Parra, Álvaro Soto. Information Processing & Management, Volume 57, Issue 6, 2020, 102366, ISSN 0306-4573.

Abstract: Many of the interactions between users on social networks are controversial, specially in polarized environments. In effect, rather than producing a space for deliberation, these environments foster the emergence of users that disqualify the position of others. On news sites, comments on the news are characterized by such interactions. This is detrimental to the construction of a deliberative and democratic climate, stressing the need for automatic tools that can provide an early detection of polarization and controversy. We introduce GENE (graph generation conditioned on named entities), a representation of user networks conditioned on the named entities (personalities, brands, organizations) which users comment upon. GENE models the leaning that each user has concerning entities mentioned in the news. GENE graphs is able to segment the user network according to their polarity. Using the segmented network, we study the performance of two controversy indices, the existing Random Walks Controversy (RWC) and another one we introduce, Relative Closeness Controversy (RCC). These indices measure the interaction between the network’s poles providing a metric to quantify the emergence of controversy. To evaluate the performance of GENE, we model the network of users of a popular news site in Chile, collecting data in an observation window of more than three years. A large-scale evaluation using GENE, on thousands of news, allows us to conclude that over 60% of user comments have a predictable polarity. This predictability of the user interaction scenario allows both controversy indices to detect a controversy successfully. In particular, our introduced RCC index shows satisfactory performance in the early detection of controversies using partial information collected during the first hours of the news event, with a sensitivity to the target class exceeding 90%.

In-Database Graph Analytics with Recursive SPARQL
Hogan A., Reutter J.L., Soto A. (2020) In-Database Graph Analytics with Recursive SPARQL. In: Pan J.Z. et al. (eds) The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12506. Springer, Cham.

Works on knowledge graphs and graph-based data management often focus either on graph query languages or on frameworks for graph analytics, where there has been little work in trying to combine both approaches. However, many real-world tasks conceptually involve combinations of these approaches: a graph query can be used to select the appropriate data, which is then enriched with analytics, and then possibly filtered or combined again with other data by means of a query language. In this paper we propose a language that is well-suited for both graph querying and analytical tasks. We propose a minimalistic extension of SPARQL to allow for expressing analytical tasks over existing SPARQL infrastructure; in particular, we propose to extend SPARQL with recursive features, and provide a formal syntax and semantics for our language. We show that this language can express key analytical tasks on graphs (in fact, it is Turing complete). Moreover, queries in this language can also be compiled into sequences of iterations of SPARQL update statements. We show how procedures in our language can be implemented over off-the-shelf SPARQL engines, with a specialised client that can leverage database operations to improve the performance of queries. Results for our implementation show that procedures for popular analytics currently run in seconds or minutes for selective sub-graphs (our target use-case).

Global Vertex Similarity for Large-Scale Knowledge Graphs
Marco Caballero and Aidan Hogan; DCC, Universidad de Chile & IMFD. CEUR Workshop Proceedings.

Abstract. We investigate global measures of vertex similarity for knowledge graphs. While vertex similarity has been explored in the context of directed, unlabelled graphs, measures based on recursive algorithms or learning frameworks can be costly to compute, assume labelled data, and/or provide poorly-interpretable results. Knowledge graphs further imply unique challenges for vertex similarity in terms of scale and diversity. We thus propose and explore global measures of vertex similarity for Knowledge Graphs that (i) are unsupervised, (ii) offer explanations of similarity results; (iii) take into consideration edge labels; and (iv) are robust in terms of redundant or interdependent information. Given that these measures can still be costly to compute precisely, we propose an approximation strategy that enables computation at scale. We compare our measures with a recursive measure (SimRank) for computing vertex similarity over subsets of Wikidata.

Versioned Queries over RDF Archives: All You Need is SPARQL?
Ignacio Cuevas and Aidan Hogan; Department of Computer Science, University of Chile & IMFD Chile.

Abstract. We explore solutions for representing archives of versioned RDF data using the SPARQL standard and off-the-shelf engines. We consider six representations of RDF archives based on named graphs, and describe how input queries can be automatically rewritten to return solutions for a particular version, or solutions that change between versions. We evaluate these alternatives over an archive of 8 weekly versions of Wikidata and 146 queries using Virtuoso as the SPARQL engine.

Suggesting Citations for Wikidata Claims basedon Wikipedia’s External References
Paolo Curotto and Aidan Hogan. CEUR Workshop Proceedings.

Abstract. Given a Wikidata claim, we explore automated methods for locating references that support that claim. Our goal is to assist human editors in referencing claims, and thus increase the ratio of referenced claims in Wikidata. As an initial approach, we mine links from the references section of English Wikipedia articles, download and index their content, and use standard relevance-based measures to find supporting documents. We consider various forms of search phrasings, as well as different scopes of search. We evaluate our methods in terms of the coverage of reference documents collected from Wikipedia. We also develop a gold standard of sample items for evaluating the relevance of suggestions. Our results in general reveal that the coverage of Wikipedia reference documents for claims is quite low, but where a reference document is available, we can often suggest it within the first few results.

Laconic Image Classification: Human vs. Machine Performance
Javier Carrasco, Aidan Hogan, Jorge Pérez. CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge ManagementOctober 2020 Pages 115–124

We propose laconic classification as a novel way to understand and compare the performance of diverse image classifiers. The goal in this setting is to minimise the amount of information (aka. entropy) required in individual test images to maintain correct classification. Given a classifier and a test image, we compute an approximate minimal-entropy positive image for which the classifier provides a correct classification, becoming incorrect upon any further reduction. The notion of entropy offers a unifying metric that allows to combine and compare the effects of various types of reductions (e.g., crop, colour reduction, resolution reduction) on classification performance, in turn generalising similar methods explored in previous works. Proposing two complementary frameworks for computing the minimal-entropy positive images of both human and machine classifiers, in experiments over the ILSVRC test-set, we find that machine classifiers are more sensitive entropy-wise to reduced resolution (versus cropping or reduced colour for machines, as well as reduced resolution for humans), supporting recent results suggesting a texture bias in the ILSVRC-trained models used. We also find, in the evaluated setting, that humans classify the minimal-entropy positive images of machine models with higher precision than machines classify those of humans.

Knowledge Graphs: Research Directions
Hogan A. (2020) Knowledge Graphs: Research Directions. In: Manna M., Pieris A. (eds) Reasoning Web. Declarative Artificial Intelligence. Reasoning Web 2020. Lecture Notes in Computer Science, vol 12258. Springer, Cham.

In these lecture notes, we provide an overview of some of the high-level research directions and open questions relating to knowledge graphs. We discuss six high-level concepts relating to knowledge graphs: data models, queries, ontologies, rules, embeddings and graph neural networks. While traditionally these concepts have been explored by different communities in the context of graphs, more recent works have begun to look at how they relate to one another, and how they can be unified. In fact, at a more foundational level, we can find some surprising relations between the different concepts. The research questions we explore mostly involve combinations of these concepts.

Web Ontology Language
Hogan A. (2020) Web Ontology Language. In: The Web of Data. Springer, Cham.

In this chapter, we provide a detailed primer on the second version of the Web Ontology Language (OWL 2) standard. We first motivate the need for such a standard, discussing the role and importance of ontologies on the Web. We then describe how ontology languages, which themselves can be formally defined through model theory, can subsequently be used to formally define ontologies. Thereafter we discuss the OWL vocabulary used to define the semantics of classes, properties, individuals, and datatypes within ontologies. We cover some of the main reasoning tasks for ontologies and the applications in which they are used. We discuss how these core reasoning tasks are undecidable for the full OWL (2) language and outline the sub-languages (aka. profiles) proposed by the standard that allow for more efficient reasoning procedures. We conclude by reflecting on the importance of having expressive ontologies on the Web of Data, and discuss open challenges.

The Web of Data
The Web of Data. Author: Aidan Hogan, Department of Computer ScienceUniversidad de ChileSantiago de Chile, IMFD.

This book concisely brings together the key standards and best practices relating to modelling, querying, validating and linking machine-readable data and semantics on the Web. Alongside practical examples and formal definitions, the book shows how these standards contribute to – and have been used thus far on – the «Web of Data»: a machine readable evolution of the Web marked by increased automation, enabling powerful Web applications capable of discovering, cross-referencing, and organising data from numerous websites in a matter of seconds. The book is divided into nine chapters, the first of which highlights the fundamental shortcomings of the current Web that illustrate the need for increased machine readability. The next chapter outlines the core concepts of the “Web of Data”, discussing use-cases on the Web where they have already been deployed. “Resource Description Framework (RDF)” describes the graph-structured data model proposed by the Semantic Web community as a common data model for the Web. The chapter on “RDF Schema (RDFS) and Semantics” presents a lightweight ontology language used to define an initial semantics for RDF graphs. In turn, the chapter “Web Ontology Language (OWL)” elaborates on a much more expressive ontology language built upon RDFS. In “SPARQL Query Language” a language for querying and updating RDF graphs is described. “Shape Constraints and Expressions (SHACL/ShEx)” introduces two languages for describing the expected structure of – and expressing constraints over – RDF graphs for the purposes of validation. “Linked Data” discusses the principles and best practices by which interlinked (RDF) data can be published on the Web, and how they have been adopted. The final chapter highlights open problems and concludes with a general discussion on the future of the Web of Data. The book is intended for students, researchers and advanced practitioners interested in learning more about the Web of Data, and about closely related topics such as the Semantic Web, Knowledge Graphs, Linked Data, Graph Databases, Ontologies, etc. Offering a range of accessible examples and exercises, it can be used as a textbook for students and other newcomers to the field. It can also serve as a reference handbook for researchers and developers, as it offers up-to-date details on key standards (RDF, RDFS, OWL, SPARQL, SHACL, ShEx, RDB2RDF, LDP), along with formal definitions and references to further literature. The associated website offers a wealth of complementary material, including solutions to the exercises, slides for classes, interactive examples, and a section for comments and questions.

SPARQL Query Language
Hogan A. (2020) SPARQL Query Language. In: The Web of Data. Springer, Cham.

This chapter provides a detailed introduction to the SPARQL Protocol and RDF Query Language (SPARQL 1.1): the standard query language for RDF. After some initial motivation, we delve into the features of the query language, illustrated with concrete examples. We then formally define the semantics of these query features. We next discuss how federated queries can be used to evaluate queries over multiple remote sources on the Web. We detail the SPARQL Update language, which allows for modifying the data indexed by a SPARQL query service. We introduce SPARQL Entailment Profiles, which allow for query results to consider entailments, including support for RDF, RDFS and OWL semantics. We further discuss the HTTP-based protocol by which requests can be issued to a SPARQL service over the Web, as well as the SPARQL Service Description vocabulary, which can be used to describe and advertise the features supported by such services. We conclude by discussing the importance of SPARQL for the Web of Data, the key research directions that are currently being explored, as well as open challenges.

Shape Constraints and Expressions
Hogan A. (2020) Shape Constraints and Expressions. In: The Web of Data. Springer, Cham.

In this chapter, we introduce two languages for describing shapes and constraints for RDF graphs, namely the Shapes Constraint Language (SHACL) and the Shape Expressions Language (ShEx 2.1). Both languages allow for defining constraints over RDF graphs in terms of what data are expected, what data are obligatory, what data are allowed, and what data are disallowed. This in turn allows RDF graphs to be validated with respect to the specified constraints. We first look at SHACL, describing the SHACL-Core fragment and the constraints it allows. We then discuss how SHACL-SPARQL allows for further constraints to be expressed using SPARQL query syntax. Turning to ShEx, we describe its syntaxes, and how it differs from SHACL. We outline and provide a semantics for an abstract shapes syntax that generalises SHACL and ShEx. We conclude with a general discussion of the role of shapes languages on the Web of Data, as well as open challenges.

Resource Description Framework
Hogan A. (2020) Resource Description Framework. In: The Web of Data. Springer, Cham.

This chapter provides a detailed primer for the Resource Description Framework (RDF 1.1) standard, proposed as a common data model for publishing and exchanging structured data on the Web. We first motivate the need for a data model like RDF. We then describe the types of terms used in RDF: the basic building blocks of the framework. We discuss how these terms can be combined to make coherent statements in the form of RDF triples, and how triples form graphs and datasets. Thereafter we discuss the RDF vocabulary: a built-in set of terms used for modeling more complex data, such as complex relations and ordered lists. Finally, we give an overview of the different syntaxes by which RDF can be serialized and communicated.

RDF Schema and Semantics
Hogan A. (2020) RDF Schema and Semantics. In: The Web of Data. Springer, Cham.

This chapter presents an in-depth primer for the RDF Schema (RDFS 1.1) standard, which is primarily used to define a lightweight semantics for the classes and properties used in RDF graphs. After an initial motivation and overview, we discuss the RDFS vocabulary, and how it can be used to define sub-classes, sub-properties, domain and ranges, amongst other types of definitions. We then describe in detail how the semantics of RDF(S) can be formalised in a model-theoretic way, discussing key concepts such as interpretations, models, satisfiability and entailment. We introduce various semantics for RDF(S), including the simple semantics, D semantics, RDF semantics, and the RDFS semantics. We conclude the chapter by discussing how rules can be used to support entailment under such semantics.

Linked Data
Hogan A. (2020) Linked Data. In: The Web of Data. Springer, Cham.

This chapter motivates, introduces and describes Linked Data, which centres around a concise set of principles by which data can be published and interlinked on the Web, and by which a Web of Data can ultimately be formed. We first discuss the core Linked Data principles, which espouse the use of HTTP IRIs to identify the entities described in data, returning a machine-readable description of the entity (typically RDF) when its corresponding IRI is looked up on the Web. We then discuss some further best practices for publishing data conformant with the Linked Data principles in a way that enhances interoperability. We discuss the Linking Open Data (LOD) project founded on the idea of publishing Open Data on the Web in a standard, machine-readable fashion using Linked Data; we describe the most prominent datasets and vocabularies that have results from this initiative. We then discuss tools and techniques for converting legacy data to RDF, discovering links, and hosting Linked Data. We subsequently discuss the Linked Data Platform: a standard that outlines the protocols and resources needed to build a new generation of read–write Linked Data applications. We conclude the chapter with a discussion of open challenges yet to be addressed in the context of Linked Data.

Explaining VQA predictions using visual grounding and a knowledge base
Felipe Riquelme, Alfredo De Goyeneche, Yundong Zhang, Juan Carlos Niebles, Alvaro Soto. Image and Vision Computing, Volume 101, 2020, 103968, ISSN 0262-8856.

Abstract: In this work, we focus on the Visual Question Answering (VQA) task, where a model must answer a question based on an image, and the VQA-Explanations task, where an explanation is produced to support the answer. We introduce an interpretable model capable of pointing out and consuming information from a novel Knowledge Base (KB) composed of real-world relationships between objects, along with labels mined from available region descriptions and object annotations. Furthermore, this model provides a visual and textual explanations to complement the KB visualization. The use of a KB brings two important consequences: enhance predictions and improve interpretability. We achieve this by introducing a mechanism that can extract relevant information from this KB, and can point out the relations better suited for predicting the answer. A supervised attention map is generated over the KB to select the relevant relationships from it for each question-image pair. Moreover, we add image attention supervision on the explanations module to generate better visual and textual explanations. We quantitatively show that the predicted answers improve when using the KB; similarly, explanations improve with this and when adding image attention supervision. Also, we qualitatively show that the KB attention helps to improve interpretability and enhance explanations. Overall, the results support the benefits of having multiple tasks to enhance the interpretability and performance of the model.

Fine-Grained Entity Linking
Henry Rosales-Méndez; Aidan Hogan; Barbara Poblete. IMFD, Chile. Department of Computer Science, University of Chile, Chile. Journal of Web Semantics, Volume 65, December 2020, 100600.

The Entity Linking (EL) task involves linking mentions of entities in a text with their identifier in a Knowledge Base (KB) such as Wikipedia, BabelNet, DBpedia, Freebase, Wikidata, YAGO, etc. Numerous techniques have been proposed to address this task down through the years. However, not all works adopt the same convention regarding the entities that the EL task should target; for example, while some EL works target common entities like “interview” appearing in the KB, others only target named entities like “Michael Jackson”. The lack of consensus on this issue (and others) complicates research on the EL task; for example, how can the performance of EL systems be evaluated and compared when systems may target different types of entities? In this work, we first design a questionnaire to understand what kinds of mentions and links the EL research community believes should be targeted by the task. Based on these results we propose a fine-grained categorization scheme for EL that distinguishes different types of mentions and links. We propose a vocabulary extension that allows to express such categories in EL benchmark datasets. We then relabel (subsets of) three popular EL datasets according to our novel categorization scheme, where we additionally discuss a tool used to semi-automate the labeling process. We next present the performance results of five EL systems for individual categories. We further extend EL systems with Word Sense Disambiguation and Coreference Resolution components, creating initial versions of what we call Fine-Grained Entity Linking (FEL) systems, measuring the impact on performance per category. Finally, we propose a configurable performance measure based on fuzzy sets that can be adapted for different application scenarios Our results highlight a lack of consensus on the goals of the EL task, show that the evaluated systems do indeed target different entities, and further reveal some open challenges for the (F)EL task regarding more complex forms of reference for entities.

Differentiable adaptive computation time for visual reasoning
C. Eyzaguirre and Á. Soto, "Differentiable Adaptive Computation Time for Visual Reasoning," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12814-12822,

Abstract: This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable. Our method can be used in conjunction with many networks; in particular, we study its application to the widely know MAC architecture, obtaining a significant reduction in the number of recurrent steps needed to achieve similar accuracies, therefore improving its performance to computation ratio. Furthermore, we show that by increasing the maximum number of steps used, we surpass the accuracy of even our best non-adaptive MAC in the CLEVR dataset, demonstrating that our approach is able to control the number of steps without significant loss of performance. Additional advantages provided by our approach include considerably improving interpretability by discarding useless steps and providing more insights into the underlying reasoning process. Finally, we present adaptive computation as an equivalent to an ensemble of models, similar to a mixture of expert formulation. Both the code and the configuration files for our experiments are made available to support further research in this area.

Information extraction meets the Semantic Web: A survey
Martinez-Rodriguez, Jose L., Cinvestav Tamaulipas, Ciudad Victoria, Mexico; Hogan, Aidan, IMFD Chile; Department of Computer Science, University of Chile, Chile; Lopez-Arevalo, Ivana, Cinvestav Tamaulipas, Ciudad Victoria, Mexico. Journal Semantic Web, vol. 11, no. 2, pp. 255-335, 2020.

Abstract: We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.

CompactNets: Compact Hierarchical Compositional Networks for Visual Recognition
Hans Lobel, Alvaro Soto, Pontificia Universidad Católica de Chile; René Vidal, The Johns Hopkins University. Computer Vision and Image Understanding, Volume 191, February 2020, 102841.

Abstract: CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models, consisting on using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function, that induces on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude less parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches, in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.

The Semantic Web: Two decades on
Hogan, Aidan; IMFD; Department of Computer Science, University of Chile, Santiago, Chile. Journal Semantic Web, vol. 11, no. 1, pp. 169-185, 2020.

Abstract: More than two decades have passed since the establishment of the initial cornerstones of the Semantic Web. Since its inception, opinions have remained divided regarding the past, present and potential future impact of the Semantic Web. In this paper – and in light of the results of over two decades of development on both the Semantic Web and related technologies – we reflect on the current status of the Semantic Web, the impact it has had thus far, and future challenges. We first review some of the external criticism of this vision that has been put forward by various authors; we draw together the individual critiques, arguing both for and against each point based on the current state of adoption. We then present the results of a questionnaire that we have posed to the Semantic Web mailing list in order to understand respondents’ perspective(s) regarding the degree to which the original Semantic Web vision has been realised, the impact it can potentially have on the Web (and other settings), its success stories thus far, as well as the degree to which they agree with the aforementioned critiques of the Semantic Web in terms of both its current state and future feasibility. We conclude by reflecting on future challenges and opportunities in the area.

Socialized for News Media Use: How Family Communication, Information-Processing Needs, and Gratifications Determine Adolescents’ Exposure to News
Valenzuela S, Bachmann I, Aguilar M. Socialized for News Media Use: How Family Communication, Information-Processing Needs, and Gratifications Determine Adolescents’ Exposure to News. Communication Research. 2019;46(8):1095-1118. doi:10.1177/0093650215623833

Adolescence is a key period in the development of individuals’ news habits, but little is known about the processes involved in the process of news media socialization. This study proposes an integrated model in which the influence of family communication on motivations and behaviors of adolescents in relation to news consumption occurs through the development of personality traits related to information processing (namely, need for cognition and need to evaluate). Structural equation modeling of data from a representative survey of 2,273 adolescents, aged 13 to 17, provide support for the theorized model, such that concept-oriented communication within families is associated to news exposure indirectly, via personality traits and motivations. Thus, the study provides an initial assessment of one way children are socialized to become news enthusiasts and news avoiders. It also provides empirical evidence that information-processing traits are influenced by family communication patterns, confirming what hitherto was theoretical speculation.

Improved Compressed String Dictionaries
Nieves R. Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, and Gonzalo Navarro. 2019. Improved Compressed String Dictionaries. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 29–38. DOI:

We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.


Motif-driven Retrieval of Greek Painted Pottery
Lengauer, S., Komar, A., Labrada, A., Karl, S., Trinkl, E., Preiner, R., ... Schreck, T. (2019). Motif-driven Retrieval of Greek Painted Pottery. In S. Rizvic, & K. Rodriguez Echavarria (Eds.), Eurographics Workshop on Graphics and Cultural Heritage (pp. 91-98). Eurographics - European Association for Computer Graphics.

The analysis of painted pottery is instrumental for understanding ancient Greek society and human behavior of past cultures in Archaeology. A key part of this analysis is the discovery of cross references to establish links and correspondences. However, due to the vast amount of documented images and 3D scans of pottery objects in today’s domain repositories, manual search is very time consuming. Computer aided retrieval methods are of increasing importance. Mostly, current retrieval systems for this kind of cultural heritage data only allow to search for pottery of similar vessel’s shape. However, in many cases important similarity cues are given by motifs painted on these vessels. We present an interactive retrieval system that makes use of this information to allow for a motif-driven search in cultural heritage repositories. We address the problem of unsupervised motif extraction for preprocessing and the shape-based similarity search for Greek painted pottery. Our experimental evaluation on relevant repository data demonstrates effectiveness of our approach on examples of different motifs of interests.

Covering a Set of Points with k Bounding Boxes
C. S. Sepúlveda, A. Rodríguez and D. Seco, "Covering a Set of Points with k Bounding Boxes," 2019 38th International Conference of the Chilean Computer Science Society (SCCC), Concepcion, Chile, 2019, pp. 1-6, DOI:

Covering a set of points with k orthogonal bounding boxes is useful for implementing Spatio-temporal index structures that are built from a given dataset. In this work, we deal with the problem of covering a set of points with k-parallel axis boxes, under the restriction that the total area enclosed by the boxes must be minimized. To achieve this, we present a novel algorithm that, using dynamic programming techniques, finds the optimal solution for covering a set of points with k-bounding boxes where the total sum of the areas of the boxes is minimum. This is compared with the process of generating k-bounding boxes every l units of distance, achieving an improvement of about 50% of the unuseful area covered


Compressed Data Structures for Astronomical Content-Aware Resource Search
M. Araya, D. Arroyuelo, C. Saldías, and M. Solar, "Compressed Data Structures for Astronomical Content-Aware Resource Search," 2019 38th International Conference of the Chilean Computer Science Society (SCCC), Concepcion, Chile, 2019, pp. 1-8, doi:

We introduce an efficient approach that aims at supporting content-based queries on the Chilean Virtual Observatory. In particular we are interested in retrieving relevant information from virtual-observatory tables. This introduces several challenges that make the information-retrieval process harder. We define an algorithm that uses a compressed data structure to obtain the count of the number of occurrences of a string query within each column of a table. This kind of query has been used in the literature for faceted and semantic search as well as for retrieving information from web tables. This is in order to improve search effectiveness. We show that using only 15%-25% the space of a table our approach contains the table data (and hence the table can be deleted) and is able to answer queries efficiently in a few milliseconds.


A Compact Rank/Select Data Structure for the Streaming Model
N. González and D. Arroyuelo, "A Compact Rank/Select Data Structure for the Streaming Model," 2019 38th International Conference of the Chilean Computer Science Society (SCCC), Concepcion, Chile, 2019, pp. 1-7, doi:

For a sorted set S from a universe [1..u] received under the streaming model (i.e., elements are received one at a time, in sorted order), such that at a given time it contains n elements {x 1 , . . . , x n }, and whose characteristic bit vector is C S = 0(σ 1 )11···10(σ 2 )11···1 · · · 0(σ g )11···1 (i.e., the set elements are actually arranged in g <; n intervals of size ≥ 1), we propose a compact data structure that answers operations select and rank in Θ(lg(g/ lg g)) worst-case time, and append in O(1) amortized time, using 2g lg u-n/g +g lg n/g +o(g lg lg g) bits of space. The structure is suitable in cases where g ≤ n/2.


Fine-Grained Evaluation for Entity Linking
Henry Rosales-Méndez, Aidan Hogan, Barbara Poblete. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). November 2019. Pages: 718–727. DOI: 10.18653/v1/D19-1066.

The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with an unambiguous identifier in a Knowledge Base. While much work has been done on the topic, we first present the results of a survey that reveal a lack of consensus in the community regarding what forms of mentions in a text and what forms of links the EL task should consider. We argue that no one definition of the Entity Linking task fits all, and rather propose a fine-grained categorization of different types of entity mentions and links. We then re-annotate three EL benchmark datasets – ACE2004, KORE50, and VoxEL – with respect to these categories. We propose a fuzzy recall metric to address the lack of consensus and conclude with fine-grained evaluation results comparing a selection of online EL systems.

CompactNets: Compact Hierarchical Compositional Networks for Visual Recognition
Hans Lobel, René Vidal, Alvaro Soto. Computer Vision and Image Understanding. Volume 191, February 2020, 102841. Available online October 28, 2019.

CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models, consisting on using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function, that induces on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude less parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches, in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.

SHACL2SPARQL: Validating a SPARQL Endpoint against Recursive SHACL Constraints
Julien Corman, Fernando Florenzano, Juan L. Reutter, Ognjen Savkovic. Free University of Bozen-Bolzano, Bolzano, Italy; PUC Chile and IMFD Chile. Proceedings of the ISWC 2019 Satellite Tracks, 18th International Semantic Web Conference (ISWC, October 26-30, 2019),

The article presents SHACL2SPARQL, a tool that validates an RDF graph stored as a SPARQL endpoint against possibly recursive SHACL constraints. It is based on the algorithm proposed in [3]. This implementation improves upon the original algorithm with a wider range of natively supported constraint operators, SPARQL query optimization techniques, and a mechanism to explain invalid targets.


Estimating the Dynamics of SPARQL Query Results Using Binary Classification
Alberto Moya Loustaunau and Aidan Hogan. IMFD; DCC, University of Chile. CEUR Workshop Proceedings, Vol. 2496. October 26-30, 2019.

We address the problem of estimating when the results of an input SPARQL query over dynamic RDF datasets will change. We evaluate a framework that extracts features from the query and/or from past versions of the target dataset and inputs them into binary classifiers to predict whether or not the results for a query will change at a fixed point in the near future. For this evaluation, we create a gold standard based on 23 versions of Wikidata and a curated collection of 221 SPARQL queries. Our results show that the quality of predictions possible using (only) features based on the query structure and lightweight statistics of the predicate dynamics – though capable of beating a random baseline – are not competitive with results obtained using (more costly to derive) knowledge of the complete historical changes in the query results.


Distributed clustering of text collections
J. Zamora, H. Allende-Cid and M. Mendoza, "Distributed Clustering of Text Collections," in IEEE Access, vol. 7, pp. 155671-155685, 2019, doi: 10.1109/ACCESS.2019.2949455

Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists in distributing the data along with several computers that partially solve the undertaken problem. Finally, these partial answers are integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that work with disjoint partitions of data. Our algorithm produces a global clustering solution that achieves a competitive performance regarding centralized approaches. The algorithm works effectively with high dimensional data, being advisable for document clustering tasks. Experimental results over five data sets show that our proposal is competitive in terms of quality performance measures when compared to state of the art methods.

Applying Self-attention for Stance Classification
Bugueño M., Mendoza M. (2019) Applying Self-attention for Stance Classification. In: Nyström I., Hernández Heredia Y., Milián Núñez V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science, vol 11896. Springer, Cham.

Stance classification is the task of automatically identify the user’s positions about a specific topic. The classification of stance may help to understand how people react to a piece of target information, a task that is interesting in different areas as advertising campaigns, brand analytics, and fake news detection, among others. The rise of social media has put into the focus of this task the classification of stance in online social networks. A number of methods have been designed for this purpose showing that this problem is hard and challenging. In this work, we explore how to use self-attention models for stance classification. Instead of using attention mechanisms to learn directly from the text we use self-attention to combine different baselines’ outputs. For a given post, we use the transformer architecture to encode each baseline output exploiting relationships between baselines and posts. Then, the transformer learns how to combine the outputs of these methods reaching a consistently better classification than the ones provided by the baselines. We conclude that self-attention models are helpful to learn from baselines’ outputs in a stance classification task.

Path Queries on Functions
Gagie, T., He, M., & Navarro, G. (2019). Path queries on functions. Theoretical Computer Science, 770, 34–50. DOI

Let f:[1..n]→[1..n] be a function, and ℓ:[1..n]→[1..σ] indicate a label assigned to each element of the domain. We design several compact data structures that answer various kinds of summary queries on the labels of paths in f. For example, we can find either the minimum label in fk(i) for a given i and any k≥0 in a given range [k1..k2], or the minimum label in f−k(i) for a given i and k>0, using nlg⁡n+nlg⁡σ+o(nlg⁡n) bits and time O(α(n)), the inverse Ackermann function. Within similar space we can count, in time O(lg⁡n/lg⁡lg⁡n), the number of labels within a range, and report each element with such labels in O(lg⁡n/lg⁡lg⁡n) additional time. Several other tradeoffs and possible queries are considered, such as selection, top-r queries and τ-majorities. Finally, we consider queries that allow us navigate on the graph of the function, such as the nearest common successor of two elements, or the nearest successor or predecessor of an element within a range of labels.


Validating SHACL Constraints over a SPARQL Endpoint
Corman J., Florenzano F., Reutter J.L., Savković O. (2019) Validating Shacl Constraints over a Sparql Endpoint. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11778. Springer, Cham.

SHACL (Shapes Constraint Language) is a specification for describing and validating RDF graphs that has recently become a W3C recommendation. While the language is gaining traction in the industry, algorithms for SHACL constraint validation are still at an early stage. A first challenge comes from the fact that RDF graphs are often exposed as SPARQL endpoints, and therefore only accessible via queries. Another difficulty is the absence of guidelines about the way recursive constraints should be handled. In this paper, we provide algorithms for validating a graph against a SHACL schema, which can be executed over a SPARQL endpoint. We first investigate the possibility of validating a graph through a single query for non-recursive constraints. Then for the recursive case, since the problem has been shown to be NP-hard, we propose a strategy that consists in evaluating a small number of SPARQL queries over the endpoint, and using the answers to build a set of propositional formulas that are passed to a SAT solver. Finally, we show that the process can be optimized when dealing with recursive but tractable fragments of SHACL, without the need for an external solver. We also present a proof-of-concept evaluation of this last approach.


BTC-2019: The 2019 Billion Triple Challenge Dataset
Herrera JM., Hogan A., Käfer T. (2019) BTC-2019: The 2019 Billion Triple Challenge Dataset. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11779. Springer, Cham.

Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawed from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they would become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version – BTC-2019 – containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level-domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to gain insights into the content of BTC-2019. We discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.



RDF Explorer: A Visual SPARQL Query Builder
Vargas H., Buil-Aranda C., Hogan A., López C. (2019) RDF Explorer: A Visual SPARQL Query Builder. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11778. Springer, Cham. DOI

Despite the growing popularity of knowledge graphs for managing diverse data at large scale, users who wish to pose expressive queries against such graphs are often expected to know (i) how to formulate queries in a language such as SPARQL, and (ii) how entities of interest are described in the graph. In this paper we propose a language that relaxes these expectations; the language’s operators are based on an interactive graph-based exploration that allows non-expert users to simultaneously navigate and query knowledge graphs; we compare the expressivity of this language with SPARQL. We then discuss an implementation of this language that we call RDF Explorer and discuss various desirable properties it has, such as avoiding interactions that lead to empty results. Through a user study over the Wikidata knowledge-graph, we show that users successfully complete more tasks with RDF Explorer than with the existing Wikidata Query Helper, while a usability questionnaire demonstrates that users generally prefer our tool and self-report lower levels of frustration and mental effort.


A Worst-Case Optimal Join Algorithm for SPARQL
Hogan A., Riveros C., Rojas C., Soto A. (2019) A Worst-Case Optimal Join Algorithm for SPARQL. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11778. Springer, Cham. DOI

Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, wherein this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.



A Call to Contextualize Public Opinion-Based Research in Political Communication
Hernando Rojas & Sebastián Valenzuela (2019) A Call to Contextualize Public Opinion-Based Research in Political Communication, Political Communication, 36:4, 652-659, DOI: 10.1080/10584609.2019.1670897

For those of us who regularly conduct public opinion research outside of the United States and Europe, it is customary to have to explain whether our findings are “real,” that is, generalizable relationships that advance theory, or some kind of contextual artifact. Infamous Reviewer 2 will ask for an explanation of how context might be affecting the relationships that we are describing, and while it might be irritating to do so, in this case, Reviewer 2 is right. The issue of course is not having to explain how contexts matter, but instead why scholarship examining the US, or certain western countries, is not consistently subject to the same task. In this piece, we advocate for contextualizing public opinion research in all cases, because of course relationships among variables are always context/historic dependent. Rather than being a theoretical shortcoming, we argue that this becomes a theoretical strength: being able to identify the conditions under which our proposed relationships hold and those in which they do not. So, for example, rather than just saying that news consumption is positively related with political participation, as scholars including ourselves have been doing for years, we need to make explicit the news construction conditions under which this is the case, the participatory repertoire being considered, and normative implications of our claims. To engage contextualization, cross-national, cross cultural, cross group and historical comparisons are particularly useful. To build our case for the increasing need for contextualization in political communication research, we will first examine some early comparative research, then we will show some current problematic comparisons, and finally will end with some concluding remarks of the challenges for the field that lie ahead and the benefits of a contextual approach.


Implementing the Topological Model Succinctly
Fuentes-Sepúlveda J., Navarro G., Seco D. (2019) Implementing the Topological Model Succinctly. In: Brisaboa N., Puglisi S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. DOI

We show that the topological model, a semantically rich standard to represent GIS data, can be encoded succinctly while efficiently answering a number of topology-related queries. We build on recent succinct planar graph representations so as to encode a model with m edges within 4m+o(m)4m+o(m) bits and answer various queries relating nodes, edges, and faces in o(log log m)o(log ⁡log ⁡m) time, or any time in ω(logm)ω(log⁡m) for a few complex ones.


A Practical Alphabet-Partitioning Rank/Select Data Structure
Arroyuelo D., Sepúlveda E. (2019) A Practical Alphabet-Partitioning Rank/Select Data Structure. In: Brisaboa N., Puglisi S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham.

This paper proposes a practical implementation of an alphabet-partitioning compressed data structure, which represents a string within compressed space and supports the fundamental operations rank and select efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach (which is one of the most efficient approaches in practice). In particular, the time for operation select can be reduced by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), and the distributed-computation processing of rank and select operations. As far as we know, this is the first study about the support of rank/select operations on a distributed-computing environme03


Adaptive Succinctness
Arroyuelo D., Raman R. (2019) Adaptive Succinctness. In: Brisaboa N., Puglisi S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham

Representing a static set of integers S|S|=n|S|=n from a finite universe U=[1..u]U=[1..u] is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of rank and select on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.

Although there is an information-theoretic lower bound of B(n,u)=lg(un)B(n,u)=lg⁡(un) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where elements of S contain non-trivial runs of consecutive elements, one that occurs in many practical situations.

Let CnCn denote the class of (un)(un) distinct sets of nn elements over the universe [1..u][1..u]. Let also CngCnCgn⊂Cn contain the sets whose nn elements are arranged in gng≤n runs of i1ℓi≥1 consecutive element from U for i=1,,gi=1,…,g, and let Cng,rCngCg,rn⊂Cgn contain all sets that consist of g runs, such that rgr≤g of them have at least 2 elements.

  • We introduce new compressibility measures for sets, including:

    • L1=lg|Cng|=lg(un+1g)+lg(n1g1)L1=lg⁡|Cgn|=lg⁡(u−n+1g)+lg⁡(n−1g−1) and

    • L2=lg|Cng,r|=lg(un+1g)+lg(ng1r1)+lg(gr)L2=lg⁡|Cg,rn|=lg⁡(u−n+1g)+lg⁡(n−g−1r−1)+lg⁡(gr)

    We show that L2L1B(n,u)L2≤L1≤B(n,u).

  • We give data structures that use space close to bounds L1L1 and L2L2 and support rank and select in O(1) time.

  • We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.


Faster Dynamic Compressed d-ary Relations
Arroyuelo D., de Bernardo G., Gagie T., Navarro G. (2019) Faster Dynamic Compressed d-ary Relations. In: Brisaboa N., Puglisi S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham.

The 𝑘²-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to d dimensions, where it is called a 𝑘𝑑-tree. The representation boils down to a long bitvector. We show that interpreting the 𝑘𝑑-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.


Populist Multiculturalism in the Andes: Balancing Political Control and Societal Autonomy
Alberti, Carla. Comparative Politics, Volume 52, Number 1, October 2019, pp. 43-63(21). City University of New York.

Radical populists in the Andes have combined a populist program and a multicultural agenda. However, while populism centralizes power in the hands of the leader and emphasizes the unity of the people, multiculturalism grants cultural rights that strengthen societal autonomy, generating an inherent tension between these two modes of incorporation. How are populist governments able to combine unity and fragmentation as well as centralization and autonomy? This article develops the concept of populist multiculturalism, focusing on the Movimiento al Socialismo (MAS) in Bolivia, which has supported autonomy rights while simultaneously curtailing their implementation. Specifically, it examines the implementation of indigenous autonomous governments and prior consultation and the relationship between indigenous organizations and the ruling party. The article also extends this concept to Ecuador and Venezuela.


The semantic network of the Spanish dictionary during the last century: Structural stability and resilience
Camilo Garrido; Claudio Gutierrez; Guillermo Soto. University of Chile. Proceedings of Electronic Lexicography in the 21st Century Conference. October 2019, 831-848.

The semantic network of a dictionary is a mathematical structure that represents relationships among words of a language. In this work, we study the evolution of the semantic network of the Spanish dictionary during the last century, beginning in 1925 until 2014. We analysed the permanence and changes of its structural properties, such as size of components, average shortest path length, and degree distribution. We found that global structural properties of the Spanish dictionary network are remarkably stable. In fact, if we remove all the labels from the network, networks from different editions of the Spanish dictionary are practically indistinguishable. On the other hand, local properties change over the years offering insights about the evolution of lexicon. For instance, the neighbourhood of a single word or the shared neighbourhood between a pair of words. This paper presents preliminary evidence that dictionary networks are an interesting language tool and good proxies to study semantic clouds of words and their evolution in a given language.


How Party Activism Survives. Uruguay´s Frente Amplio
Pérez Bentancur, V., Piñeiro Rodríguez, R., & Rosenblatt, F. (2019). How Party Activism Survives: Uruguay's Frente Amplio. Cambridge: Cambridge University Press. DOI

Political parties with activists are in decline due to various external shocks. Societal changes, like the emergence of new technologies of communication, have diminished the role and number of activists, while party elites increasingly can make do without grassroots activists. However, recent scholarship concerning different democracies has shown how activism still matters for representation. This book contributes to this literature by analyzing the unique case of the Uruguayan Frente Amplio (FA), the only mass-organic, institutionalized leftist party in Latin America. Using thick description, systematic process tracing, and survey research, this case study highlights the value of an organization-centered approach for understanding parties’ role in democracy. Within the FA, organizational rules grant activists a significant voice, which imbues activists’ participation with a strong sense of efficacy. This book is an excellent resource for scholars and students of Latin America and comparative politics who are interested in political parties and the challenges confronting new democracies.


Arabic sentiment analysis: Studies, resources, and tools
Guellil, I., Azouaou, F. & Mendoza, M. Arabic sentiment analysis: studies, resources, and tools. Soc. Netw. Anal. Min. 9, 56 (2019).

To determine whether a document or a sentence expresses a positive or negative sentiment, three main approaches are commonly used: the lexicon-based approach, corpus-based approach, and a hybrid approach. The study of sentiment analysis in English has the highest number of sentiment analysis studies, while research is more limited for other languages, including Arabic and its dialects. Lexicon based approaches need annotated sentiment lexicons (containing the valence and intensity of its terms and expressions). Corpus-based sentiment analysis requires annotated sentences. One of the significant problems related to the treatment of Arabic and its dialects is the lack of these resources. We present in this survey the most recent resources and advances that have been done for Arabic sentiment analysis. This survey presents recent work (where the majority of these works are between 2015 and 2019). These works are classified by category (survey work or contribution work). For contribution work, we focus on the construction of sentiment lexicon and corpus. We also describe emergent trends related to Arabic sentiment analysis, principally associated with the use of deep learning techniques.

A Simple Data Structure for Optimal Two-Sided 2D Orthogonal Range Queries
Grez A., Calí A., Ugarte M. (2019) A Simple Data Structure for Optimal Two-Sided 2D Orthogonal Range Queries. In: Cuzzocrea A., Greco S., Larsen H., Saccà D., Andreasen T., Christiansen H. (eds) Flexible Query Answering Systems. FQAS 2019. Lecture Notes in Computer Science, vol 11529. Springer, Cham.

Given an arbitrary set A of two-dimensional points over a totally-ordered domain, a two-sided planar range query consists on finding all points of A within an arbitrary quadrant. In this paper we present a novel data structure that uses linear space in |A| while allowing for two-dimensional orthogonal range queries with logarithmic pre-processing and constant-delay enumeration.

Data mining for item recommendation in MOBA games
Vladimir Araujo, Felipe Rios, and Denis Parra. 2019. Data mining for item recommendation in MOBA games. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19). Association for Computing Machinery, New York, NY, USA, 393–397. DOI:

E-Sports has been positioned as an important activity within MOBA (Multiplayer Online Battle Arena) games in recent years. There is existing research on recommender systems in this topic, but most of it focuses on the character recommendation problem. However, the recommendation of items is also challenging because of its contextual nature, depending on the other characters. We have developed a framework that suggests items for a character based on the match context. The system aims to help players who have recently started the game as well as frequent players to take strategic advantage during a match and to improve their purchasing decision making. By analyzing a dataset of ranked matches through data mining techniques, we can capture purchase dynamic of experienced players to use it to generate recommendations. The results show that our proposed solution yields up to 80% of mAP, suggesting that the method leverages context information successfully. These results, together with open issues we mention in the paper, call for further research in the area.

Clustering Approaches for Top-k Recommender Systems
Nicolás Torres and Marcelo Mendoza. International Journal on Artificial Intelligence Tools. Vol. 28, No. 05, 1950019 (2019)

Clustering-based recommender systems bound the seek of similar users within small user clusters providing fast recommendations in large-scale datasets. Then groups can naturally be distributed into different data partitions scaling up in the number of users the recommender system can handle. Unfortunately, while the number of users and items included in a cluster solution increases, the performance in terms of precision of a clustering-based recommender system decreases. We present a novel approach that introduces a cluster-based distance function used for neighborhood computation. In our approach, clusters generated from the training data provide the basis for neighborhood selection. Then, to expand the search of relevant users, we use a novel measure that can exploit the global cluster structure to infer cluster-outside user’s distances. Empirical studies on five widely known benchmark datasets show that our proposal is very competitive in terms of precision, recall, and NDCG. However, the strongest point of our method relies on scalability, reaching speedups of 20× in a sequential computing evaluation framework and up to 100× in a parallel architecture. These results show that an efficient implementation of our cluster-based CF method can handle very large datasets providing also good results in terms of precision, avoiding the high computational costs involved in the application of more sophisticated techniques.

WekaDeeplearning4j: A deep learning package for Weka based on Deeplearning4j
Steven Lang, Felipe Bravo-Marquez, Christopher Beckham, Mark Hall, Eibe Frank. Knowledge-Based Systems. Volume 178, 15 August 2019, Pages 48-50.

Deep learning is a branch of machine learning that generates multi-layered representations of data, commonly using artificial neural networks, and has improved the state-of-the-art in various machine learning tasks (e.g., image classification, object detection, speech recognition, and document classification). However, most popular deep learning frameworks such as TensorFlow and PyTorch require users to write code to apply deep learning. We present WekaDeeplearning4j, a Weka package that makes deep learning accessible through a graphical user interface (GUI). The package uses Deeplearning4j as its backend, provides GPU support, and enables GUI-based training of deep neural networks such as convolutional and recurrent neural networks. It also provides pre-processing functionality for image and text data.

Adaptation to extreme environments in an admixed human population from the Atacama Desert
Lucas Vicuña, Mario I Fernandez, Cecilia Vial, Patricio Valdebenito, Eduardo Chaparro, Karena Espinoza, Annemarie Ziegler, Alberto Bustamante, Susana Eyheramendy, Adaptation to Extreme Environments in an Admixed Human Population from the Atacama Desert, Genome Biology and Evolution, Volume 11, Issue 9, September 2019, Pages 2468–2479,

Inorganic arsenic (As) is a toxic xenobiotic and carcinogen associated with severe health conditions. The urban population from the Atacama Desert in northern Chile was exposed to extremely high As levels (up to 600 µg/l) in drinking water between 1958 and 1971, leading to increased incidence of urinary bladder cancer (BC), skin cancer, kidney cancer, and coronary thrombosis decades later. Besides, the Andean Native-American ancestors of the Atacama population were previously exposed for millennia to elevated As levels in water (∼120 µg/l) for at least 5,000 years, suggesting adaptation to this selective pressure. Here, we performed two genome-wide selection tests—PBSn1 and an ancestry-enrichment test—in an admixed population from Atacama, to identify adaptation signatures to As exposure acquired before and after admixture with Europeans, respectively. The top second variant selected by PBSn1 was associated with LCE4A-C1orf68, a gene that may be involved in the immune barrier of the epithelium during BC. We performed association tests between the top PBSn1 hits and BC occurrence in our population. The strongest association (P =0.012) was achieved by the LCE4A-C1orf68 variant. The ancestry-enrichment test detected highly significant signals (P =1.3 × 10−9) mapping MAK16, a gene with important roles in ribosome biogenesis during the G1 phase of the cell cycle. Our results contribute to a better understanding of the genetic factors involved in adaptation to the pathophysiological consequences of exposure.


Extending general compact querieable representations to GIS applications
Brisaboa, N. R., Cerdeira-Pena, A., de Bernardo, G., Navarro, G., & Pedreira, Ó. (2020). Extending general compact queriable representations to GIS applications. Information Sciences, 506, 196–216. DOI

The raster model is commonly used for the representation of images in many domains and is especially useful in Geographic Information Systems (GIS) to store information about continuous variables of the space (elevation, temperature, etc.). Current representations of raster data are usually designed for external memory or, when stored in main memory, lack efficient query capabilities. In this paper, we propose compact representations to efficiently store and query raster datasets in the main memory. We present different representations for binary raster data, general raster data, and time-evolving raster data. We experimentally compare our proposals with traditional storage mechanisms such as linear quadtrees or compressed GeoTIFF files. Results show that our structures are up to 10 times smaller than classical linear quadtrees, and even comparable in space to non-queriable representations of raster data, while efficiently answering a number of typical queries.


Taming the digital information tide to promote equality
Valenzuela, S., Rojas, H. Taming the digital information tide to promote equality. Nat Hum Behav 3, 1134–1136 (2019).

Interactive technologies are changing the ways we learn facts, develop attitudes and participate in politics, with the ensuing risk of increasing pre-existing inequalities. Addressing this challenge is the duty of researchers, technology companies, governments and news organizations.

Getting Prepared to Be Prepared: How Interpersonal Skills Aid Fieldwork in Challenging Contexts
Alberti, C., & Jenne, N. (2019). Getting Prepared to Be Prepared: How Interpersonal Skills Aid Fieldwork in Challenging Contexts. Qualitative Sociology Review, 15(3), 42-62.

This article deals with fieldwork in challenging research contexts that make preparation for field research particularly difficult. Challenging contexts include generally insecure places, politicized contexts, and unknown settings. Drawing on our experience in the field, we discuss four challenges that are common across these contexts: access, positionality, researcher well-being, and research design, and data collection. Bringing together insights from fieldwork with urban elites and in the countryside, this paper describes problems that occurred in both settings and identifies a set of interpersonal skills that helped the authors to tackle the challenges of the field and seize the opportunities it offered. This article posits that recognizing the importance of certain interpersonal skills, namely: openness, empathy, humility, and flexibility, precedes the identification of practical tools. Interpersonal skills, instead, focus on a general attitude that underlies researchers’ capacity to make informed choices about specific courses of action, preparing fieldworkers to be prepared to confront problems once they arise.


An ELMo-inspired approach to SemDeep-5’s Word-in-Context task
Alan Ansell, Felipe Bravo-Marquez, Bernhard Pfahringer. Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5), August 2019. Pages: 21–25.

This paper describes a submission to the Word-in-Context competition for the IJCAI 2019 SemDeep-5 workshop. The task is to determine whether a given focus word is used in the same or different senses in two contexts. We took an ELMo-inspired approach similar to the baseline model in the task description paper, where contextualized representations are obtained for the focus words and a classification is made according to the degree of similarity between these representations. Our model had a few simple differences, notably joint training of the forward and backward LSTMs, a different choice of states for the contextualized representations and a new similarity measure for them. These changes yielded a 3.5% improvement on the ELMo baseline.

Analyzing the Design Space for Visualizing Neural Attention in Text Classification
Parra, D., Valdivieso, H., Carvallo, A., Rada, G., Verbert, K., & Schreck, T. (2019). Analyzing the Design Space for Visualizing Neural Attention in Text Classification. In Proc. IEEE VIS Workshop on Vis X AI: 2nd Workshop on Visualization for AI Explainability (VISxAI).

Introduction: Deep Neural Networks (DNNs) are a type of machine learning model (Goodfellow et al, 2016) which have reported state-of-the-art results in several tasks in the past years. Despite the impressive results reported by these models in several fields such as computer vision (Krizhevsky et al., 2014), natural language processing (Mikolov et al., 2013) or recommender systems (Covington et al., 2016), one of their biggest drawbacks is their lack of interpretability and transparency. Some of the best performing DNN models have millions of parameters, so making sense of what these models learn is an active research challenge. These algorithms can help to solve and automate difficult and expensive tasks, but their adoption in critical domains, which usually requires liability, depends on making their decision interpretable by humans. Some large funding programs such as DARPA XAI (Gunning and Aha, 2019) are addressing this problem, providing evidence of their importance. On the other side, recent legislation such as Europe’s GDPR gives people the right to explainability of automated decisions regarding their private data.

One of the most significant techniques introduced to DNNs in the latest years is the so called attention mechanism (Larrochelle and Hinton, 2010). The idea is inspired by our visual system, since humans focus selectively on parts rather than on a whole image, combining information from several fixations to form the full scene (Mnih et al, 2014). This mechanism allows the network to focus on a subset of inputs or parameters when trained on a task. Attention has improved the performance of these models, and it has also given them a chance to be more explainable. Inspecting what the model is paying attention to helps to make the model accountable in tasks such as image classification, document classification or automatic image captioning. Despite this potential, researchers in the area of machine learning usually use the traditional visualization idioms available in software packages, rather than studying all the options for visual encodings to represent models, results or parameters more effectively. We see a chance of using design principles from information visualization in order to improve the way that neural attention models are visually presented.

This article focuses on the design space to analyze, inspect and understand what neural attention models are learning. In particular, we aim at contributing to the field of Explainable Artificial Intelligence (XAI), by describing the potential design space as well as informed decisions to take into account when presenting the results of neural networks using the attention mechanism. We also propose some initial ideas with a use case: classification of biomedical documents.

Comparing Word Embeddings for Document Screening based on Active Learning
Carvallo, Andres and Denis Parra. “Comparing Word Embeddings for Document Screening based on Active Learning.” CEUR Workshop Proceedings. BIRNDL@SIGIR (July 25, 2019).

Document screening is a fundamental task within Evidencebased Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches are attempting to reduce the workload of physicians who need to screen and label hundreds or thousands of documents in order to answer specific clinical questions. Previous works have attempted to semi-automate document screening, reporting promising results, but their evaluation is conducted using small datasets, which hinders generalization. Moreover, some recent works have used recently introduced neural language models, but no previous work have compared, for this task, the performance of different language models based on neural word embeddings, which have reported good results in the latest years for several NLP tasks. In this work, we evaluate the performance of two popular neural word embeddings (Word2vec and GloVe) in an active learning-based setting for document screening in EBM, with the goal of reducing the number of documents that physicians need to label in order to answer clinical questions. We evaluate these methods in a small public dataset (HealthCLEF 2017) as well as a larger one (Epistemonikos). Our experiments indicate that Word2vec have less variance and better general performance than GloVe when using active learning strategies based on uncertainty sampling.


A Meta-Analysis of the Effects of Cross-Cutting Exposure on Political Participation
Jörg Matthes, Johannes Knoll, Sebastián Valenzuela, David Nicolas Hopmann & Christian Von Sikorski (2019) A Meta-Analysis of the Effects of Cross-Cutting Exposure on Political Participation, Political Communication, 36:4, 523-542, DOI: 10.1080/10584609.2019.1619638

Scholars have advanced many theoretical explanations for expecting a negative or positive relationship between individuals’ cross-cutting exposure—either through interpersonal or mediated forms of communication—and their political participation. However, whether cross-cutting exposure is a positive or negative predictor of participation is still an unsettled question. To help fill this gap, we conducted a meta-analysis of 48 empirical studies comprising more than 70,000 participants examining the association between cross-cutting exposure and political participation. The meta-analysis produced two main findings. First, it shows that, over all studies, there is no significant relationship, r = .002, Zr = .002 (95% CI = −.04 to .05). Second, the null relationship cannot be explained by variations in the characteristics of cross-cutting environments (e.g., topic, place, or source of exposure), participation outcomes (e.g., online vs. offline activities), or methods employed (e.g., experiment vs. survey). Taken together, these results should alleviate concerns about negative effects of cross-cutting exposure on political engagement. Implications for future research are discussed.

Knowledge Discovery from News Events on Social Media
Mauricio Quezada, Department of Computer Science, Universidad de Chile; Millenium Institute for Foundational Research on Data, Santiago, Chile. FDIA 2019, JULY 2019. CEUR Workshop Proceedings Vol-2537.

Online activity involves the consumption and production of event-related content. There are about 500 million Twitter messages published every day, and according to surveys, 59% of its users use the platform as a way to get the news. Its high rate of production of multimodal content (text, images, and videos) necessitates having flexible models to understand the dynamics of the information disseminated on social media. This thesis proposes the creation of context models from usergenerated messages on Twitter to discover knowledge as a way to perform high-level quantitative analysis of news events. These models are useful in three perspectives: the spatio-temporal context in which the events develop, the activity of users that react when a high-impact event happens, and the multimodal content that can be exploited to generate a comprehensive summary of the event. Our current work involves the creation of a geopolitical model that relates events and countries, allowing us to discover international relations; the study of what features make an event susceptible to provoke high activity from users, and a characterization that allows us to predict with high precision which events are going to produce high activity. This includes our ongoing work on generating automatic multimodal summaries of events based on the assumption that the users describe the non-textual content in their tweets when they express their facts and opinions around events.

A Learning-Based Framework for Memory-Bounded Heuristic Search: First Results
Carlos Hernández Ulloa, Jorge Baier, William Yeoh, Vadim Bulitko, Sven Koenig. Proceedings of the 12th Annual Symposium on Combinatorial Search (SoCS 2019), July 2019. 978-1-57735-808-4.

Introduction Memory-bounded search algorithms are typically used when the search space is too large for regular best-first search algorithms like A* to store in memory. There exists a large class of memory-bounded best-first search algorithms including Depth-First Branch-and-Bound (DFBnB), Iterative Deepening A* (IDA*) (Korf 1985), Recursive Best-First Search (RBFS) (Korf 1993), and Simplified Memory-Bounded A* (SMA*) (Russell 1992). Each of these algorithms rely on a different strategy to ensure that they use only a bounded amount of memory: IDA* bounds the amount of memory used by repeatedly running depth-first searches, increasing the explored depth at each iteration. RBFS uses lower and upper bounds that are tightened over time as it explores the search space while keeping only b · d nodes in memory, where b is the branching factor and d is the depth of the tree. And, finally, SMA* keeps only a bounded number of nodes in memory by pruning the least promising nodes from the OPEN list when it runs out of memory. In this abstract, we summarize an alternative approach to memory-bounded best-first search. It is motivated by realtime heuristic search algorithms (Korf 1990), many of which iterate the following steps until the goal is reached: up to k nodes are expanded, where k is a user-defined bound; the h values of expanded nodes are updated to make them more informed; the agents moves along a path along the search tree just expanded. We propose a general framework that iteratively (1) runs a memory-bounded best-first search algorithm that terminates when k nodes are generated. If no solution is found, (2) it updates the h-values of the generated nodes, and (3) purges the h values of some nodes from memory. As such, the total number of h-values ever stored by our approach is upper-bounded by a constant. Under certain (reasonable) conditions, our framework is complete and preserves the (sub)optimality guarantees of the given best-first search algorithm in tree-shaped search spaces. The main conceptual difference between our framework and the SMA* algorithm is that it can be combined with any bestfirst algorithm with very minor modifications. We present experimental results where we plug into our framework memory-bounded variants of Weighted A* (Pohl 1970). On traveling salesman problems we show that our framework is often able to find better solutions than DFBnB and Weighted DFBnB (wDFBnB) and in a smaller amount of time, especially in problems with large search spaces.

Compiling Cost-Optimal Multi-Agent Pathfinding to ASP
Rodrigo N. Gómez, Carlos Hernández, Jorge Baier. Proceedings of the 12th Annual Symposium on Combinatorial Search (SoCS 2019), July 2019. 978-1-57735-808-4.

Introduction: Multi-Agent Pathfinding (MAPF) over grids is the problem of finding n non-conflicting paths that lead n agents from a given initial cell to a given goal cell. Sum-of-costsoptimal MAPF, or simply cost-optimal MAPF, in addition, minimizes the total number of actions performed by each agent before stopping at the goal. Being a combinatorial problem in nature, a number of compilations from MAPF to Satisfiability (SAT) (Surynek et al. 2016) and Answer Set Programming (ASP) exist (Erdem et al. 2013; Gebser et al. 2018). Here we propose and evaluate a new compilation of MAPF over grids to ASP. Unlike existing compilations we are aware of, both to SAT and to ASP, our encoding is the first that produces a number of clauses that is linear on the number of agents. In addition, the clauses that allow representing the optimization objective are also efficiently written, and do not depend on the size of the grid. Like makespan-optimal approaches, our algorithm searches for cost-optimal solutions with increasing makespan. When a solution is found a provably correct upper bound on the maximum makespan at which a true cost-optimal solution exists is computed, and the solver is rerun once more.

Boundedness of Conjunctive Regular Path Queries
Pablo Barceló and Diego Figueira and Miguel Romero. 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs). Vol 132.

We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of at most triple-exponential size, and in fact we show that this bound is optimal. We also study better behaved classes of UC2RPQs, namely acyclic UC2RPQs of bounded thickness, and strongly connected UCRPQs, whose boundedness problem is, respectively, PSpace-complete and Pi_2^P-complete. Most upper bounds exploit results on limitedness for distance automata, in particular extending the model with alternation and two-wayness, which may be of independent interest.

Monadic Decomposability of Regular Relations
Pablo Barcelo, Chih-Duo Hong, Xuan-Bach Le, Anthony W. Lin, Reino Niskanen. 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs). Vol 132.

Monadic decomposibility – the ability to determine whether a formula in a given logical theory can be decomposed into a boolean combination of monadic formulas – is a powerful tool for devising a decision procedure for a given logical theory. In this paper, we revisit a classical decision problem in automata theory: given a regular (a.k.a. synchronized rational) relation, determine whether it is recognizable, i.e., it has a monadic decomposition (that is, a representation as a boolean combination of cartesian products of regular languages). Regular relations are expressive formalisms which, using an appropriate string encoding, can capture relations definable in Presburger Arithmetic. In fact, their expressive power coincide with relations definable in a universal automatic structure; equivalently, those definable by finite set interpretations in WS1S (Weak Second Order Theory of One Successor). Determining whether a regular relation admits a recognizable relation was known to be decidable (and in exponential time for binary relations), but its precise complexity still hitherto remains open. Our main contribution is to fully settle the complexity of this decision problem by developing new techniques employing infinite Ramsey theory. The complexity for DFA (resp. NFA) representations of regular relations is shown to be NLOGSPACE-complete (resp. PSPACE-complete).

On the reproducibility of experiments of indexing repetitive document collections
Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, Diego Seco, On the reproducibility of experiments of indexing repetitive document collections, Information Systems, Volume 83, 2019, Pages 181-194, ISSN 0306-4379.

This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work Claude et al., (2016). In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.


MāOri Loanwords: A Corpus of New Zealand English Tweets
David Trye, Andreea Calude, Felipe Bravo-Marquez, Te Taka Keegan. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. DOI: 10.18653/v1/P19-2018

Māori loanwords are widely used in New Zealand English for various social functions by New Zealanders within and outside of the Māori community. Motivated by the lack of linguistic resources for studying how Māori loanwords are used in social media, we present a new corpus of New Zealand English tweets. We collected tweets containing selected Māori words that are likely to be known by New Zealanders who do not speak Māori. Since over 30% of these words turned out to be irrelevant, we manually annotated a sample of our tweets into relevant and irrelevant categories. This data was used to train machine learning models to automatically filter out irrelevant tweets.

A Lightweight Representation of News Events on Social Media
Mauricio Quezada and Barbara Poblete. 2019. A Lightweight Representation of News Events on Social Media. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). Association for Computing Machinery, New York, NY, USA, 1049–1052. DOI:

The sheer amount of newsworthy information published by users in social media platforms makes it necessary to have efficient and effective methods to filter and organize content. In this scenario, off-the-shelf methods fail to process large amounts of data, which is usually approached by adding more computational resources. Simple data aggregations can help to cope with space and time constraints, while at the same time improve the effectiveness of certain applications, such as topic detection or summarization. We propose a lightweight representation of newsworthy social media data. The proposed representation leverages microblog features, such as redundancy and re-sharing capabilities, by using surrogate texts from shared URLs and word embeddings. Our representation allows us to achieve comparable clustering results to those obtained by using the complete data, while reducing running time and required memory. This is useful when dealing with noisy and raw user-generated social media data.

Hate speech detection is not as easy as you may think: A closer look at model validation (extended version)
Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). Association for Computing Machinery, New York, NY, USA, 45–54. DOI:

Hate speech is an important problem that is seriously affecting the dynamics and usefulness of online social communities. Large scale social platforms are currently investing important resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance but only within specific datasets. In this work, we analyze this apparent contradiction between existing literature and actual applications. We study closely the experimental methodology used in prior work and their generalizability to other datasets. Our findings evidence methodological issues, as well as an important dataset bias. As a consequence, performance claims of the current state-of-the-art have become significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-conduct experiments to give a more accurate picture of the current state-of-the art methods.

Cell cycle and protein complex dynamics in discovering signaling pathways
Inostroza, D., Hernández, C., Seco, D., Navarro, G., & Olivera-Nappa, A. (2019). Cell cycle and protein complex dynamics in discovering signaling pathways. Journal of Bioinformatics and Computational Biology, 1950011. DOI:

Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationships among proteins, in candidate pathways using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information for deciding on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.24


When is Ontology-Mediated Querying Efficient?
P. Barceló, C. Feier, C. Lutz and A. Pieris, "When is Ontology-Mediated Querying Efficient?," 2019 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), Vancouver, BC, Canada, 2019, pp. 1-13, doi: 10.1109/LICS.2019.8785823

In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or fixed-parameter tractable. We study this question for a range of ontology-mediated query languages based on several important and widely-used DLs, using unions of conjunctive queries as the actual queries. For the DL ELHI⊥, we provide a characterization of the classes of OMQs that are fixed-parameter tractable. For its fragment ELH⊥ dr , which restricts the use of inverse roles, we provide a characterization of the classes of OMQs that are tractable in combined complexity. Both results are in terms of equivalence to OMQs of bounded tree width and rest on a reasonable assumption from parameterized complexity theory. They are similar in spirit to Grohe’s seminal characterization of the tractable classes of conjunctive queries over relational databases. We further study the complexity of the meta problem of deciding whether a given OMQ is equivalent to an OMQ of bounded tree width, providing several completeness results that range from NP to 2ExpTIME, depending on the DL used. We also consider the DL-Lite family of DLs, including members that, unlike εLHI  , admit functional roles.

A Theoretical View on Reverse Engineering Problems for Database Query Languages
Pablo Barceló; University of Chile (CL). CEUR Workshop Proceedings. DL 2019, International Workshop on Description Logics, June 18-21, 2019.

Abstract. A typical reverse engineering problem for a query language L is, given a database D and a sets P and N of tuples over D labeled as positive and negative examples, respectively, is there a query q in L that explains P and N, i.e., the evaluation of q on D contains all positive examples in P and none of the negative examples in N? Applications of reverse engineering problems include query-by-example, classifier engineering, and the study of the expressive power of query languages. In this talk I will present a family of tests that solve the reverse engineering problem described above for several query languages of interest, e.g., FO, CQ, UCQs, RPQs, CRPQs, etc. We will see that in many cases such tests directly provide optimal bounds for the problem, as well as for the size of the smallest query that explains the given labeled examples. I will also present restrictions that alleviate the complexity of the problem when it is too high. Finally, I will develop the relationship between reverse engineering and a separability problem recently introduced to assist the task of feature engineering with data management tools.

The Paradox of Participation Versus Misinformation: Social Media, Political Engagement, and the Spread of Misinformation
Sebastián Valenzuela, Daniel Halpern, James E. Katz & Juan Pablo Miranda (2019) The Paradox of Participation Versus Misinformation: Social Media, Political Engagement, and the Spread of Misinformation, Digital Journalism, 7:6, 802-823, DOI: 10.1080/21670811.2019.1623701

The mechanisms by which users of platforms such as Facebook and Twitter spread misinformation are not well understood. In this study, we argue that the effects of informational uses of social media on political participation are inextricable from its effects on misinformation sharing. That is, political engagement is both a major consequence of using social media for news as well as a key antecedent of sharing misinformation. We test our expectations via a two-wave panel survey of online media users in Chile, a country experiencing information disorders comparable to those of the global North. Analyses of the proposed and alternative causal models with two types of structural equation specifications (fixed effects and autoregressive) support our theoretical model. We close with a discussion on how changes in the way people engage with news and politics – brought about by social media – have produced a new dilemma: how to sustain a citizenry that is enthusiastically politically active, yet not spreading misinformation?

Connecting Knowledge Compilation Classes Width Parameters
Amarilli, A., Capelli, F., Monet, M. et al. Connecting Knowledge Compilation Classes Width Parameters. Theory Comput Syst 64, 861–914 (2020).

The field of knowledge compilation establishes the tractability of many tasks by studying how to compile them to Boolean circuit classes obeying some requirements such as structuredness, decomposability, and determinism. However, in other settings such as intensional query evaluation on databases, we obtain Boolean circuits that satisfy some width bounds, e.g., they have bounded treewidth or pathwidth. In this work, we give a systematic picture of many circuit classes considered in knowledge compilation and show how they can be systematically connected to width measures, through upper and lower bounds. Our upper bounds show that bounded-treewidth circuits can be constructively converted to d-SDNNFs, in time linear in the circuit size and singly exponential in the treewidth; and that bounded-pathwidth circuits can similarly be converted to uOBDDs. We show matching lower bounds on the compilation of monotone DNF or CNF formulas to structured targets, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable): any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth, and the same holds for uOBDDs (resp., n-OBDDs) when considering pathwidth. Unlike most previous work, our bounds apply to any formula of this class, not just a well-chosen family. Hence, we show that pathwidth and treewidth respectively characterize the efficiency of compiling monotone DNFs to uOBDDs and d-SDNNFs with compilation being singly exponential in the corresponding width parameter. We also show that our lower bounds on CNFs extend to unstructured compilation targets, with an exponential lower bound in the treewidth (resp., pathwidth) when compiling monotone CNFs of constant arity and degree to DNNFs (resp., nFBDDs).

From Belief in Conspiracy Theories to Trust in Others: Which Factors Influence Exposure, Believing and Sharing Fake News
Halpern D., Valenzuela S., Katz J., Miranda J.P. (2019) From Belief in Conspiracy Theories to Trust in Others: Which Factors Influence Exposure, Believing and Sharing Fake News. In: Meiselwitz G. (eds) Social Computing and Social Media. Design, Human Behavior and Analytics. HCII 2019. Lecture Notes in Computer Science, vol 11578. Springer, Cham.

Drawing on social-psychological and political research, we offer a theoretical model that explains how people become exposed to fake news, come to believe in them and then share them with their contacts. Using two waves of a nationally representative sample of Chileans with internet access, we pinpoint the relevant causal factors. Analysis of the panel data indicate that three groups of variables largely explain these phenomena: (1) Personal and psychological factors such as belief in conspiracy theories, trust in others, education and gender; (2) Frequency and specific uses of social media; and (3) Political views and online activism. Importantly, personal and political-psychological factors are more relevant in explaining this behavior than specific uses of social media.

An Empirical Analysis of Rumor Detection on Microblogs with Recurrent Neural Networks
Bugueño M., Sepulveda G., Mendoza M. (2019) An Empirical Analysis of Rumor Detection on Microblogs with Recurrent Neural Networks. In: Meiselwitz G. (eds) Social Computing and Social Media. Design, Human Behavior and Analytics. HCII 2019. Lecture Notes in Computer Science, vol 11578. Springer, Cham.

The popularity of microblogging websites makes them important for information dissemination. The diffusion of large volumes of fake or unverified information could emerge and spread producing damage. Due to the ever-increasing volume of data and the nature of complex diffusion, automatic rumor detection is a very challenging task. Supervised classification and other approaches have been widely used to identify rumors in social media posts. However, despite achieving competitive results, only a few studies have delved into the nature of the problem itself in order to identify key empirical factors that allow defining both the baseline models and their performance. In this work, we learn discriminative features from tweets content and propagation trees by following their sequential propagation structure. To do this we study the performance of a number of architectures based on recursive neural networks conditioning for rumor detection. In addition, to ingest tweets into each network, we study the effect of two different word embeddings schemes: Glove and Google news skip-grams. Results on the Twitter16 dataset show that model performance depends on many empirical factors and that some specific experimental configurations consistently drive to better results.

Estimating Ground Shaking Regions with Social Media Propagation Trees
Mendoza M., Poblete B., Valderrama I. (2019) Estimating Ground Shaking Regions with Social Media Propagation Trees. In: Meiselwitz G. (eds) Social Computing and Social Media. Design, Human Behavior and Analytics. HCII 2019. Lecture Notes in Computer Science, vol 11578. Springer, Cham.

The Mercalli scale of quake damages is based on perceived effects and it has a strong dependence on observers. Recently, we proposed a method for ground shaking intensity estimation based on lexical features extracted from tweets, showing good performance in terms of mean absolute error (MAE). One of the flaws of that method is the detection of the region of interest, i.e., the area of a country where the quake was felt. Our previous results showed enough recall in terms of municipality recovery but a poor performance in terms of accuracy. One of the reasons that help to explain this effect is the presence of data noise as many people comment or confirm a quake in areas where the event was unperceived. This happens because people get awareness of an event by watching news or by word-of-mouth propagation. To alleviate this problem in our earthquake detection system we study how propagation features behave in a region of interest estimation task. The intuition behind our study is that the patterns that characterize a word-of-mouth propagation differ from the patterns that characterize a perceived event. If this intuition is true, we expect to separate both kinds of propagation modes. We do this by computing a number of features to represent propagation trees. Then, we trained a learning algorithm using our features in the specific task of region of interest estimation. Our results show that propagation features behave well in this task, outperforming lexical features in terms of accuracy.

Claim Behavior over Time in Twitter
Weiss F., Espinoza I., Hurtado J., Mendoza M. (2019) Claim Behavior over Time in Twitter. In: Meiselwitz G. (eds) Social Computing and Social Media. Design, Human Behavior and Analytics. HCII 2019. Lecture Notes in Computer Science, vol 11578. Springer, Cham.

Social media is the primary source of information for many people around the world, not only to know about their families and friends but also to read about news and trends in different areas of interest. Fake News or rumors can generate big problems of misinformation, being able to change the mindset of a large group of people concerning a specific topic. Many companies and researchers have put their efforts into detecting these rumors with machine learning algorithms creating reports of the influence of these “news” in social media ( Only a few studies have been made in detecting rumors in real-time, considering the first hours of propagation. In this work, we study the spread of a claim, analyzing different characteristics and how propagation patterns behave in time. Experiments show that rumors have different behaviours that can be used to classify them within the first hours of propagation.

Trajectory Patterns Based on Segment-Cutting Clustering
Luis Cabrera-Crot, Andrea Rodríguez, and Diego Seco; Universidad de Concepción, Chile. Mónica Caniupán, Universidad del Bío-Bío, Chile. CEUR Workshop Proceedings. AMW 2019, June 3-7, 2019.

Trajectory patterns characterize similar behaviors among trajectories, which play an important role in applications such as urban planning, traffic congestion control, and studies of animal migration and natural phenomena. In this paper we model trajectories as a sequence of line segments that represent the steady movement of an object along time. We use a segment-clustering process to group trajectories’ segments and partial segments based on their temporal and spatial closeness. Then, it defines a trajectory pattern that results from the aggregation of segment clusters, aggregation that is not only based on spatial and temporal sequentiality, but also on the compatibility of trajectories in each segment cluster. The experimental assessment shows the effectiveness of the method.


Linear Recursion in G-CORE
Valentina Urzua and Claudio Gutierrez. Department of Computer Science, Universidad de Chile and IMFD. CEUR Workshop Proceedings. Alberto Mendelzon Workshop on Foundations of Data Management. June 3–7, 2019.

G-CORE is a query language with two key characteristics: It is closed under graphs and incoporates paths as first-class citizens. Currently G-CORE does not have recursion. In this paper we propose this extension and show how to code classical polynomial graph algorithms with it.


RDF and Property Graphs Interoperability: Status and Issues
Renzo Angles, Universidad de Talca, Chile; Millennium Institute for Foundational Research on Data, Chile. Harsh Thakkar, University of Bonn, Germany. Dominik Tomaszuk, University of Bialystok, Poland. CEUR Workshop Proceedings, AMW 2019, June 3-7, 2019.

RDF and Property Graph databases are two approaches for data management that are based on modeling, storing and querying graph-like data. In this paper, we present a short study about the interoperability between these approaches. We review the current solutions to the problem, identify their features, and discuss the inherent issues.


Preferences for Redistribution and Tax Burdens in Latin America
Bogliaccini, J., & Luna, J. (2019). Preferences for Redistribution and Tax Burdens in Latin America. In G. Flores-Macías (Ed.), The Political Economy of Taxation in Latin America (pp. 219-241). Cambridge: Cambridge University Press. DOI
Agenda Setting and Journalism
Valenzuela, S. (2019, June 25). Agenda Setting and Journalism. Oxford Research Encyclopedia of Communication. Retrieved 27 Oct. 2020, from

People use the news media to learn about the world beyond their family, neighborhood, and workplace. As news consumers, we depend on what television, social media, websites, radio stations, and newspapers decide to inform us about. This is because all news media, whether through journalists or digital algorithms, select, process, and filter information to their users. Over time, the aspects that are prominent in the news media usually become prominent in public opinion. The ability of journalists to influence which issues, aspects of these issues, and persons related to these issues, are perceived as the most salient has come to be called the agenda-setting effect of journalism.

First described by Maxwell McCombs and Donald Shaw in a seminal study conducted during the 1968 elections in the United States, agenda-setting theory has expanded to include several other aspects beyond the transfer of salience of issues from the media agenda to the public agenda. These aspects include: the influence of journalism on the attributes of issues and people that make news; the networks between the different elements in the media and public agendas; the determinants of the news media agenda; the psychological mechanisms that regulate agenda-setting effects; and the consequences of agenda setting on both citizens’ and policymakers’ attitudes and behaviors. As one of the most comprehensive and international theories of journalism studies available, agenda setting continues to evolve in the expanding digital media landscape.


Testing the Hypothesis of «Impressionable Years» With Willingness to Self-Censor in Chile
Nicolle Etchegaray, Andrés Scherman, Sebastián Valenzuela, Testing the Hypothesis of “Impressionable Years” With Willingness to Self-Censor in Chile, International Journal of Public Opinion Research, Volume 31, Issue 2, Summer 2019, Pages 331–348,

This study seeks to deepen our understanding of the factors that explain individuals’ willingness to self-censor (WtSC)—the proclivity to withhold an opinion from an audience perceived to disagree with that opinion. It does so by testing the “impressionable years” hypothesis, which states that the historical context experienced between the age of 18 and 25 years has a lasting effect on individual dispositions such as WtSC. The study was conducted in Chile, an ideal case to explore possible cohort effects because of the profound political changes experienced there in the past 50 years. Analysis of an original cross-sectional survey shows that—as expected—people who came of age in periods of political repression exhibit significantly higher levels of WtSC later in life compared with those who grew up during less repressive times.

Regularizing Conjunctive Features for Classification
Pablo Barceló, Alexander Baumgartner, Victor Dalmau, and Benny Kimelfeld. 2019. Regularizing Conjunctive Features for Classification. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '19). Association for Computing Machinery, New York, NY, USA, 2–16. DOI:

We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.

Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators
Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning (DEEM'19). Association for Computing Machinery, New York, NY, USA, Article 9, 1–5. DOI:

Tensors are one of the most widely used data structures in modern Machine Learning applications. Although they provide a flexible way of storing and accessing data, they often expose too many low-level details that may result in error prone code that is difficult to maintain and extend. Abstracting low-level functionalities into high-level operators in the form of a query language is a task in which the Data Management community has extensive experience. It is thus important to understand how such an experience can be applied in the design of useful languages for tensor manipulation.

In this short paper we study a matrix and a tensor query language that have been recently proposed in the database literature. We show, by using examples, how these proposals are in line with the practical interest in rethinking tensor abstractions. On the technical side, we compare the two languages in terms of operators that naturally arise in Machine Learning pipelines, such as convolution, matrix-inverse, and Einstein summation. We hope our results to provide a theoretical kick-off for the discussion on the design of core declarative query languages for tensors.

Database Repairs and Consistent Query Answering: Origins and Further Developments
Leopoldo Bertossi. 2019. Database Repairs and Consistent Query Answering: Origins and Further Developments. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '19). Association for Computing Machinery, New York, NY, USA, 48–58. DOI:

In this article we review the main concepts around database repairs and consistent query answering, with emphasis on tracing back the origin, motivation, and early developments. We also describe some research directions that has spun from those main concepts and the original line of research. We emphasize, in particular, fruitful and recent connections between repairs and causality in databases.

Rapid Sequence Matching for Visualization Recommender Systems
Shaoliang Nie, Christopher G. Healey, Rada Y. Chirkova, and Juan L. Reutter. 2019. Rapid Sequence Matching for Visualization Recommender Systems. In Proceedings of the 45th Graphics Interface Conference on Proceedings of Graphics Interface 2019 (GI'19). Canadian Human-Computer Communications Society, Waterloo, CAN, Article 5, 1–8. DOI:

We present a method to support high quality visualization recommendations for analytic tasks. Visualization converts large datasets into images that allow viewers to efficiently explore, discover, and validate within their data. Visualization recommenders have been proposed that store past sequences: an ordered collection of design choices leading to successful task completion; then match them against an ongoing visualization construction. Based on this matching, a system recommends visualizations that better support the analysts’ tasks. A problem of scalability occurs when many sequences are stored. One solution would be to index the sequence database. However, during matching we require sequences that are similar to the partially constructed visualization, not only those that are identical. We implement a locality sensitive hashing algorithm that converts visualizations into set representations, then uses Jaccard similarity to store similar sequence nodes in common hash buckets. This allows us to match partial sequences against a database containing tens of thousands of full sequences in less than 100ms. Experiments show that our algorithm locates 95% or more of the sequences found in an exhaustive search, producing high-quality visualization recommendations.



Efficient Logspace Classes for Enumeration, Counting, and Uniform Generation
Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, Cristian Riveros. PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsJune 2019 Pages 59–73

We study two simple yet general complexity classes, which provide a unifying framework for ecient query evaluation in areas like graph databases and information extraction, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect. Both complexity classes are defined in terms of non deterministic logarithmic-space transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomialtime randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm (with preprocessing) for uniform generation. Remarkably, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n (given in unary) accepted by a non-deterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether this problem admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in SpanL admits an FPRAS.


Repair-Based Degrees of Database Inconsistency
Bertossi L. (2019) Repair-Based Degrees of Database Inconsistency. In: Balduccini M., Lierler Y., Woltran S. (eds) Logic Programming and Nonmonotonic Reasoning. LPNMR 2019. Lecture Notes in Computer Science, vol 11481. Springer, Cham.

We propose and investigate a concrete numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on a database repair semantics associated to cardinality-repairs. More specifically, it is shown that the computation of this measure can be intractable in data complexity, but answer-set programs are exhibited that can be used to compute it. Furthermore, its is established that there are polynomial-time deterministic and randomized approximations. The behavior of this measure under small updates is analyzed, obtaining fixed-parameter tractability results. We explore abstract extensions of this measure that appeal to generic classes of database repairs. Inconsistency measures and repairs at the attribute level are investigated as a particular, but relevant and natural case.

Using Twitter to Infer User Satisfaction With Public Transport: The Case of Santiago, Chile
J. T. Méndez, H. Lobel, D. Parra and J. C. Herrera, "Using Twitter to Infer User Satisfaction With Public Transport: The Case of Santiago, Chile," in IEEE Access, vol. 7, pp. 60255-60263, 2019, doi: 10.1109/ACCESS.2019.2915107

User satisfaction is an important aspect to consider in any public transport system, and as such, regular and sound measurements of its levels are fundamental. However, typical evaluation schemes involve costly and time-consuming surveys. As a consequence, their frequency is not enough to properly and timely characterize the satisfaction of the users. In this paper, we propose a methodology, based on Twitter data, to capture the satisfaction of a large mass of users of public transport, allowing us to improve the characterization and location of their satisfaction level. We analyzed a massive volume of tweets referring to the public transport system in Santiago, Chile (Transantiago) using text mining techniques, such as sentiment analysis and topic modeling, in order to capture and group bus users’ expressions. Results show that, although the level of detail and variety of answers obtained from surveys are higher than the ones obtained by our method, the amount of bus stops and bus services covered by the proposed scheme is larger. Moreover, the proposed methodology can be effectively used to diagnose problems in a timely manner, as it is able to identify and locate trends, and issues related to bus operating firms, whereas surveys tend to produce average answers. Based on the consistency and logic of the results, we argue that the proposed methodology can be used as a valuable complement to surveys, as both present different, but compatible characteristics.


AffectiveTweets: a Weka Package for Analyzing Affect in Tweets
Bravo-Marquez, Felipe & Pfahringer, Bernhard & Mohammad, Saif & Frank, Eibe. (2019). AffectiveTweets: a Weka Package for Analyzing Affect in Tweets. Journal of Machine Learning Research. 20. 1-6.

AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabeled tweets. The package was used by several teams in the shared tasks: EmoInt 2017 and Affect in Tweets SemEval 2018 Task 1.

Sketch-Aided Retrieval of Incomplete 3D Cultural Heritage Objects
Stefan Lengauer, Alexander Komar, Arniel Labrada, Stephan Karl, Elisabeth Trinkl, Reinhold Preiner, Benjamin Bustos, Tobias Schreck. 12th Eurographics Workshop on 3D Object Retrieval, Italy, May 2019.

Due to advances in digitization technology, documentation efforts and digital library systems, increasingly large collections of visual Cultural Heritage (CH) object data becomes available, offering rich opportunities for domain analysis, e.g., for comparing, tracing and studying objects created over time. In principle, existing shape- and image-based similarity search methods can aid such domain analysis tasks. However, in practice, visual object data are given in different modalities, including 2D, 3D, sketches or conventional drawings like profile sections or unwrappings. In addition, collections may be distributed across different publications and repositories, posing a challenge for implementing encompassing search and analysis systems. We introduce a methodology and system for cross-modal visual search in CH object data. Specifically, we propose a new query modality based on 3D views enhanced by user sketches (3D+sketch). This allows for adding new context to the search, which is useful e.g., for searching based on incomplete query objects, or for testing hypotheses on existence of certain shapes in a collection. We present an appropriately designed workflow for constructing query views from incomplete 3D objects enhanced by a user sketch based on shape completion and texture inpainting. Visual cues additionally help users compare retrieved objects with the query. We apply our method on a set of relevant 3D and view-based CH object data, demonstrating the feasibility of our approach and its potential to support analysis of domain experts in Archaeology and the field of CH in general.


Bitcoin Price Prediction Through Opinion Mining
Germán Cheuque Cerda and Juan L. Reutter. 2019. Bitcoin Price Prediction Through Opinion Mining. In Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 755–762. DOI:

The Bitcoin protocol and its underlying cryptocurrency have started to shape the way we view digital currency, and opened up a large list of new and interesting challenges. Amongst them, we focus on the question of how is the price of digital currencies affected, which is a natural question especially when considering the price rollercoaster we witnessed for bitcoin in 2017-2018. We work under the hypothesis that price is affected by the web footprint of influential people, we refer to them as crypto-influencers.

In this paper we provide neural models for predicting bitcoin price. We compare what happens when the model is fed only with recent price history versus what happens when fed, in addition, with a measure of the positivity or negativity of the sayings of these influencers, measured through a sentiment analysis of their twitter posts. We show preliminary evidence that twitter data should indeed help to predict the price of bitcoin, even though the measures we use in this paper have a lot of room for refinement. In particular, we also discuss the challenges of measuring the correct sensation of these posts, and discuss the work that should help improving our discoveries even further.

NIFify: Towards Better Quality Entity Linking Datasets
Henry Rosales-Méndez, Aidan Hogan, and Barbara Poblete. 2019. NIFify: Towards Better Quality Entity Linking Datasets. In Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 815–818. DOI:

The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with a corresponding unambiguous entry in a Knowledge Base. The evaluation of EL systems relies on the comparison of their results against gold standards. A common format used to represent gold standard datasets is the NLP Interchange Format (NIF), which uses RDF as a data model. However, creating gold standard datasets for EL is a time-consuming and error-prone process. In this paper we propose a tool called NIFify to help manually generate, curate, visualize and validate EL annotations; the resulting tool is useful, for example, in the creation of gold standard datasets. NIFify also serves as a benchmark tool that enables the assessment of EL results. Using the validation features of NIFify, we further explore the quality of popular EL gold standards.

Car Theft Reports: a Temporal Analysis from a Social Media Perspective
Juglar Diaz and Barbara Poblete. 2019. Car Theft Reports: a Temporal Analysis from a Social Media Perspective. In Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 779–782. DOI:

Complex human behaviors related to crime require multiple sources of information to understand them. Social Media is a place where people share opinions and news. This allows events in the physical world like crimes to be reflected on Social Media. In this paper we study crimes from the perspective of Social Media, specifically car theft and Twitter. We use data of car theft reports from Twitter and car insurance companies in Chile to perform a temporal analysis. We found that there is an increasing correlation in recent years between the number of car theft reports in Twitter and data collected from insurance companies. We performed yearly, monthly, daily and hourly analyses. Though Twitter is an unstructured source and very noisy, it allows you to estimate the volume of thefts that are reported by the insurers. We experimented with a Moving Average to predict the tendency in the number of car theft reported to insurances using Twitter data and found that one month is the best time window for prediction.

Mining the Relationship BetweenCar Theft and Places of Social Interest in Santiago Chile
Karen Oróstica and Barbara Poblete. 2019. Mining the Relationship BetweenCar Theft and Places of Social Interest in Santiago Chile. In Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 811–814. DOI:

Recent work suggests that certain places can be more attractive for car theft based how many people regularly visit them, as well as other factors. In this sense, we must also consider the city or district itself where vehicles are stolen. All cities have different cultural and socioeconomic characteristics that influence car theft patterns. In particular, the distribution of public services and places attract a large crowd could play a key role in the occurrence of car theft. Santiago, a city that displays drastic socioeconomic differences among its districts, presents increasingly-high car theft rates. This represents a serious issue for the city, as for any other major city, which –at least for Santiago– has not been analyzed in depth using quantitative approaches. In this work, we present a preliminary study of how places that create social interest, such as restaurants, bars, schools, and shopping malls, increase car theft frequency in Santiago. We also study if some types of places are more attractive than others for this type of crime. To evaluate this, we propose to analyze car theft points (CTP) from insurance companies and their relationship with places of social interest (PSI) extracted from Google Maps, using a proximity based approach. Our findings show a high correlation between CTP and PSI for all of the social interest categories that we studied in the different districts of the Santiago. In particular our work contributes to the understanding of the social factors that are associated to car thefts.

Recommender Systems for Online Video Game Platforms: the Case of STEAM
Germán Cheuque, José Guzmán, and Denis Parra. 2019. Recommender Systems for Online Video Game Platforms: the Case of STEAM. In Companion Proceedings of The 2019 World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 763–771. DOI:

The world of video games has changed considerably over the recent years. Its diversification has dramatically increased the number of users engaged in online communities of this entertainment area, and consequently, the number and types of games available. This context of information overload underpins the development of recommender systems that could leverage the information that the video game platforms collect, hence following the trend of new games coming out every year. In this work we test the potential of state-of-the-art recommender models based respectively on Factorization Machines (FM), deep neural networks (DeepNN) and one derived from the mixture of both (DeepFM), chosen for their potential of receiving multiple inputs as well as different types of input variables. We evaluate our results measuring the ranking accuracy of the recommendation and the diversity/novelty of a recommendation list. All the algorithms achieve better results than a baseline based on implicit feedback (Alternating Least Squares model). The best performing algorithm is DeepNN, the high order interactions are more important than the low order ones for this recommendation task. We also analyze the effect of the sentiment extracted directly from game reviews, and find that it is not as relevant for recommendation as one might expect. We are the first in studying the aforementioned recommender systems over the context of online video game platforms, reporting novel results which could be used as baseline in future works.

Serialization for Property Graphs
Tomaszuk D., Angles R., Szeremeta Ł., Litman K., Cisterna D. (2019) Serialization for Property Graphs. In: Kozielski S., Mrozek D., Kasprowski P., Małysiak-Mrozek B., Kostrzewa D. (eds) Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis. BDAS 2019. Communications in Computer and Information Science, vol 1018. Springer, Cham.

Graph serialization is very important for the development of graph-oriented applications. In particular, serialization methods are fundamental in graph data management to support database exchange, benchmarking of systems, and data visualization. This paper presents YARS-PG, a data format for serializing property graphs. YARS-PG was designed to be simple, extensible and platform independent, and to support all the features provided by the current database systems based on the property graph data model.


Exposing the president: The political angle of a natural disaster in Chile
Magdalena Saldaña. International Symposium on Online Journalism ISOJ 2019.

Chile is a country with high levels of digital news consumption but decreasing levels of confidence in journalism and traditional news media outlets. In a place where natural disasters are common, Chilean citizens usually turn to digital and social media to find out more information about how events unfold. By relying on in-depth interviews with reporters who covered the 2014 earthquake in northern Chile, this study examines how Chilean journalists approached a highly politicized natural disaster. Results show that reporters covered the earthquake as a political issue due to editorial prompting, and they used social media as another way to get close to the sources they already know, but not to look for alternative sources. The implications of these findings for media scholars and practitioners relate to the normalization of social media use among journalists, and the influence of a news outlet’s political leaning on journalistic practices.


Proof-of-Learning: A Blockchain Consensus Mechanism Based on Machine Learning Competitions
F. Bravo-Marquez, S. Reeves and M. Ugarte, "Proof-of-Learning: A Blockchain Consensus Mechanism Based on Machine Learning Competitions," 2019 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPCON), Newark, CA, USA, 2019, pp. 119-124, doi: 10.1109/DAPPCON.2019.00023

This article presents WekaCoin, a peer-to-peer cryptocurrency based on a new distributed consensus protocol called Proof-of-Learning. Proof-of-learning achieves distributed consensus by ranking machine learning systems for a given task. The aim of this protocol is to alleviate the computational waste involved in hashing-based puzzles and to create a public distributed and verifiable database of state-of-the-art machine learning models and experiments.

Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string
Mirabal, P., Abreu, J., & Seco, D. (2019). Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string. Pattern Recognition Letters, 120, 104–111. DOI:

Different pattern recognition techniques such as clustering, k-nearest neighbor classification, or instance reduction algorithms require prototypes to represent pattern classes. In many applications, strings are used to encode instances, for example, in contour representations or in biological data such as DNA, RNA, and protein sequences. Median strings have been used as representatives of a set of strings in different domains. Finding the median string is an NP-Complete problem for several formulations. Alternatively, heuristic approaches that iteratively refine an initial coarse solution by applying edit operations have been proposed. We propose here a novel algorithm that outperforms the state of the art heuristic approximations to the median string in terms of convergence speed by estimating the effect of a perturbation in the minimization of the expressions that define the median strings. We present comparative experiments to validate these results.


Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features
Messina, P., Dominguez, V., Parra, D. et al. Content-based artwork recommendation: integrating painting metadata with neural and manually-engineered visual features. User Model User-Adap Inter 29, 251–290 (2019).

Recommender Systems help us deal with information overload by suggesting relevant items based on our personal preferences. Although there is a large body of research in areas such as movies or music, artwork recommendation has received comparatively little attention, despite the continuous growth of the artwork market. Most previous research has relied on ratings and metadata, and a few recent works have exploited visual features extracted with deep neural networks (DNN) to recommend digital art. In this work, we contribute to the area of content-based artwork recommendation of physical paintings by studying the impact of the aforementioned features (artwork metadata, neural visual features), as well as manually-engineered visual features, such as naturalness, brightness and contrast. We implement and evaluate our method using transactional data from, an online artwork store. Our results show that artwork recommendations based on a hybrid combination of artist preference, curated attributes, deep neural visual features and manually-engineered visual features produce the best performance. Moreover, we discuss the trade-off between automatically obtained DNN features and manually-engineered visual features for the purpose of explainability, as well as the impact of user profile size on predictions. Our research informs the development of next-generation content-based artwork recommenders which rely on different types of data, from text to multimedia.

Dv2v: A Dynamic Variable-to-Variable Compressor
N. R. Brisaboa, A. Fariña, A. Gómez-Brandón, G. Navarro and T. V. Rodeiro, "Dv2v: A Dynamic Variable-to-Variable Compressor," 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 2019, pp. 83-92,

We present D-v2v, a new dynamic (one-pass) variable-to-variable compressor. Variable-to-variable compression aims at using a modeler that gathers variable-length input symbols and a variable-length statistical coder that assigns shorter codewords to the more frequent symbols. In D-v2v, we process the input text word-wise to gather variable-length symbols that can be either terminals (new words) or non-terminals, subsequences of words seen before in the input text. Those input symbols are set in a vocabulary that is kept sorted by frequency. Therefore, those symbols can be easily encoded with dense codes. Our D-v2v permits real-time transmission of data, i.e. compression/transmission can begin as soon as data become available. Our experiments show thatD-v2vis able to overcome the compression ratios of the v2vDC, the state-of-the-art semi-static variable-to-variable compressor, and to almost reach p7zip values. It also draws a competitive performance at both compression and decompression.


A Compact Representation of Raster Time Series
N. Cruces, D. Seco, and G. Guitérrez, "A Compact Representation of Raster Time Series," 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 2019, pp. 103-111, DOI:

The raster model is widely used in Geographic Information Systems to represent data that vary continuously in space, such as temperatures, precipitations, elevation, among other spatial attributes. In applications like weather forecast systems, not just a single raster, but a sequence of rasters covering the same region at different timestamps, known as a raster time series, needs to be stored and queried. Compact data structures have proven successful to provide space-efficient representations of rasters with query capabilities. Hence, a naive approach to save space is to use such a representation for each raster in a time series. However, in this paper, we show that it is possible to take advantage of the temporal locality that exists in a raster time series to reduce the space necessary to store it while keeping competitive query times for several types of queries


Datalog: Bag Semantics via Set Semantics
Leopoldo Bertossi and Georg Gottlob and Reinhard Pichler. 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs). Vol 127.

Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called warded Datalog±, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog± is extended to give a set semantics to duplicates in Datalog± itself. We investigate the properties of the resulting Datalog± programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.

A Formal Framework for Complex Event Processing
Alejandro Grez and Cristian Riveros and Martín Ugarte. 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs). Vol 127. DOI: 10.4230/LIPIcs.ICDT.2019.5.

Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack from a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. Finally, we gather the main results of this work to present an efficient and declarative framework for CEP.


Evaluating content novelty in recommender systems
Mendoza, M., Torres, N. Evaluating content novelty in recommender systems. J Intell Inf Syst 54, 297–316 (2020).

Recommender systems are frequently evaluated using performance indexes based on variants and extensions of precision-like measures. As these measures are biased toward popular items, a list of recommendations simply must include a few popular items to perform well. To address the popularity bias challenge, new approaches for novelty and diversity evaluation have been proposed. On the one hand, novelty-based approaches model the quality of being new as apposed to that which is already known. Novelty approaches are commonly based on item views or user rates. On the other hand, diversity approaches model the quality of an item that is composed of different content elements. Diversity measures are commonly rooted in content-based features that characterize the diversity of the content of an item in terms of the presence/absence of a number of predefined nuggets of information. As item contents are also biased to popular contents (e.g., drama in movies or pop in music), diversity-based measures are also popularity biased. To alleviate the effect of popularity bias on diversity measures, we used an evaluation approach based on the degree of novelty of the elements that make up each item. We named this approach content novelty, as it mixes content and diversity approaches in a single and coherent evaluation framework. Experimental results show that our proposal is feasible and useful. Our findings demonstrate that the proposed measures yield consistent and interpretable results, producing insights that reduce the impact of popularity bias in the evaluation of recommender systems.


Microservice-Oriented Platform for Internet of Big Data Analytics: A Proof of Concept
Li, Z.; Seco, D.; Sánchez Rodríguez, A.E. Microservice-Oriented Platform for Internet of Big Data Analytics: A Proof of Concept. Sensors 2019, 19, 1134. doi

The ubiquitous Internet of Things (IoT) devices nowadays are generating various and numerous data from everywhere at any time. Since it is not always necessary to centralize and analyze IoT data cumulatively (e.g., the Monte Carlo analytics and Convergence analytics demonstrated in this article), the traditional implementations of big data analytics (BDA) will suffer from unnecessary and expensive data transmissions as a result of the tight coupling between computing resource management and data processing logic. Inspired by software-defined infrastructure (SDI), we propose the “micro service-oriented platform” to break the environmental monolith and further decouple data processing logics from their underlying resource management in order to facilitate BDA implementations in the IoT environment (which we name “IoBDA”). Given predesigned standard microservices with respect to specific data processing logics, the proposed platform is expected to largely reduce the complexity in and relieve inexperienced practices of IoBDA implementations. The potential contributions to the relevant communities include (1) new theories of a micro service-oriented platform on top of SDI and (2) a functional micro service-oriented platform for IoBDA with a group of predesigned microservices


The agenda-setting role of the news media
Sebastián Valenzuela, Maxwell McCombs. "The agenda-setting role of the news media" in "An Integrated Approach to Communication Theory and Research"; Chapter Eight, 99-112. March 2019. DOI: 10.4324/9780203710753-10

One of the important roles of the mass media is the setting of agendas in daily life. What is emphasized in the media, whether traditional or digital, has been found to have a profound impact on not only what people think, but the salience of the issues at any given point in time. This chapter reviews the theory behind agenda-setting and the variables that form, shape, and prime the public’s opinions, attitudes, and behaviors. The chapter also examines the new media and how it impacts on agenda-setting theory and research.

The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images
Vicente Dominguez, Pablo Messina, Ivania Donoso-Guzmán, and Denis Parra. 2019. The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI '19). Association for Computing Machinery, New York, NY, USA, 408–416. DOI:

There are very few works about explaining content-based recommendations of images in the artistic domain. Current works do not provide a perspective of the many variables involved in the user perception of several aspects of the system such as domain knowledge, relevance, explainability, and trust. In this paper, we aim to fill this gap by studying three interfaces, with different levels of explainability, for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability and relevance. Furthermore, our results show that the observed effects are also dependent on the underlying recommendation algorithm used. We tested two algorithms: Deep Neural Networks (DNN), which has high accuracy, and Attractiveness Visual Features (AVF) with high transparency but lower accuracy. Our results indicate that algorithms should not be studied in isolation, but rather in conjunction with interfaces, since both play a significant role in the perception of explainability and trust for image recommendation. Finally, using the framework by Knijnenburg et al., we provide a comprehensive model which synthesizes the effects between different variables involved in the user experience with explainable visual recommender systems of artistic images.

Do Conditionalities Increase Support for Government Transfers?
Cesar Zucco, Juan Pablo Luna & O. Gokce Baykal (2020) Do Conditionalities Increase Support for Government Transfers?, The Journal of Development Studies, 56:3, 527-544, DOI: 10.1080/00220388.2019.1577388

Conditional Cash Transfers (CCTs) have spread through the developing world in the past two decades. It is often assumed that CCTs enjoy political support in the population precisely because they impose conditions on beneficiaries. This article employs survey experiments in Brazil and Turkey to determine whether, and in what contexts, making government transfers conditional on the behavior of beneficiaries increases political support for the programs. Results show that conditional transfers are only marginally more popular than similar unconditional transfers in nationally representative samples, but that this difference is substantially larger among the better-off and among those primed to think of themselves as different from beneficiaries. These findings imply that conditionalities per se are not as strong a determinant of support for transfers as the literature suggests, but that they can still be helpful in building support for transfers among subsets of the population that are least likely to support them.


Tag-based information access in image collections: insights from log and eye-gaze analyses
Lin, Y., Parra, D., Trattner, C. et al. Tag-based information access in image collections: insights from log and eye-gaze analyses. Knowl Inf Syst 61, 1715–1742 (2019).

Tag clouds have been utilized as a “social” way to find and visualize information, providing both one-click access and a snapshot of the “aboutness” of a tagged collection. While many research projects have explored and compared various tag artifacts using information theory and simulations, fewer studies have been conducted to compare the effectiveness of different tag-based browsing interfaces from the user’s point of view. This research aims to investigate how users utilize tags in image search context and to what extent different organizations of tag browsing interfaces are useful for image search. We conducted two experiments to explore user behavior and performance with three interfaces: two tag-enabled interfaces (the regular and faceted tag-clouds) and a baseline (search-only) interface. Our results demonstrate the value of tags in the image search context, the role of tags in the exploratory search, and the strengths of two kinds of tag organization explored in this paper.

Nowcasting earthquake damages with Twitter.
Mendoza, M., Poblete, B. & Valderrama, I. Nowcasting earthquake damages with Twitter. EPJ Data Sci. 8, 3 (2019).

The Modified Mercalli intensity scale (Mercalli scale for short) is a qualitative measure used to express the perceived intensity of an earthquake in terms of damages. Accurate intensity reports are vital to estimate the type of emergency response required for a particular earthquake. In addition, Mercalli scale reports are needed to estimate the possible consequences of strong earthquakes in the future, based on the effects of previous events. Emergency offices and seismological agencies worldwide are in charge of producing Mercalli scale reports for each affected location after an earthquake. However, this task relies heavily on human observers in the affected locations, who are not always available or accurate. Consequently, Mercalli scale reports may take up to hours or even days to be published after an earthquake. We address this problem by proposing a method for early prediction of spatial Mercalli scale reports based on people’s reactions to earthquakes in social networks. By tracking users’ comments about real-time earthquakes, we create a collection of Mercalli scale point estimates at municipality (i.e., state subdivisions) level granularity. We introduce the concept of reinforced Mercalli support, which combines Mercalli scale point estimates with locally supported data (named ‘local support’). We use this concept to provide Mercalli scale estimates for real-world events by providing smooth point estimates using a spatial smoother that incorporates the distribution of municipalities in each affected region. Our method is the first method based on social media that can provide spatial reports of damages in the Mercalli intensity scale. Experimental results show that our method is accurate and provides early spatial Mercalli reports 30 minutes after an earthquake. Furthermore, we show that our method performs well for earthquake spatial detection and maximum intensity prediction tasks. Our findings indicate that social media is a valuable source of spatial information for quickly estimating earthquake damages.

GraCT: A Grammar-based Compressed Index for Trajectory Data
Brisaboa, N. R., Gómez-Brandón, A., Navarro, G., & Paramá, J. R. (2019). GraCT: A Grammar-based Compressed Index for Trajectory Data. Information Sciences, 483, 106–135. DOI

We introduce a compressed data structure for the storage of free trajectories of moving objects that efficiently supports various spatio-temporal queries. Our structure, dubbed GraCT, stores the absolute positions of all the objects at regular time intervals (snapshots) using a k2-tree, which is a space- and time-efficient region quadtree. Positions between snapshots are represented as logs of relative movements and compressed using a grammar-based compressor. The non-terminals of this grammar are enhanced with MBR information to enable fast queries.

The GraCT structure of a dataset occupies less than the raw data compressed with a powerful traditional compressor. Further, instead of requiring full decompression to access the data like a traditional compressor, GraCT supports direct access to object trajectories or to their position at specific time instants, as well as spatial range and nearest-neighbor queries on time instants and/or time intervals.

Compared to traditional methods for storing and indexing spatio-temporal data, GraCT requires two orders of magnitude less space and is competitive in query times. In particular, thanks to its compressed representation, the GraCT structure may reside in main memory in situations where any classical uncompressed index must resort to disk, thereby being one or two orders of magnitude faster.


Interpretable Visual Question Answering by Visual Grounding From Attention Supervision Mining
Y. Zhang, J. C. Niebles and A. Soto, "Interpretable Visual Question Answering by Visual Grounding From Attention Supervision Mining," 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 2019, pp. 349-357, doi: 10.1109/WACV.2019.00043

A key aspect of visual question answering (VQA) models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with respect to manually-annotated groundings, meanwhile achieving state-of-the-art VQA accuracy.

Query rewriting for semantic query optimization in spatial databases
Mella, E., Rodríguez, M.A., Bravo, L. et al. Query rewriting for semantic query optimization in spatial databases. Geoinformatica 23, 79–104 (2019).

Query processing is an important challenge for spatial databases due to the use of complex data types that represent spatial attributes. In particular, due to the cost of spatial joins, several optimization algorithms based on indexing structures exist. The work in this paper proposes a strategy for semantic query optimization of spatial join queries. The strategy detects queries with empty results and rewrites queries to eliminate unnecessary spatial joins or to replace spatial by thematic joins. This is done automatically by analyzing the semantics imposed by the database schema through topological dependencies and topological referential integrity constraints. In this way, the strategy comes to complement current state-of-art algorithms for processing spatial join queries. The experimental evaluation with real data sets shows that the optimization strategy can achieve a decrease in the time cost of a join query using indexing structures in a spatial database management system (SDBMS).


Explaining subjective perceptions of public spaces as a function of the built environment: A massive data approach
Rossetti, T., Lobel, H., Rocco, V., & Hurtubia, R. (2019). Explaining subjective perceptions of public spaces as a function of the built environment: A massive data approach. Landscape and Urban Planning, 181, 169-178.

People’s perceptions of the built environment influence the way they use and navigate it. Understanding these perceptions may be useful to inform the design, management and planning process of public spaces. Recently, several studies have used data collected at a massive scale and machine learning methods to quantify these perceptions, showing promising results in terms of predictive performance. Nevertheless, most of these models can be of little help in understanding users’ perceptions due to the difficulty associated with identifying the importance of each attribute of landscapes. In this work, we propose a novel approach to quantify perceptions of landscapes through discrete choice models, using semantic segmentations of images of public spaces, generated through machine learning algorithms, as explanatory variables. The proposed models are estimated using the Place Pulse dataset, with over 1.2 million perceptual indicators, and are able to provide useful insights into how users perceive the built environment as a function of its features. The models obtained are used to infer perceptual variables in the city of Santiago, Chile, and show they have a significant correlation with socioeconomic indicators.

On the Turing Completeness of Modern Neural Network Architectures
Jorge Pérez, Javier Marinković, Pablo Barceló. ICLR 2019 Conference.

Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.


Territorial sovereignty and the end of inter-cultural diplomacy along the “Southern frontier”
Carsten-Andreas Schulz. European Journal of International Relations.

European politics at the turn of the 19th century saw a dramatic reduction in the number and diversity of polities as the territorial nation-state emerged as the dominant form of political organization. The transformation had a profound impact on the periphery. The study examines how embracing the principle of territoriality transformed relations between settler societies and indigenous peoples in South America. As this shift coincided with independence from Spain, Creole elites rapidly dismantled the remnants of imperial heteronomy, ending centuries of inter-cultural diplomacy. The study illustrates this shift in the case of the “Southern frontier,” where Spain had maintained a practice of treaty making with the Mapuche people since the mid-17th century. This long-standing practice broke down shortly after Chile gained independence in 1818. What followed was a policy of coercive assimilation through military conquest and forced displacement — a policy that settler societies implemented elsewhere in the 19th century. In contrast to explanations that emphasize the spread of capitalist agriculture and racist ideologies, this study argues that territoriality spelled the end of inter-cultural diplomacy along the “Southern frontier.”


Competitiveness of a Non-Linear Block-Space GPU Thread Map for Simplex Domains
Cristobal A. Navarro, Matthieu Vernier, Benjamin Bustos, Nancy Hitschfeld. IEEE Trans. Parallel Distrib. Syst. 29(12): 2728-2741 (2018).

This work presents and studies the efficiency problem of mapping GPU threads onto simplex domains. A non-linear map λ(ω) is formulated based on a block-space enumeration principle that reduces the number of thread-blocks by a factor of approximately 2× and 6× for 2-simplex and 3-simplex domains, respectively, when compared to the standard approach. Performance results show that λ(ω) is competitive and even the fastest map when ran in recent GPU architectures such as the Tesla V100, where it reaches up to 1.5× of speedup in 2-simplex tests. In 3-simplex tests, it reaches up to 2.3× of speedup for small workloads and up to 1.25× for larger ones. The results obtained make λ(ω) a useful GPU optimization technique with applications on parallel problems that define all-pairs, all-triplets or nearest neighbors interactions in a 2-simplex or 3-simplex domain.


Profiling Graphs: Order from Chaos
Aidan Hogan. WWW (Companion Volume) 2018: 1481-1482.

Graphs are being increasingly adopted as a flexible data model in scenarios (e.g., Google’s Knowledge Graph, Facebook’s Graph API, Wikidata, etc.) where multiple editors are involved in content creation, where the schema is ever changing, where data are incomplete, where the connectivity of resources plays a key rolescenarios where relational models traditionally struggle. But with this flexibility comes a conceptual cost: it can be difficult to summarise and understand, at a high level, the content that a given graph contains. Hence profiling graphs becomes of increasing importance to extract order, a posteriori, from the chaotic processes by which such graphs are generated. This talk will motivate the use of graphs as a data model, abstract recent trends in graph data management, and then turn to the issue of profiling and summarising graphs: what are the goals of such profiling, the principles by which graphs can be summarised, the main techniques by which this can/could be achieved The talk will emphasise the importance of profiling graphs while highlighting a variety of open research questions yet to be tackled.


Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Jan Van den Bussche, Marcelo Arenas: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018. ACM 2018.

Jan Van den Bussche, Marcelo Arenas: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018. ACM 2018.

This volume contains the proceedings of PODS 2018, which include a paper for the keynote addressed by Michael Benedikt (University of Oxford), abstracts based on two invited tutorials by Rajeev Raman (University of Leicester) and Arvind Narayanan (Princeton University), and 29 contributions that were selected by the Program Committee for presentation at the symposium.

In addition, this volume also contains papers from our two «Gems of PODS» speakers, Hung Ngo (Relational AI) and Phokion G. Kolaitis (UC Santa Cruz and IBM Research – Almaden). The Gems of PODS is an event, started in 2016, where the goal is to promote understanding of past seminal PODS results to the general audience. The Gems of PODS papers were selected by the Gems of PODS committee consisting of Marcelo Arenas (chair) (Pontificia Universidad Católica de Chile), Tova Milo (Tel Aviv University) and Dan Olteanu (Oxford University).

This year, PODS continued with two submission cycles that were introduced two years ago. The first cycle allowed for the possibility for papers to be revised and resubmitted. For the first cycle, 30 papers were submitted, 7 of which were directly selected for inclusion in the proceedings, and 5 were invited for a resubmission after a revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 4 out of 5 revised papers were selected for the proceedings. For the second cycle, 53 papers were submitted, 18 of which were selected, resulting in 29 papers selected overall from a total number of 83 submissions.

An important task for the Program Committee has been the selection of the PODS 2018 Best Paper Award. The committee selected the paper: «Entity Matching with Active Monotone Classification» by Yufei Tao. On behalf of the committee, we would like to extend our sincere congratulations to the author!

Since 2008, PODS assigns the ACM PODS Alberto O. Mendelzon Test-of-Time Award to a paper or a small number of papers published in the PODS proceedings ten years prior that had the most impact over the intervening decade. This year’s committee, consisting of Maurizio Lenzerini, Wim Martens and Nicole Schweikardt, selected the following paper: «The Chase Revisited» by Alin Deutsch, Alan Nash and Jeff Remmel.


Organic Visualization of Document Evolution
Ignacio Pérez-Messina, Claudio Gutiérrez, Eduardo Graells-Garrido. IUI 2018: 497-501.

Recent availability of data about writing processes at keystroke-granularity has enabled research on the evolution of document writing. A natural task is to develop systems that can actually show this data, that is, user interfaces that transform the data of the process of writing –today a black box– into intelligible forms. On this line, we propose a data structure that captures a document’s fine-grained history and an organic visualization that serves as an interface to it. We evaluate a proof-of-concept implementation of the system through a pilot study using documents written by students at a public university. Our results are promising and reveal facets such as general strategies adopted, local edition density and hierarchical structure of the final text.


Modelling Dynamics in Semantic Web Knowledge Graphs with Formal Concept Analysis
Larry González, Aidan Hogan. WWW 2018: 1175-1184.

In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.



Querying APIs with SPARQL: Language and Worst-Case Optimal Algorithms
Matthieu Mosser, Fernando Pieressa, Juan L. Reutter, Adrián Soto, Domagoj Vrgoc. ESWC 2018: 639-654.

Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web is still residing in other formats, and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language allowing to bring data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the amount of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also do a set of experiments that empirically confirm the optimality of our approach.


An Introduction to Graph Data Management
Renzo Angles, Claudio Gutierrez. Graph Data Management 2018: 1-32.

Graph data management concerns the research and development of powerful technologies for storing, processing and analyzing large volumes of graph data. This chapter presents an overview about the foundations and systems for graph data management. Specifically, we present a historical overview of the area, studied graph database models, characterized essential graph-oriented queries, reviewed graph query languages, and explore the features of current graph data management systems (i.e. graph databases and graph-processing frameworks).


A compact representation for trips over networks built on self-indexes
Nieves R. Brisaboa, Antonio Fariña, Daniil Galaktionov, M. Andrea Rodríguez. Inf. Syst. 78: 1-22 (2018)

Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users’ trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments.

CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu–Tucker-shaped Wavelet-Tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50–70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1–1000 µs) of CTR.

Highlights: -We provide a representation for trips over networks and answer counting-based queries. -We adapt a Compressed Suffix Array to deal with the spatial component of trips. -We use a wavelet matrix or a Hu–tucker-shaped Wavelet Tree for the temporal component. -Experiments show space needs until a 50% when compared with a plain representation. -Experiments show counting-based query-times typically within 1–1000 µs.


Assessing the public transport travel behavior consistency from smart card data
Catalina Espinoza, Marcela Munizaga, Benjamín Bustos, Martin Trépanier. Transportation Research Procedia 32:44-53. Elsevier, 2018.

The aim of this research is to measure, using smart card data, how much do public transport users change their behavior through time. To quantify the change in behavior, we split the smart card records of a user into a set of separated time windows. Then, we measure the variation between each pair of time windows. Three algorithms that calculate the variation in users’ mobility are assessed. Using data from a Canadian transport system, we show that measuring the stability of user behavior at an individual level provides new insights for public transport operators, e.g., it can be used to measure users’ adaptability to changes in the transport system.


Efficacy and the Reproduction of Political Activism. Evidence from the Broad Front in Uruguay
Bentancur, V. P., Rodríguez, R. P., & Rosenblatt, F. (2019). Efficacy and the Reproduction of Political Activism: Evidence From the Broad Front in Uruguay. Comparative Political Studies, 52(6), 838–867.

The professionalization of politics and the disappearance of party organizations based on activists seems an inescapable trend. This article shows, by studying the Broad Front of Uruguay as a deviant case, the relevance of organizational rules for explaining the reproduction of party activism. Using data from both an online survey of people differing in their levels of engagement with the Broad Front and in-depth interviews with party activists, we show that those with relatively low levels of engagement—“adherents”—and activists differ in their willingness to cooperate with the party and in the time they devote to party activities. Also, we find that reducing the perceived efficacy of political engagement strongly decreases activists’ self-reported willingness to engage with the party, while this reduction has no effect upon adherents. These findings suggest that the design of organizational rules that grant a political role to grassroots organizers can promote party activism.


Using Compressed Suffix-Arrays for a compact representation of temporal-graphs
Nieves R. Brisaboa, Diego Caro, Antonio Fariña, M. Andrea Rodríguez. Inf. Sci. 465: 459-483 (2018)

Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.

Highlights: -We consider the problem of representing temporal graphs in a compact way. -We can represent temporal graphs with contacts on a same edge that temporally overlap. -We design TGCSA and SHOW how it solves typical temporal queries. -We create a novel representation of Ψ that improves the performance of TGCSA. -We obtain a reasonable space/time tradeoff even on complex temporal queries.


GraFa: Faceted Search & Browsing for the Wikidata Knowledge Graph
José Moreno-Vega, Aidan Hogan. International Semantic Web Conference (P&D/Industry/BlueSky) 2018.

We present a demo of the GraFa faceted search and browsing interface over the Wikidata knowledge graph. We describe the key aspects of the interface, including the types of interactions that the system allows, the ranking schemes employed, and other features to aid usability. We also discuss future plans for improving the system. Online Demo:



DockerPedia: a Knowledge Graph of Docker Images
Maximiliano Osorio, Carlos Buil Aranda, Hernán Vargas. International Semantic Web Conference (P&D/Industry/BlueSky) 2018.

Docker is the most popular implementation of Operating System virtualization, currently its online registry service (Docker Hub) stores more than 4.5 millions of software images. Using that registry it is possible to download and deploy Docker images as software containers. However, these images only show information of the main software, hiding the dependencies needed to run it. To allow users to track what they deploy into their machines, we developed DockerPedia, a resource that publishes information of the packages within the Docker images as Linked Data.

Currently our resource includes 28% of the most downloaded images from Docker Hub providing information about the software dependencies and its vulnerabilities allowing to easily reproduce the environment in which each image was deployed as well as to check the security of the image without the need to download it.



QCan: Normalising Congruent SPARQL Queries
Jaime Salas, Aidan Hogan. International Semantic Web Conference (P&D/Industry/BlueSky) 2018.

We demonstrate a system to canonicalise (aka. normalise) SPARQL queries for use-cases such as caching, query log analysis, query minimisation, signing queries, etc. Our canonicalisation method deterministically rewrites a given input query to an equivalent canonical form such that the results for two queries are syntactically (string) equal if and only if they give the same  results on any database, modulo variable names. The method is sound and complete for a monotone fragment of SPARQL with  selection (equalities), projection, join and union under both set and bag semantics. Considering other SPARQL features (e.g., optional, filter, graph, etc.), the underlying equivalence problem becomes undecidable, where we currently rather support a best-effort canonicalisation for other SPARQL 1.0. features. We demonstrate a prototype of our canonicalisation framework, provide example rewritings, and discuss limitations, use-cases and future work. Demo link:


A Tractable Notion of Stratification for SHACL
Julien Corman, Juan L. Reutter, Ognjen Savkovic. International Semantic Web Conference (P&D/Industry/BlueSky) 2018.

One of the challenges of recent RDF-based applications is managing data quality [1], and several systems already provide RDF validation procedures (e.g., https://www., This created the need for a standardized declarative constraint language for RDF, and for mechanisms to detect violations of such constraints. An important step in this direction is SHACL, or Shapes Constraint Language ( which has become a W3C recommendation in 2017. The SHACL specification however leaves explicitly undefined the validation of recursive constraints. In a previous article [2], we showed that extending the specification’s semantics to accommodate for recursion leads to intractability (in the size of the graph) for the so-called “core constraint components” of SHACL. This result holds for stratified constraints already, which may come as a surprise, considering that stratification guarantees tractability in well-studied recursive languages such as Datalog. Our previous work identified a tractable fragment of SHACL’s core components. In this paper, we propose an alternative approach to gain tractability, retaining all SHACL operators, but strengthening the stratification condition traditionally used in logic programming. More exactly, we introduce a syntactic condition on shape constraints called “strict stratification”, which guarantees that graph validation is in PTIME in combined (i.e. graph and constraints) complexity. We also describe a procedure to perform such validation. The current paper is not self-contained, due to space limitations, but all definitions can be found in our previous article [2] or its online extended version [3].


Machine Translation vs. Multilingual Approaches for Entity Linking
Henry Rosales-Méndez, Aidan Hogan, Barbara Poblete. International Semantic Web Conference (P&D/Industry/BlueSky) 2018.

Entity Linking (EL) associates the entities mentioned in a given input text with their corresponding knowledge-base (KB) entries. A recent EL trend is towards multilingual approaches. However, one may ask: are multilingual EL approaches necessary with recent advancements in machine translation? Could we not simply focus on supporting one language in the EL system and translate the input text to that language? We present experiments along these lines comparing multilingual EL systems with their results over machine translated text.



End-to-End Joint Semantic Segmentation of Actors and Actions in Video
Jingwei Ji, Shyamal Buch, Alvaro Soto, Juan Carlos Niebles. ECCV (4) 2018: 734-749.

Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined task of providing semantic segmentation for different actor classes simultaneously with their action class remains a challenging but necessary task for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning in the video to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying our approach improves performance for zero-shot recognition, indicating generalizability of our jointly learned feature space.



Towards Explanations for Visual Recommender Systems of Artistic Images
Vicente Domínguez, Pablo Messina, Christoph Trattner, Denis Parra. IntRS@RecSys 2018: 69-73.

Explaining automatic recommendations is an active area of research since it has shown an important e‚ect on users’ acceptance over the items recommended. However, there is a lack of research in explaining content-based recommendations of images based on visual features. In this paper, we aim to €ll this gap by testing three di‚erent interfaces (one baseline and two novel explanation interfaces) for artistic image recommendation. Our experiments with N=121 users con€rm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability, relevance, and diversity. Furthermore, our experiments show that the results are also dependent on the underlying recommendation algorithm used. We tested the interfaces with two algorithms: Deep Neural Networks (DNN), with high accuracy but with dicult to explain features, and the more explainable method based on A‹ractiveness Visual Features (AVF). ‘e be‹er the accuracy performance –in our case the DNN method– the stronger the positive e‚ect of the explainable interface. Notably, the explainable features of the AVF method increased the perception of explainability but did not increase the perception of trust, unlike DNN, which improved both dimensions. ‘ese results indicate that algorithms in conjunction with interfaces play a signi€cant role in the perception of explainability and trust for image recommendation. We plan to further investigate the relationship between interface explainability and algorithmic performance in recommender systems.


New Structures to Solve Aggregated Queries for Trips over Public Transportation Networks
Nieves R. Brisaboa, Antonio Fariña, Daniil Galaktionov, Tirso V. Rodeiro, M. Andrea Rodríguez. SPIRE 2018: 88-101.

Representing the trajectories of mobile objects is a hot topic from the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains) where user’s trips can be seen as a sequence of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces the redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey.

To solve data management for transportation systems, we designed a conceptual model that gave us a better insight into this data domain and allowed us the definition of relevant terms and the detection of redundancy sources among those data. Then, we designed two compact representations focused on users’ trips (𝖳𝖳𝖢𝖳𝖱) and on vehicle trips (𝖠𝖼𝗎𝗆𝖬), respectively. Each approach owns some strengths and is able to answer some queries efficiently.

We include experimental results over synthetic trips generated from accurate schedules obtained from a real network description (from the bus transportation system of Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.



VoxEL: A Benchmark Dataset for Multilingual Entity Linking
Henry Rosales-Méndez, Aidan Hogan, Barbara Poblete. International Semantic Web Conference (2) 2018: 170-186.

The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with corresponding entities in a given knowledge base. While traditional EL approaches have largely focused on English texts, current trends are towards language-agnostic or otherwise multilingual approaches that can perform EL over texts in many languages. One of the obstacles to ongoing research on multilingual EL is a scarcity of annotated datasets with the same text in different languages. In this work we thus propose VoxEL: a manually-annotated gold standard for multilingual EL featuring the same text expressed in five European languages. We first motivate and describe the VoxEL dataset, using it to compare the behaviour of state of the art EL (multilingual) systems for five different languages, contrasting these results with those obtained using machine translation to English. Overall, our results identify how five state-of-the-art multilingual EL systems compare for various languages, how the results of different languages compare, and further suggest that machine translation of input text to English is now a competitive alternative to dedicated multilingual EL configurations.


Robust Detection of Extreme Events Using Twitter: Worldwide Earthquake Monitoring
Barbara Poblete, Jheser Guzman, Jazmine A. Maldonado Flores, Felipe A. Tobar. IEEE Trans. Multimedia 20(10): 2551-2561 (2018).

Timely detection and accurate description of extreme events, such as natural disasters and other crisis situations, are crucial for emergency management and mitigation. Extreme-event detection is challenging, since one has to rely upon reports from human observers appointed to specific geographical areas, or on an expensive and sophisticated infrastructure. In the case of earthquakes, geographically dense sensor networks are expensive to deploy and maintain. Therefore, only some regions-or even countries-are able to acquire useful information about the effects of earthquakes in their own territory. An inexpensive and viable alternative to this problem is to detect extreme real-world events through people’s reactions in online social networks. In particular, Twitter has gained popularity within the scientific community for providing access to real-time “citizen sensor” activity. Nevertheless, the massive amount of messages in the Twitter stream, along with the noise it contains, underpin a number of difficulties when it comes to Twitter-based event detection. We contribute to address these challenges by proposing an online method for detecting unusual bursts in discrete-time signals extracted from Twitter. This method only requires a one-off semisupervised initialization and can be scaled to track multiple signals in a robust manner. We also show empirically how our proposed approach, which was envisioned for generic event detection, can be adapted for worldwide earthquake detection, where we compare the proposed model to the state of the art for earthquake tracking using social media. Experimental results validate our approach as a competitive alternative in terms of precision and recall to leading solutions, with the advantage of implementation simplicity and worldwide scalability.



GraFa: Scalable Faceted Browsing for RDF Graphs
José Moreno-Vega, Aidan Hogan. International Semantic Web Conference (1) 2018: 301-317.

Faceted browsing has become a popular paradigm for user interfaces on the Web and has also been investigated in the context of RDF graphs. However, current faceted browsers for RDF graphs encounter performance issues when faced with two challenges: scale, where large datasets generate many results, and heterogeneity, where large numbers of properties and classes generate many facets. To address these challenges, we propose GraFa: a faceted browsing system for heterogeneous large-scale RDF graphs based on a materialisation strategy that performs an offline analysis of the input graph in order to identify a subset of the exponential number of possible facet combinations that are candidates for indexing. In experiments over Wikidata, we demonstrate that materialisation allows for displaying (exact) faceted views over millions of diverse results in under a second while keeping index sizes relatively small. We also present initial usability studies over GraFa.



A Multi-resolution Approximation for Time Series
Sanchez, H., Bustos, B. A Multi-resolution Approximation for Time Series. Neural Process Lett 52, 75–96 (2020).

Time series is a common and well-known way for describing temporal data. However, most of the state-of-the-art techniques for analysing time series have focused on generating a representation for a single level of resolution. For analysing of a time series at several levels of resolutions, one would require to compute different representations, one for each resolution level. We introduce a multi-resolution representation for time series based on local trends and mean values. We require the level of resolution as parameter, but it can be automatically computed if we consider the maximum resolution of the time series. Our technique represents a time series using trend-value pairs on each segment belonging to a resolution level. To provide a useful representation for data mining tasks, we also propose dissimilarity measures and a symbolic representation based on the SAX technique for efficient similarity search using a multi-resolution indexing scheme. We evaluate our method for classification and discord discovery tasks over a diversity of data domains, achieving a better performance in terms of efficiency and effectiveness compared with some of the best-known classic techniques. Indeed, for some of the experiments, the time series mining algorithms using our multi-resolution representation were an order of magnitude faster, in terms of distance computations, than the state of the art.

Do better ImageNet models transfer better… for image recommendation?
Felipe del-Rio, Pablo Messina, Vicente Dominguez, Denis Parra. CoRR abs/1807.09870 (2018)

Visual embeddings from Convolutional Neural Networks (CNN) trained on the ImageNet dataset for the ILSVRC challenge have shown consistently good performance for transfer learning and are widely used in several tasks, including image recommendation. However, some important questions have not yet been answered in order to use these embeddings for a larger scope of recommendation domains: a) Do CNNs that perform better in ImageNet are also better for transfer learning in content-based image recommendation?, b) Does fine-tuning help to improve performance? and c) Which is the best way to perform the fine-tuning?

In this paper we compare several CNN models pre-trained with ImageNet to evaluate their transfer learning performance to an artwork image recommendation task. Our results indicate that models with better performance in the ImageNet challenge do not always imply better transfer learning for recommendation tasks (e.g. NASNet vs. ResNet). Our results also show that fine-tuning can be helpful even with a small dataset, but not every fine-tuning works. Our results can inform other researchers and practitioners on how to train their CNNs for better transfer learning towards image recommendation systems.



On the Progression of Situation Calculus Universal Theories with Constants
Marcelo Arenas, Jorge A. Baier, Juan S. Navarro, Sebastian Sardiña. KR 2018: 484-493.

The progression of action theories is an important problem in knowledge representation. Progression is second-order definable and known to be first-order definable and effectively computable for restricted classes of theories. Motivated by the fact that universal theories with constants (UTCs) are expressive and natural theories whose satisfiability is decidable, in this paper we provide a thorough study of the progression of situation calculus UTCs. First, we prove that progression of a (possibly infinite) UTC is always first-order definable and results in a UTC. Though first-order definable, we show that the progression of a UTC may be infeasible, that is, it may result in an infinite UTC that is not equivalent to any finite set of first-order sentences. We then show that deciding whether %or not there is a feasible progression of a UTC is undecidable. Moreover, we show that deciding whether %or not a sentence (in an expressive fragment of first-order logic) is in the progression of a UTC is CONEXPTIME-complete, and that there exists a family of UTCs for which the size of every feasible progression grows exponentially. Finally, we discuss resolution-based approaches to compute the progression of a UTC. This comprehensive analysis contributes to a better understanding of progression in action theories, both in terms of feasibility and difficulty.


Certain Answers for SPARQL with Blank Nodes
Daniel Hernández, Claudio Gutierrez, Aidan Hogan. International Semantic Web Conference (1) 2018: 337-353.

Blank nodes in RDF graphs can be used to represent values known to exist but whose identity remains unknown. A prominent example of such usage can be found in the Wikidata dataset where, e.g., the author of Beowulf is given as a blank node. However, while SPARQL considers blank nodes in a query as existentials, it treats blank nodes in RDF data more like constants. Running SPARQL queries over datasets with unknown values may thus lead to counter-intuitive results, which may make the standard SPARQL semantics unsuitable for datasets with existential blank nodes. We thus explore the feasibility of an alternative SPARQL semantics based on certain answers. In order to estimate the performance costs that would be associated with such a change in semantics for current implementations, we adapt and evaluate approximation techniques proposed in a relational database setting for a core fragment of SPARQL. To further understand the impact that such a change in semantics may have on query solutions, we analyse how this new semantics would affect the results of user queries over Wikidata.


Canonicalisation of Monotone SPARQL Queries
Jaime Salas, Aidan Hogan. International Semantic Web Conference (1) 2018: 600-616.

Caching in the context of expressive query languages such as SPARQL is complicated by the difficulty of detecting equivalent queries: deciding if two conjunctive queries are equivalent is NP-complete, where adding further query features makes the problem undecidable. Despite this complexity, in this paper we propose an algorithm that performs syntactic canonicalisation of SPARQL queries such that the answers for the canonicalised query will not change versus the original. We can guarantee that the canonicalisation of two queries within a core fragment of SPARQL (monotone queries with select, project, join and union) is equal if and only if the two queries are equivalent; we also support other SPARQL features but with a weaker soundness guarantee: that the (partially) canonicalised query is equivalent to the input query. Despite the fact that canonicalisation must be harder than the equivalence problem, we show the algorithm to be practical for real-world queries taken from SPARQL endpoint logs, and further show that it detects more equivalent queries than when compared with purely syntactic methods. We also present the results of experiments over synthetic queries designed to stress-test the canonicalisation method, highlighting difficult cases.


Semantics and Validation of Recursive SHACL
Julien Corman, Juan L. Reutter, Ognjen Savkovic. International Semantic Web Conference (1) 2018: 318-336.

With the popularity of RDF as an independent data model came the need for specifying constraints on RDF graphs, and for mechanisms to detect violations of such constraints. One of the most promising schema languages for RDF is SHACL, a recent W3C recommendation. Unfortunately, the specification of SHACL leaves open the problem of validation against recursive constraints. This omission is important because SHACL by design favors constraints that reference other ones, which in practice may easily yield reference cycles.

In this paper, we propose a concise formal semantics for the so-called “core constraint components” of SHACL. This semantics handles arbitrary recursion, while being compliant with the current standard. Graph validation is based on the existence of an assignment of SHACL “shapes” to nodes in the graph under validation, stating which shapes are verified or violated, while verifying the targets of the validation process. We show in particular that the design of SHACL forces us to consider cases in which these assignments are partial, or, in other words, where the truth value of a constraint at some nodes of a graph may be left unknown.

Dealing with recursion also comes at a price, as validating an RDF graph against SHACL constraints is NP-hard in the size of the graph, and this lower bound still holds for constraints with stratified negation. Therefore we also propose a tractable approximation to the validation problem.


Faster and Smaller Two-Level Index for Network-Based Trajectories
Rodrigo Rivera, M. Andrea Rodríguez, Diego Seco. SPIRE 2018: 348-362.

Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top-level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the interval-intersection problem, but it has been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure on the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.



Copyless cost-register automata: Structure, expressiveness, and closure properties
Filip Mazowiecki, Cristian Riveros, Copyless cost-register automata: Structure, expressiveness, and closure properties, Journal of Computer and System Sciences, Volume 100, 2019, Pages 1-29, ISSN 0022-0000.

Cost register automata (CRA) and its subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over strings. We study some structural properties, expressiveness, and closure properties of copyless CRA. We show that copyless CRA is strictly less expressive than weighted automata and is not closed under reverse operation. To find a better class we impose restrictions on copyless CRA, which ends successfully with a new robust computational model that is closed under reverse and other extensions.


Tree Path Majority Data Structures
Travis Gagie, Meng He, Gonzalo Navarro. CoRR abs/1806.01804 (2018).

We present the first solution to τ-majorities on tree paths. Given a tree of n nodes, each with a label from [1..σ], and a fixed threshold 0<τ<1, such a query gives two nodes u and v and asks for all the labels that appear more than τ|Puv| times in the path Puv from u to v, where |Puv| denotes the number of nodes in Puv. Note that the answer to any query is of size up to 1/τ. On a w-bit RAM, we obtain a linear-space data structure with O((1/τ)lognloglogwσ) query time. For any κ>1, we can also build a structure that uses O(nlog[κ]n) space, where log[κ]n denotes the function that applies logarithm κ times to n, and answers queries in time O((1/τ)loglogwσ). The construction time of both structures is O(nlogn). We also describe two succinct-space solutions with the same query time of the linear-space structure. One uses 2nH+4n+o(n)(H+1) bits, where Hlgσ is the entropy of the label distribution, and can be built in O(nlogn) time. The other uses nH+O(n)+o(nH) bits and is built in O(nlogn) time w.h.p.


Weighted Shortest Paths for RDF Graphs
Gonzalo Tartari, Aidan Hogan: WiSP. VOILA@ISWC 2018: 37-52.

An important aspect of exploratory search over graph data is to understand what paths connect a given pair of nodes. Since the resulting paths can be manifold, various works propose ranking paths likely to be of interest to a user; these methods often rely on enumerating all such paths (up to a fixed length or number) before ranking is applied. In this paper, we instead propose applying a shortest path search on weighted versions of the graph in order to directly compute the most relevant path(s) between two nodes without fixed-length bounds, further obviating the need to enumerate irrelevant paths. We investigate weightings based on node degree, PageRank and edge frequency, contrasting the paths produced by these schemes over the Wikidata graph and discussing performance issues. Finally we conduct a user study over Wikidata where evaluators assess the quality of the paths produced; though inter-rater consensus on which paths are of most interest is low, we achieve statistically significant results to suggest that users find the weighted shortest paths more interesting than the baseline shortest paths without weights.


Reproducibility of Computational Environments for Scientific Experiments using Container-based Virtualization
Maximiliano Osorio, Carlos Buil Aranda, Hernán Vargas. SemSci@ISWC 2018: 43-51.

Experiment reproducibility is the ability to run an experiment with the introduction of changes to it and getting results that are consistent with the original ones. To allow reproducibility, the scientific community encourages researchers to publish descriptions of the these experiments. However, these recommendations do not include an automated way for creating such descriptions: normally scientists have to annotate their experiments in a semi automated way. In this paper we propose a system to automatically describe computational environments used in in-silico experiments. We propose to use Operating System (OS) virtualization (containerization) for distributing software experiments throughout software images and an annotation system that will allow to describe these software images. The images are a minimal version of an OS (container) that allow the deployment of multiple isolated software packages within it.


Efficient Evaluation and Static Analysis for Well-Designed Pattern Trees with Projection
Pablo Barceló, Markus Kröll, Reinhard Pichler, Sebastian Skritek. ACM Trans. Database Syst. 43(2): 8:1-8:44 (2018).

Conjunctive queries (CQs) fail to provide an answer when the pattern described by the query does not exactly match the data. CQs might thus be too restrictive as a querying mechanism when data is semistructured or incomplete. The semantic web therefore provides a formalism—known as (projected) well-designed pattern trees (pWDPTs)—that tackles this problem: pWDPTs allow us to formulate queries that match parts of the query over the data if available, but do not ignore answers of the remaining query otherwise. Here we abstract away the specifics of semantic web applications and study pWDPTs over arbitrary relational schemas. Since the language of pWDPTs subsumes CQs, their evaluation problem is intractable. We identify structural properties of pWDPTs that lead to (fixed-parameter) tractability of various variants of the evaluation problem. We also show that checking if a pWDPT is equivalent to one in our tractable class is in 2EXPTIME. As a corollary, we obtain fixed-parameter tractability of evaluation for pWDPTs with such good behavior. Our techniques also allow us to develop a theory of approximations for pWDPTs.



Pumping Lemmas for Weighted Automata
Filip Mazowiecki, Cristian Riveros. STACS 2018: 50:1-50:14.

We present three pumping lemmas for three classes of functions definable by fragments of weighted automata over the min-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomiallyambiguous weighted automata, and the full class of weighted automata is strict for the minplus semiring.



Implicit Representation of Bigranular Rules for Multigranular Data
Stephen J. Hegner, M. Andrea Rodríguez. DEXA (1) 2018: 372-389.

Domains for spatial and temporal data are often multigranular in nature, possessing a natural order structure defined by spatial inclusion and time-interval inclusion, respectively. This order structure induces lattice-like (partial) operations, such as join, which in turn lead to join rules, in which a single domain element (granule) is asserted to be equal to, or contained in, the join of a set of such granules. In general, the efficient representation of such join rules is a difficult problem. However, there is a very effective representation in the case that the rule is bigranular; i.e., all of the joined elements belong to the same granularity, and, in addition, complete information about the (non)disjointness of all granules involved is known. The details of that representation form the focus of the paper.


Tactical distribution in local funding: The value of an aligned mayor
Bernardo Lara E., Sergio Toro M., Tactical distribution in local funding: The value of an aligned mayor, European Journal of Political Economy, Volume 56, 2019, Pages 74-89, ISSN 0176-2680

Using Chile as a case study for understanding tactical distribution under extensive controls on expenditure, this paper examines whether political motives affect the allocation of funds from the central government to localities. Collecting local-level data of two infrastructure funding programs and using the voting gap percentage between the coalition candidate and opposition competitors in a Sharp Regression Discontinuity methodology, we find causal evidence in favor of three hypotheses: (i) a coalition criterion influences the funding allocation to the local level; (ii) an electoral cycle exists in local funding; and (iii) the degree of coalition targeting varies based on a locality’s history of coalition alignment. In sum, the central government regards politically aligned mayors as valuable electoral assets, especially in municipalities historically aligned with the coalition.


First-Order Rewritability of Frontier-Guarded Ontology-Mediated Queries
Pablo Barceló, Gerald Berger, Carsten Lutz, Andreas Pieris. AMW 2018 Cambiar por versión IJCAI.

Ontology-based data access (OBDA) is a successful application of knowledge representation and reasoning technologies in information management systems. One premier goal is to facilitate access to data that is heterogeneous and incomplete. This is achieved via an ontology that enriches the user query, typically a union of conjunctive queries, with domain knowledge. It turned out that the ontology and the user query can be seen as two components of one composite query, called ontology-mediated query (OMQ).

The problem of answering OMQs is thus central to OBDA. There is a consensus that the required level of scalability in OMQ answering can be achieved by using standard database management systems. To this end, a standard approach used nowadays is query rewriting: the ontology O and the database query q are combined into a new query qO, the so-called rewriting, which gives the same answer as the OMQ consisting of O and q over all input databases. It is of course essential that the rewriting qO is expressed in a language that can be handled by standard database systems. The typical language that is considered is the class of first-order (FO) queries.

In this work, we focus on two central OMQ languages based on guarded and frontierguarded tuple-generating dependencies (TGDs), and we study the problem whether an OMQ is FO-rewritable, i.e, it can be equivalently expressed as a first-order query. Recall that a guarded (resp., frontier-guarded) TGD is a sentence of the form ∀x, ¯ y¯(φ(¯x, y¯) → ∃z ψ¯ (¯x, z¯)), where φ and ψ are conjunctions of relational atoms, and φ has an atom that contains all the variables (¯x ∪ y¯) (resp., x¯) [1, 8]. Our goal is to develop specially tailored techniques that allow us to understand the above non-trivial problem, and also to pinpoint its computational complexity. To this end, as we discuss below, we follow two different approaches. Our results can be summarized as follows:

-We first focus on the simpler OMQ language based on guarded TGDs and atomic queries, and, in Section 2, we provide a characterization of FO-rewritability that forms the basis for applying tree automata techniques.

-We then exploit, in Section 3, standard two-way alternating parity tree automata. In particular, we reduce our problem to the problem of checking the finiteness of the language of an automaton. The reduction relies on a refined version of the characterization of FO-rewritability established in Section 2. This provides a transparent solution to our problem based on standard tools, but it does not lead to an optimal result.

-Towards an optimal result, we use, in Section 4, a more sophisticated automata model, known as cost automata. In particular, we reduce our problem to the problem of checking the boundedness of a cost automaton. This allows us to show that FOrewritability for OMQs based on guarded TGDs and atomic queries is in 2EXPTIME, and in EXPTIME for predicates of bounded arity. The complexity analysis relies on an intricate result on the boundedness problem for a certain class of cost automata [5, 9].

-Finally, in Section 5, by using the results of Section 4, we provide a complete picture for the complexity of our problem, i.e., deciding whether an OMQ based on (frontier-)guarded TGDs and arbitrary (unions of) conjunctive queries is FO-rewritable.



First-Order Rewritability of Frontier-Guarded Ontology-Mediated Queries
Pablo Barceló, Gerald Berger, Carsten Lutz, Andreas Pieris. IJCAI 2018: 1707-1713.

We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2EXPTIME-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest.



SynKit: LTL Synthesis as a Service
Alberto Camacho, Christian J. Muise, Jorge A. Baier, Sheila A. McIlraith. IJCAI 2018: 5817-5819.

Automatic synthesis of software from specification is one of the classic problems in computer science. In the last decade, significant advances have been made in the synthesis of programs from specifications expressed in Linear Temporal Logic (LTL). LTL synthesis technology is central to a myriad of applications from the automated generation of controllers for Internet of Things devices, to the synthesis of control software for robotic applications. Unfortunately, the number of existing tools for LTL synthesis is limited, and using them requires specialized expertise. In this paper we present SynKit, a tool that offers LTL synthesis as a service. SynKit integrates a RESTful API and a web service with an editor, a solver, and a strategy visualizer.


A Model of Distributed Query Computation in Client-Server Scenarios on the Semantic Web
Olaf Hartig, Ian Letter, Jorge Pérez. IJCAI 2018: 5259-5263.

This paper provides an overview of a model for capturing properties of client-server-based query computation setups. This model can be used to formally analyze different combinations of client and server capabilities, and compare them in terms of various fine-grain complexity measures. While the motivations and the focus of the presented work are related to querying the Semantic Web, the main concepts of the model are general enough to be applied in other contexts as well.


LTL Realizability via Safety and Reachability Games
Alberto Camacho, Christian J. Muise, Jorge A. Baier, Sheila A. McIlraith. IJCAI 2018: 4683-4691.

In this paper, we address the problem of LTL realizability and synthesis. State of the art techniques rely on so-called bounded synthesis methods, which reduce the problem to a safety game. Realizability is determined by solving synthesis in a dual game. We provide a unified view of duality, and introduce novel bounded realizability methods via reductions to reachability games. Further, we introduce algorithms, based on AI automated planning, to solve these safety and reachability games. This is the the first complete approach to LTL realizability and synthesis via automated planning. Experiments illustrate that reductions to reachability games are an alternative to reductions to safety games, and show that planning can be a competitive approach to LTL realizability and synthesis.


A More General Theory of Static Approximations for Conjunctive Queries
Pablo Barceló, Miguel Romero, Thomas Zeume. ICDT 2018: 7:1-7:22.

Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. If a CQ is hard to evaluate, it is thus useful to evaluate an approximation of it in such fragments. While underapproximations (i.e., those that return correct answers only) are well-understood, the dual notion of overapproximations that return complete (but not necessarily sound) answers, and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations, is open. We develop a connection with existential pebble game tools that allows the systematic study of such problems. In particular, we show that the evaluation and identification of overapproximations can be solved in polynomial time. We also make progress in the problem of existence of overapproximations, showing it to be decidable in 2EXPTIME over the class of acyclic CQs. Furthermore, we look at when overapproximations do not exist, suggesting that this can be alleviated by using a more liberal notion of overapproximation. We also show how to extend our tools to study symmetric difference approximations. We observe that such approximations properly extend under- and over-approximations, settle the complexity of its associated identification problem, and provide several results on existence and evaluation.


G-CORE: A Core for Future Graph Query Languages
Renzo Angles, Marcelo Arenas, Pablo Barceló, Peter A. Boncz, George H. L. Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan F. Sequeda, Oskar van Rest, Hannes Voigt. SIGMOD Conference 2018: 1421-1432.

We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.



Finite LTL Synthesis as Planning
Alberto Camacho, Jorge A. Baier, Christian J. Muise, Sheila A. McIlraith. ICAPS 2018: 29-38.

LTL synthesis is the task of generating a strategy that satisfies a Linear Temporal Logic (LTL) specification interpreted over infinite traces. In this paper we examine the problem of LTLf synthesis, a variant of LTL synthesis where the specification of the behaviour of the strategy we generate is interpreted over finite traces — similar to the assumption we make in many planning problems, and important for the synthesis of business processes and other system interactions of finite duration. Existing approaches to LTLf synthesis transform LTLf into deterministic finite-state automata (DFA) and reduce the synthesis problem to a DFA game. Unfortunately, the DFA transformation is worst-case double-exponential in the size of the formula, presenting a computational bottleneck. In contrast, our approach exploits non-deterministic automata, and we reduce the synthesis problem to a non-deterministic planning problem. We leverage our approach not only for strategy generation but also to generate certificates of unrealizability — the first such method for LTLf. We employ a battery of techniques that exploit the structure of the LTLf specification to improve the efficiency of our transformation to automata. We combine these techniques with lazy determinization of automata and on-the-fly state abstraction. We illustrate the effectiveness of our approach on a set of established LTL synthesis benchmarks adapted to finite LTL.


Containment for Rule-Based Ontology-Mediated Queries
Pablo Barceló, Gerald Berger, Andreas Pieris. PODS 2018: 267-279.

Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. We study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand.


Constant Delay Algorithms for Regular Document Spanners
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, Domagoj Vrgoc. PODS 2018: 165-177.

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner’s output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine grained analysis of the classes of document spanners that support efficient enumeration of their results.


Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity
Francisco Maturana, Cristian Riveros, Domagoj Vrgoc. PODS 2018: 125-136.

Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information. We extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment.


Extracting semantic knowledge from web context for multimedia IR: a taxonomy, survey and challenges
Teresa Bracamonte, Benjamin Bustos, Barbara Poblete, Tobias Schreck. Multimedia Tools Appl. 77(11): 13853-13889 (2018).

Since its invention, the Web has evolved into the largest multimedia repository that has ever existed. This evolution is a direct result of the explosion of user-generated content, explained by the wide adoption of social network platforms. The vast amount of multimedia content requires effective management and retrieval techniques. Nevertheless, Web multimedia retrieval is a complex task because users commonly express their information needs in semantic terms, but expect multimedia content in return. This dissociation between semantics and content of multimedia is known as the semantic gap. To solve this, researchers are looking beyond content-based or text-based approaches, integrating novel data sources. New data sources can consist of any type of data extracted from the context of multimedia documents, defined as the data that is not part of the raw content of a multimedia file. The Web is an extraordinary source of context data, which can be found in explicit or implicit relation to multimedia objects, such as surrounding text, tags, hyperlinks, and even in relevance-feedback. Recent advances in Web multimedia retrieval have shown that context data has great potential to bridge the semantic gap. In this article, we present the first comprehensive survey of context-based approaches for multimedia information retrieval on the Web. We introduce a data-driven taxonomy, which we then use in our literature review of the most emblematic and important approaches that use context-based data. In addition, we identify important challenges and opportunities, which had not been previously addressed in this area.



Report on the 2nd International Workshop on Recent Trends in News Information Retrieval (NewsIR’18)
Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martínez, Bárbara Poblete, Andreas Vlachos: Report on the 2nd International Workshop on Recent Trends in News Information Retrieval (NewsIR'18). SIGIR Forum 52(1): 140-146 (2018).

The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but also to understand what is happening in the world given the multiple points of view describing events. These challenges in journalism relate to Information Retrieval (IR) and Natural Language Processing (NLP) fields such as: verification of a source’s reliability; the integration of news with other sources of information; real-time processing of both news content and social streams; de-duplication of stories; and entity detection and disambiguation. Although IR and NLP have been applied to news for decades, the changing nature of the space requires fresh approaches and a closer collaboration with our colleagues from the journalism environment. Following the success of the previous version of the workshop (NewsIR’16), the goal of this workshop, held in conjunction with ECIR 2018, is to continue to stimulate such discussion between the communities and to share interesting approaches to solve real user problems. A total number of 19 submissions were received and reviewed, of which 12 were accepted for presentation. In addition to that, we had over 30 registered participants in the workshop who were pleased to attend the two keynote talks given by well-known experts in the field – Edgar Meij (from industry) and Peter Tolmie (from academia) and oral and poster presentations from the accepted papers. The workshop also included a breakout session to discuss ideas for a future data challenge in news IR and closed with a focused panel discussion to reflect on the day. In summary, several ideas were presented in the workshop on solving complex information needs in the news domain. In addition, the workshop concluded with suggestions of important challenges and shared tasks to work on as a community for News IR.


Early Tracking of People’s Reaction in Twitter for Fast Reporting of Damages in the Mercalli Scale
Marcelo Mendoza, Bárbara Poblete, Ignacio Valderrama. HCI International 2018 (14) 2018: 247-257.

The Modified Mercalli Intensity Scale is a measure of the severity of an earthquake for a nonscientist. Since the Mercalli scale is based on perceived effects, it has a strong dependence on observers. Typically, these reports take time to be prepared and, as a consequence, Mercalli intensities are published hours after the occurrence of an earthquake. The National Seismological Center of Chile needs to provide a preliminary overview of the observed effects of an earthquake. This has motivated us to create a system for early tracking of people’s reaction in social networks to infer Mercalli intensities. By tracking people’s comments about the effects of an earthquake, a collection of Mercalli point estimates is retrieved at county level of granularity. We introduce the concept of Reinforced Mercalli support that combines Mercalli point estimates with social support, allowing to discard social unsupported estimates. Experimental results show that our proposal is accurate providing early Mercalli reports 30 min after an earthquake, detecting the maximum Mercalli intensity of an event with high accuracy in terms of mean absolute error (MAE).



Domain-Independent Detection of Emergency Situations Based on Social Activity Related to Geolocations
Hernán Sarmiento, Bárbara Poblete, Jaime Campos. WebSci 2018: 245-254.

In general, existing methods for automatically detecting emergency situations using Twitter rely on features based on domain-specific keywords found in messages. This type of keyword-based methods usually require training on domain-specific labeled data, using multiple languages, and for different types of events (e.g., earthquakes, floods, wildfires, etc.). In addition to being costly, these approaches may fail to detect previously unexpected situations, such as uncommon catastrophes or terrorist attacks. However, collective mentions of certain keywords are not the only type of self-organizing phenomena that may arise in social media when a real-world extreme situation occurs. Just as nearby physical sensors become activated when stimulated, localized citizen sensors (i.e., users) will also react in a similar manner. To leverage this information, we propose to use self-organized activity related to geolocations to identify emergency situations. We propose to detect such events by tracking the frequencies, and probability distributions of the interarrival time of the messages related to specific locations. Using an off-the-shelf classifier that is independent of domain-specific features, we study and describe emergency situations based solely on location-based features in messages. Our findings indicate that anomalies in location-related social media user activity indeed provide information for automatically detecting emergency situations independent of their domain.



What Should Entity Linking link?
Henry Rosales-Méndez, Barbara Poblete, Aidan Hogan. AMW 2018.

Some decades have passed since the concept of “named entity” was used for the first time. Since then, new lines of research have emerged in this environment, such as linking the (named) entity mentions in a text collection with their corresponding knowledge-base entries.

However, this introduces problems with respect to a consensus on the definition of the concept of “entity” in the literature. This paper aims to highlight the importance of formalizing the concept of “entity” and the benefits it would bring to the Entity Linking community, in particular relating to the construction of gold standards for evaluation purposes.


Towards a Robust Semantics for SHACL: Preliminary Discussion
Julien Corman, Juan L. Reutter, Ognjen Savkovic. AMW 2018.

Validating RDF graphs against constraints has gained interest in recent years, due to the popularity of RDF and the growth of knowledge bases. SHACL, a constraint language for RDF, has recently become a W3C recommendation, with a specification detailing syntax, semantics and common use cases. Unfortunately, this (otherwise complete) specification does not cover validation against recursive constraints. This omission is important, because SHACL by design favors constraint references. We investigate the possibility of a formal semantics for SHACL which covers the recursive case, while being compliant with the current standard.


The Property Graph Database Model
Renzo Angles. AMW 2018.

Most of the current graph database systems have been designed to support property graphs. Surprisingly, there is no standard specification of the database model behind such systems. This paper presents a formal definition of the property graph database model. Specifically, we define the property graph data structure, basic notions of integrity constraints (e.g. graph schema), and a graph query language.


A Simple, Efficient, Parallelizable Algorithm for Approximated Nearest Neighbors
Sebastián Ferrada, Benjamin Bustos, Nora Reyes. AMW 2018.

The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose an heuristic that solves the 1-NN selfsimilarity join, that is, a similarity join of a dataset with itself, that brings together each element with its nearest neighbor within the same dataset. Solving the problem using a simple brute-force algorithm requires O(n 2 ) distance calculations, since it requires to compare every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximated solution for the self-similarity join that computes only O(n 3 2 ) distances. We show how the algorithm can be easily modified in order to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and such that 79% of the results are within the 10-NN, with no significant extra distance computations. We present how the algorithm can be executed in parallel and prove that using Θ( √ n) processors, the total execution takes linear time. We end discussing ways in which the algorithm can be improved in the future.


Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management
Dan Olteanu, Barbara Poblete. Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management, Cali, Colombia, May 21-25, 2018. CEUR Workshop Proceedings 2100, 2018.

AMW 2018, Alberto Mendelzon Workshop on Foundations of Data Management.

Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management Cali, Colombia, May 21-25, 2018.

Edited by
Dan Olteanu, University of Oxford, UK
Bárbara Poblete, University of Chile, Chile.


A Data-Driven Graph Schema
Larry González, Aidan Hogan. AMW 2018.

Graph-based data models [1] have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph [2] where users can add new properties and types that can be used to define further data. The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of knowledge graphs. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs.

The traditional approach in the Semantic Web has been what Pham and Boncz [3] call the schema first approach, which defines the schema that the data should follow. The most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach [3], which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting highlevel summaries that characterise the graph, resulting in a data-driven schema. In this paper, we summarise recently published results [4] on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains, and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.


The Data Readiness Problem for Relational Databases
Rada Chirkova, Jon Doyle, Juan L. Reutter. AMW 2018.

We consider the problem of determining whether organizations facing a new data-transformation task can avoid building a new transformation procedure from scratch by reusing their stored procedures. Because it can be difficult to obtain exact descriptions of what stored procedures do, our framework abstracts data-transforming tools as black-box procedures, in which a procedure description indicates the parts of the database that might be modified by the procedure and constraints on the states of the database that must hold before and after the application of this procedure.

In this paper we present our framework and study the problem of determining, given a database and a set of procedures, whether there is a sequence of procedures from this set such that their application to the database results in the satisfaction of a boolean query. This data readiness problem is undecidable in general, but we show decidability for a broad and realistic class of procedures.


Semantics and Complexity of GraphQL
Olaf Hartig, Jorge Pérez. WWW 2018: 1155-1164.

GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits really efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.


Querying Wikimedia Images using Wikidata Facts
Sebastián Ferrada, Nicolás Bravo, Benjamín Bustos, Aidan Hogan. WWW (Companion Volume) 2018: 1815-1821.

Despite its importance to the Web, multimedia content is often neglected when building and designing knowledge-bases: though descriptive metadata and links are often provided for images, video, etc., the multimedia content itself is often treated as opaque and is rarely analysed. IMGpedia is an effort to bring together the images of Wikimedia Commons (including visual information), and relevant knowledge-bases such as Wikidata and DBpedia. The result is a knowledge-base that incorporates similarity relations between the images based on visual descriptors, as well as links to the resources of Wikidata and DBpedia that relate to the image. Using the IMGpedia SPARQL endpoint, it is then possible to perform visuo-semantic queries, combining the semantic facts extracted from the external resources and the similarity relations of the images. This paper presents a new web interface to browse and explore the dataset of IMGpedia in a more friendly manner, as well as new visuo-semantic queries that can be answered using 6 million recently added links from IMGpedia to Wikidata. We also discuss future directions we foresee for the IMGpedia project.


Building Knowledge Maps of Web Graphs
Valeria Fionda, Giuseppe Pirrò, Claudio Gutierrez. WWW (Companion Volume) 2018: 479-482.

We research the problem of building knowledge maps of graph-like information. We live in the digital era and similarly to the Earth, the Web is simply too large and its interrelations too complex for anyone to grasp much of it through direct observation. Thus, the problem of applying cartographic principles also to digital landscapes is intriguing. We introduce a mathematical formalism that captures the general notion of map of a graph and enables its development and manipulation in a semi-automated way. We describe an implementation of our formalism on the Web of Linked Data graph and discuss algorithms that efficiently generate and combine (via an algebra) regions and maps. Finally, we discuss examples of knowledge maps built with a tool implementing our framework.


Workshop on Linked Data on the Web co-located with The Web Conference 2018
Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann. LDOW@WWW 2018, Lyon, France April 23rd, 2018. CEUR Workshop Proceedings 2073, 2018.

Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann. LDOW@WWW 2018, Lyon, France April 23rd, 2018. CEUR Workshop Proceedings 2073, 2018.


Synthesizing Controllers: On the Correspondence Between LTL Synthesis and Non-deterministic Planning
Alberto Camacho, Jorge A. Baier, Christian J. Muise, Sheila A. McIlraith. Canadian Conference on AI 2018: 45-59.

Linear Temporal Logic ( 𝖫𝖳𝖫 ) synthesis can be understood as the problem of building a controller that defines a winning strategy, for a two-player game against the environment, where the objective is to satisfy a given 𝖫𝖳𝖫 formula. It is an important problem with applications in software synthesis, including controller synthesis. In this paper we establish the correspondence between 𝖫𝖳𝖫 synthesis and fully observable non-deterministic (FOND) planning. We study 𝖫𝖳𝖫 interpreted over both finite and infinite traces. We also provide the first explicit compilation that translates an 𝖫𝖳𝖫 synthesis problem to a FOND problem. Experiments with state-of-the-art 𝖫𝖳𝖫 FOND and synthesis solvers show automated planning to be a viable and effective tool for highly structured 𝖫𝖳𝖫 synthesis problems.


Automatically Generating Wikipedia Info-boxes from Wikidata
Tomás Sáez, Aidan Hogan. WWW (Companion Volume) 2018: 1823-1830.

Info-boxes provide a summary of the most important meta-data relating to a particular entity described by a Wikipedia article. However, many articles have no info-box or have info-boxes with only minimal information; furthermore, there is a huge disparity between the level of detail available for info-boxes in English articles and those for other languages. Wikidata has been proposed as a central repository of facts to try to address such disparities, and has been used as a source of information to generate info-boxes. However, current processes still rely on human intervention either to create generic templates for entities of a given type or to create a specific info-box for a specific article in a specific language. As such, there are still many articles of Wikipedia without info-boxes but where relevant data are provided by Wikidata. In this paper, we investigate fully automatic methods to generate info-boxes for Wikipedia from the Wikidata knowledge graph. The primary challenge is to create ranking mechanisms that provide an intuitive prioritisation of the facts associated with an entity. We discuss this challenge, propose several straightforward metrics to prioritise information in info-boxes, and present an initial user evaluation to compare the quality of info-boxes generated by various metrics.


TriAL: A Navigational Algebra for RDF Triplestores
Leonid Libkin, Juan L. Reutter, Adrián Soto, Domagoj Vrgoc. ACM Trans. Database Syst. 43(1): 5:1-5:46 (2018).

Navigational queries over RDF data are viewed as one of the main applications of graph query languages, and yet the standard model of graph databases—essentially labeled graphs—is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning.

Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our language with previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide an implementation of recursive TriAL on top of a relational query engine, and we show its usefulness by running a wide array of navigational queries over real-world RDF data, while at the same time testing how our implementation compares to existing RDF systems.


Characterising RDF data sets
Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente Redondo, Claudio Gutiérrez. J. Information Science 44(2): 203-229 (2018).

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.


A Graph-Based Approach for Querying Protein-Ligand Structural Patterns
Renzo Angles, Mauricio Arenas. IWBBIO (1) 2018: 235-244.

In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is very relevant as it can give fundamental insights about protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets the users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is graphically designed as a graph such that the nodes represent protein’s components and the edges represent structural relationships. The resulting graph pattern is transformed into a SQL query, and executed in a PostgreSQL database system where the PDB data is stored. The results of the search are presented using a textual representation, and the corresponding binding-sites can be visualized using a JSmol interface.


Learning to Leverage Microblog Information for QA Retrieval
Jose Miguel Herrera, Barbara Poblete, Denis Parra. ECIR 2018: 507-520.

Community Question Answering (cQA) sites have emerged as platforms designed specifically for the exchange of questions and answers among users. Although users tend to find good quality answers in cQA sites, they also engage in a significant volume of QA interactions in other platforms, such as microblog networking sites. This in part is explained because microblog platforms contain up-to-date information on current events, provide rapid information propagation, and have social trust.

Despite the potential of microblog platforms, such as Twitter, for automatic QA retrieval, how to leverage them for this task is not clear. There are unique characteristics that differentiate Twitter from traditional cQA platforms (e.g., short message length, low quality and noisy information), which do not allow to directly apply prior findings in the area. In this work, we address this problem by studying: (1) the feasibility of Twitter as a QA platform and (2) the discriminating features that identify relevant answers to a particular query. In particular, we create a document model at conversation-thread level, which enables us to aggregate microblog information, and set up a learning-to-rank framework, using factoid QA as a proxy task. Our experimental results show microblog data can indeed be used to perform QA retrieval effectively. We identify domain-specific features and combinations of those features that better account for improving QA ranking, achieving a MRR of 0.7795 (improving 62% over our baseline method). In addition, we provide evidence that our method allows to retrieve complex answers to non-factoid questions.



Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval co-located with 40th European Conference on Information Retrieval (ECIR 2018)
Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martinez, Barbara Poblete, Andreas Valochas. CEUR Workshop Proceedings 2079, 2018.

Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martinez, Barbara Poblete, Andreas Valochas: Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval co-located with 40th European Conference on Information Retrieval (ECIR 2018), Grenoble, France, March 26, 2018. CEUR Workshop Proceedings 2079, 2018.


Graph Query Languages
Angles R., Reutter J., Voigt H. (2018) Graph Query Languages. In: Sakr S., Zomaya A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham.

In the Encyclopedia of Big Data Technologies:

A query language is a high-level computer language for the retrieval and modification of data held in databases or files. Query languages usually consist of a collection of operators which can be applied to any valid instances of the data structure types of a data model, in any combination desired.

In the context of graph data management, a graph query language (GQL) defines the way to retrieve or extract data which have been modeled as a graph and whose structure is defined by a graph data model. Therefore, a GQL is designed to support specific graph operations, such as graph pattern matching and shortest path finding.


Graph Path Navigation
Arenas M., Barceló P., Libkin L. (2018) Graph Path Navigation. In: Sakr S., Zomaya A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham.

In the Encyclopedia of Big Data Technologies:

Navigational query languages for graph databases allow to recursively traverse the edges of a graph while checking for the existence of a path that satisfies certain regular conditions. The basic building block of such languages is the class of regular path queries (RPQs), which are expressions that compute the pairs of nodes that are linked by a path whose label satisfies a regular expression. RPQs are often extended with features that turn them more flexible for practical applications, e.g., with the ability to traverse edges in the backward direction (RPQs with inverses) or to express arbitrary patterns over the data (conjunctive RPQs).


Containment of queries for graphs with data
Egor V. Kostylev, Juan L. Reutter, Domagoj Vrgoc. J. Comput. Syst. Sci. 92: 65-91 (2018)

We consider the containment problem for regular queries with memory and regular queries with data tests: two recently proposed query languages for graph databases that, in addition to allowing the user to ask topological queries, also track how the data changes along paths connecting various points in the database. Our results show that the problem is undecidable in general. However, by allowing only positive data comparisons we find natural fragments with better static analysis properties: the containment problem is PSpace -complete in the case of regular queries with data tests and ExpSpace -complete in the case of regular queries with memory.

Highlights: -Study of graph query languages that can deal with data values. -Containment problem for these languages is undecidable in general. -Containment is decidable if one focuses on languages that can only check for equalities. -Proofs make use of automata models.


Graph Data Models
Gutierrez C., Hidders J., Wood P.T. (2018) Graph Data Models. In: Sakr S., Zomaya A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. DOI
Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
Marcelo Mendoza, Sergio A. Velastin. 22nd Iberoamerican Congress, CIARP 2017, Valparaíso, Chile, November 7-10, 2017, Proceedings. Lecture Notes in Computer Science 10657, Springer 2018, ISBN 978-3-319-75192-4.

This book constitutes the refereed post-conference proceedings of the 22nd Iberoamerican Congress on Pattern Recognition, CIARP 2017, held in Valparaíso, Chile, in November 2017.

The 87 papers presented were carefully reviewed and selected from 156 submissions. The papers feature research results in the areas of pattern recognition, image processing, computer vision, multimedia and related fields.


RDF Compression
Martínez-Prieto M.A., Fernández J.D., Hernández-Illera A., Gutiérrez C. (2018) RDF Compression. In: Sakr S., Zomaya A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. DOI

RDF compression can be defined as the problem of encoding an RDF dataset using less bits than that required by text-based traditional serialization formats like RDF/XMLNTriples, or Turtle, among others. These savings immediately lead to more efficient storage (i.e., archival) and less transmission costs (i.e., less bits over the wire). Although this problem can be easily solved through universal compression (e.g., gzip or bzip2), optimized RDF-specific compressors take advantage of the particular features of RDF datasets (such as semantic redundancies) in order to save more bits or to provide retrieval operations on the compressed information. RDF self-indexes are focused on this latter task.

RDF self-indexes are RDF compressors that provide indexing features in a space close to that of the compressed dataset and can be accessed with no prior (or partial) decompression. These properties enhance scalability (i.e., less resources are required to serve semantic data) and speed up access as more information can be managed in higher levels of the memory hierarchy (typically, main memory or cache). In addition, efficient search algorithms have been proposed to resolve basic queries on top of self-indexed datasets. As a result, RDF self-indexes has been adopted as a core component of semantic search engines and lightweight Linked Data servers.

Finally, RDF stream compressors specifically focus on compressing a (continuous) stream of RDF data in order to improve exchange processes, typically in real-time. This constitutes a more recent trend that exploits different trade-offs between the space savings achieved by the compressor and the latency introduced in the compression/decompression processes.

This entry introduces basic notions of RDF compression, RDF self-indexing, and RDF stream compression and discusses how existing approaches deal with (and remove) redundant information in semantic datasets.



Feature-Based 3D Object Retrieval
Bustos B., Schreck T. (2017). In: Liu L., Özsu M. (eds) Encyclopedia of Database Systems. Springer, New York, NY.

3D objects are an important type of data with many applications in domains such as Engineering and Computer Aided Design, Science, Simulation, Visualization, Cultural Heritage, and Entertainment. Technological progress in acquisition, modeling, processing, and dissemination of 3D geometry leads to the accumulation of large repositories of 3D objects. Consequently, there is a strong need to research and develop technology to support the effective retrieval of 3D object data from 3D repositories.