Publications
Several queries and scores have recently been proposed to explain individual predictions over ML models. Examples include queries based on “anchors”, which are parts of an instance that are sufficient to justify its classification, and “feature-perturbation” scores such as SHAP. Given the need for flexible, reliable, and easy-to-apply interpretability methods for ML models, we foresee the need for developing declarative languages to naturally specify different explainability queries. We do this in a principled way by rooting such a language in a logic called FOIL, which allows for expressing many simple but important explainability queries, and might serve as a core for more expressive interpretability languages. We study the computational complexity of FOIL queries over two classes of ML models often deemed to be easily interpretable: decision trees and more general decision diagrams. Since the number of possible inputs for an ML model is exponential in its dimension, tractability of the FOIL evaluation problem is delicate but can be achieved by either restricting the structure of the models, or the fragment of FOIL being evaluated. We also present a prototype implementation of FOIL wrapped in a high-level declarative language and perform experiments showing that such a language can be used in practice.
«Parametric Sensitivity in Graph Neural Network for Urban Cluster Detection, a Case Study» in «Proceedings of the Tenth International Conference on Complex Networks and their Applications».
With the popularity of Bitcoin, there is a growing need to understand the functionality, security, and performance of various mechanisms that comprise it. In this paper, we analyze Bitcoin’s scripting language, Script, that is one of the main building blocks of Bitcoin transactions. We formally define the semantics of Script, and study the problem of determining whether a user-defined script is well-formed; that is, whether it can be unlocked, or whether it contains errors that would prevent this from happening.
US media and politics are defined by asymmetry and reactivity, with the Left operating by one set of rules and the Right by another. How should we respond to a right-wing media ecosystem increasingly detached from the facts but that reacts aggressively to news coverage and social media discourses from the political center and left? In this chapter, we argue that in a changing media ecology characterized by such asymmetric modes of interaction, we need fact-checking organizations, news outlets, platforms, academics, and foundations to commit to “frame-checking,” a device for countering the political right’s efforts not only to mislead but to distract and disorient. Publicizing frame diversions, we argue, will shed light on how the Right often ignores important events, promoting in their stead spurious interpretations and irrelevant events that dangerously disrupt the routines of enlightened public discourse.
Between 2009 and 2019, Chile experienced the rise and fall of a powerful and influential environmental movement. This movement spurred massive protests against large-scale energy and mining projects, successfully blocking many of them. Although these demonstrations brought together people of all ages and backgrounds, youth were particularly active in advocating for the environment. As digital natives, young people may experiment with new ways of engaging in participatory actions, especially through social network sites, instant messaging and other social applications. We use data from the annual Youth, Participation, and Media Use surveys fielded between 2009 and 2019 to study the individual-level relationship between social media and environmental activism among young Chileans. As expected, we find that social media use is positively associated with participation in environmental issues. Nevertheless, this relationship is dynamic, gradually weakening over time. Thus, our results suggest that social media effects on environmental activism are contingent upon the specific stage of the protest cycle. We close with a discussion of the relevance of our findings as well as their limitations.
Objective: Healthcare workers (HCWs) are at increased risk for SARS-CoV-2 infection; however, not all face the same risk. We aimed to determine IgG/IgM prevalence and risk factors associated with seropositivity in Chilean HCWs. Study Design and Setting: This was a nationwide, cross-sectional study including a questionnaire and COVID-19 lateral flow IgG/IgM antibody testing. All HCWs in the Chilean public health care system were invited to participate following the country’s first wave. Results: IgG/IgM positivity in 85,529 HCWs was 7.2%, ranging from 1.6% to 12.4% between regions. Additionally, 9.7% of HCWs reported a positive PCR, of which 47% were seropositive. Overall, 10,863 (12.7%) HCWs were PCR and/or IgG/IgM positive. Factors independently associated with increased odds ratios (ORs) for seropositivity were: working in a hospital, night shifts, contact with COVID-19, using public transport, male gender, age >45, BMI ≥30, and reporting ≥2 symptoms. Stress/mental health disorder and smoking were associated with decreased ORs. These factors remained significant when including PCR positive cases in the model. Conclusions: HCWs in the hospital were at highest risk for COVID-19, and several independent risk factors for seropositivity and/or PCR positivity were identified.
Complex event recognition (CER) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real time. CER finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. Existing CER languages lack a clear semantics, however, which makes them hard to understand and generalize. Moreover, there are no general techniques for evaluating CER query languages with clear performance guarantees.
In this article, we embark on the task of giving a rigorous and efficient framework to CER. We propose a formal language for specifying complex events, called complex event logic (CEL), that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. We give insight into the language design trade-offs regarding the strict sequencing operators of CEL and selection strategies.
With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by introducing a formal computational model for CER, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by output-linear delay enumeration of the results.
RDF has seen increased adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques and systems for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion on graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. An extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.
In the last two years, Colombia and Chile have witnessed strong social protests, characterized by slogans against inequality and the lack of social mobility. In this study we propose a comparative analysis of social mobility and the persistence of structural social inequalities in both countries. We collect evidence on the level of social immobility and test whether it is rooted in historical forms of social segregation in both countries. We base our analysis on surname-based methods. We conclude that there are clear indications of a significant persistence of upward immobility of the groups that were originally segregated during the colonial period: Afro-descendants (Colombia) and indigenous people (in both). Furthermore, we find that the downward social immobility of the elites shows an important persistence in both countries. However, in Chile the colonial elites (encomenderos and landowners) present greater persistence in their privileged status, while in Colombia those early elites seem to have converged more quickly to the mean. In both countries, there is a clear persistence of the elites of the second half of the 19th century in today’s highest position of the social ladder.
Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be mainly made by retrieving the most relevant descriptions from a corpus or generating a new one given a context video. These two ways represent essential tasks for Computer Vision and Natural Language Processing communities, called text retrieval from video task and video captioning/description task. These two tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths for the problem requirements. We also show the progress that researchers have made on each dataset, we cover the challenges in the field, and we discuss future research directions.
We address the problem of representing dynamic graphs using k²-trees. The k²-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general. It relies on compact representations of bit vectors. Hence, by relying on compact representations of dynamic bit vectors, we can also represent dynamic graphs. However, this approach suffers from a well-known bottleneck in compressed dynamic indexing. In this paper we present a k²-tree based implementation which follows instead the ideas by Munro et al. (PODS 2015) to circumvent this bottleneck. We present two dynamic graph k²-tree implementations, one as a standalone implementation and another as a C++ library. The library includes efficient edge and neighbourhood iterators, as well as some illustrative algorithms. Our experimental results show that these implementations are competitive in practice.
The SPARQL query language is the standard for querying RDF data and has been implemented in a wide variety of engines. These engines support hundreds of public endpoints on the Web which receive thousands of queries daily. In many cases these endpoints struggle when evaluating complex queries or when they receive too many of them concurrently, mostly because some of these queries need large amounts of resources to be processed. All these engines have an internal query optimizer that proposes a supposedly optimal query execution plan. However, this is a hard task, since there may be thousands of possible query plans to consider and the optimizer may not choose the best one. Herein we propose the use of machine learning techniques to help find the best query plan for a given query quickly, and thus improve the SPARQL servers’ performance. We base such optimization on modeling SPARQL queries according to their complexity, the operators used within the queries, and the data accessed, among others. In this work we propose the use of Dense Neural Networks to improve such SPARQL query processing times. We present the general architecture of a neural network for optimizing SPARQL queries and the results over a synthetic benchmark and real-world queries. We show that the use of Dense Neural Networks improves on the performance of the Nu-SVR approach by about 50%. We also contribute to the community with a dataset of 19,000 queries.
The most competitive heuristics for calculating the median string are those that use perturbation-based iterative algorithms. Given the complexity of this problem, which under many formulations is NP-hard, the computational cost involved in the exact solution is not affordable. In this work, the heuristic algorithms that solve this problem are addressed, emphasizing their initialization and the policy used to order possible editing operations. Both factors have a significant weight in the solution of this problem. Initial string selection influences the algorithm’s speed of convergence, as does the criterion chosen to select the modification to be made in each iteration of the algorithm. To obtain the initial string, we use the median of a subset of the original dataset; to obtain this subset, we apply the Half Space Proximal (HSP) test to the median of the dataset. This test provides sufficient diversity within the members of the subset while at the same time fulfilling the centrality criterion. Similarly, we provide an analysis of the stop condition of the algorithm, improving its performance without substantially damaging the quality of the solution. To analyze the results of our experiments, we computed the execution time of each proposed modification of the algorithms, the number of computed editing distances, and the quality of the solution obtained. With these experiments, we empirically validated our proposal.
Nowadays, misinformation and hoaxes travel faster than they did in the past, mostly thanks to the emergence of digital platforms and the popularization of social networking sites. Scholars have found that journalists have turned social media platforms into essential components of their professional operations. However, the extent to which journalists engage in debunking misinformation on social media is still unclear. By conducting a U.S. nationally representative survey with more than 400 journalists, this study delves into journalists’ perceptions of false information, social media use, and debunking actions to expose misleading content in online contexts. Our findings indicate low levels of debunking, although we found factors associated with journalists either confronting or reporting misinformation. On the one hand, journalists who use social media platforms to develop their brands and engage directly with their audiences are more likely to publicly confront misinformation. On the other hand, journalists who believe social media companies should be held accountable for the spread of fake news do not engage directly in confronting false information, but do report it when they encounter it. Taken together, our findings suggest the journalist-audience relationship plays a central role in understanding debunking behaviors in online spaces.
Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary.
Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.
«Procesamiento de Lenguaje Natural: dónde estamos y qué estamos haciendo» in Revista Bits de Ciencia.
Premio Turing 2019: la revolución de la animación 3D por computadora in Revista Bits de Ciencia.
Historia y evolución de la inteligencia artificial in Revista Bits de Ciencia.
Article: «Indigenous Movements, Parties, And the State: Comparative Lessons From Latin America» in APSA Comparative Politics Newsletter.
Article: «El proyecto Cybersyn: sus antecedentes técnicos» in Cuadernos de Beauchef.
«Conectando la visión y el lenguaje» in: Revista Bits de Ciencia (2021).
«Aprendizaje profundo en sistemas de recomendación» in: Revista Bits de Ciencia (2021).
«Aprendizaje de representaciones en grafos y su importancia en el análisis de redes» in «Revista Bits de Ciencia» (2021).
The automatic detection of rumors in social networks has gained considerable attention from researchers and practitioners during the last decade, due to the consequences of the spread of disinformation in public opinion. Most of the existing methods make use of features extracted from conversational threads, user profiles, and structural information of the network. These features are difficult to capture in practice and are often only partially available during the spread of rumors. In this paper, we study an unexplored approach in rumor detection: time series classification (TSC). By modeling the problem using time series, we avoid using lexical or structural characteristics of the network. Instead, we use information that is simpler to capture, such as the volume of tweets and the number of followers and followees of the users involved in a story. In this way, the characterization of the story is not related to specific users, but to variables aggregated at the event level. We introduce a TSC-based model for detecting rumors based on hypergraph partitioning, aligning time series prototypes with rumor classes. Our approach uses a Siamese network to train a rumor detection model in a supervised way, minimizing the distance between the time series of the training examples and the prototypes of their class. Results on benchmark data show that our approach surpasses other TSC-based methods in detecting rumors. We also compare our method’s performance with methods that make use of lexical and structural characteristics. Our experiments show that our method has advantages in time-sensitive contexts, outperforming the state of the art in early detection scenarios with incomplete information.
We introduce the Automatic Learning for the Rapid Classification of Events (ALeRCE) broker, an astronomical alert broker designed to provide a rapid and self-consistent classification of large etendue telescope alert streams, such as that provided by the Zwicky Transient Facility (ZTF) and, in the future, the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST). ALeRCE is a Chilean-led broker run by an interdisciplinary team of astronomers and engineers, working to become intermediaries between survey and follow-up facilities. ALeRCE uses a pipeline which includes the real-time ingestion, aggregation, cross-matching, machine learning (ML) classification, and visualization of the ZTF alert stream. We use two classifiers: a stamp-based classifier, designed for rapid classification, and a light-curve-based classifier, which uses the multi-band flux evolution to achieve a more refined classification. We describe in detail our pipeline, data products, tools and services, which are made public for the community (see https://alerce.science). Since we began operating our real-time ML classification of the ZTF alert stream in early 2019, we have grown a large community of active users around the globe. We describe our results to date, including the real-time processing of 9.7×10⁷ alerts, the stamp classification of 1.9×10⁷ objects, the light curve classification of 8.5×10⁵ objects, the report of 3088 supernova candidates, and different experiments using LSST-like alert streams. Finally, we discuss the challenges ahead to go from a single stream of alerts such as ZTF to a multi-stream ecosystem dominated by LSST.
In the last decade, substantial progress has been made towards standardizing the syntax of graph query languages, and towards understanding their semantics and complexity of evaluation. In this paper, we consider temporal property graphs (TPGs) and propose temporal regular path queries (TRPQs) that incorporate time into TPG navigation. Starting with design principles, we propose a natural syntactic extension of the MATCH clause of popular graph query languages. We then formally present the semantics of TRPQs, and study the complexity of their evaluation. We show that TRPQs can be evaluated in polynomial time if TPGs are time-stamped with time points, and identify fragments of the TRPQ language that admit efficient evaluation over a more succinct interval-annotated representation. Finally, we implement a fragment of the language in a state-of-the-art dataflow framework, and experimentally demonstrate that TRPQs can be evaluated efficiently.
Gaussian processes (GPs) are used widely in the analysis of astronomical time series. GPs with rational spectral densities have state-space representations which allow $\mathcal{O}(n)$ evaluation of the likelihood. We calculate analytic state-space representations for the damped simple harmonic oscillator and the Matérn 1/2, 3/2 and 5/2 processes.
The classic classification scheme for Active Galactic Nuclei (AGNs) was recently challenged by the discovery of the so-called changing-state (changing-look) AGNs (CSAGNs). The physical mechanism behind this phenomenon is still a matter of open debate and the samples are too small and of serendipitous nature to provide robust answers. In order to tackle this problem, we need to design methods that are able to detect AGNs right in the act of changing state. Here we present an anomaly detection (AD) technique designed to identify AGN light curves with anomalous behaviors in massive datasets. The main aim of this technique is to identify CSAGNs at different stages of the transition, but it can also be used for more general purposes, such as cleaning massive datasets for AGN variability analyses. We used light curves from the Zwicky Transient Facility data release 5 (ZTF DR5), containing a sample of 230,451 AGNs of different classes. The ZTF DR5 light curves were modeled with a Variational Recurrent Autoencoder (VRAE) architecture, which allowed us to obtain a set of attributes from the VRAE latent space that describes the general behaviour of our sample. These attributes were then used as features for an Isolation Forest (IF) algorithm, which is an anomaly detector for a «one class» kind of problem. We used the VRAE reconstruction errors and the IF anomaly score to select a sample of 8,809 anomalies. These anomalies are dominated by bogus candidates, but we were able to identify 75 promising CSAGN candidates.
We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
«Recordar el futuro, la frontera de las tecnologías de búsqueda», chapter of the book: «De neuronas a galaxias. ¿Es el universo un holograma?» (2021).
Book: «Principles of Databases (Preliminary Version)» (2021).
In Machine Learning, the 𝖲𝖧𝖠𝖯-score is a version of the Shapley value that is used to explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is an intractable problem, we prove a strong positive result stating that the 𝖲𝖧𝖠𝖯-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits are studied in the field of Knowledge Compilation and generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees and Ordered Binary Decision Diagrams (OBDDs).
We also establish the computational limits of the 𝖲𝖧𝖠𝖯-score by observing that computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider. It also implies that computing 𝖲𝖧𝖠𝖯-scores is intractable as well over the class of propositional formulas in DNF. Based on this negative result, we look for the existence of fully-polynomial randomized approximation schemes (FPRAS) for computing 𝖲𝖧𝖠𝖯-scores over such a class. In contrast to the model counting problem for DNF formulas, which admits an FPRAS, we prove that no such FPRAS exists for the computation of 𝖲𝖧𝖠𝖯-scores. Surprisingly, this negative result holds even for the class of monotone formulas in DNF. These techniques can be further extended to prove another strong negative result: Under widely believed complexity assumptions, there is no polynomial-time algorithm that checks, given a monotone DNF formula φ and features x,y, whether the 𝖲𝖧𝖠𝖯-score of x in φ is smaller than the 𝖲𝖧𝖠𝖯-score of y in φ.
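To make the notion of a SHAP-score concrete, the following minimal sketch computes it by brute force for a tiny Boolean model. It is not the polynomial-time algorithm from the paper: it enumerates all feature subsets and all inputs, so it is exponential in the number of features. The uniform distribution over inputs, the function names, and the example model `x0 AND (x1 OR x2)` are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def expected_value(model, entity, fixed):
    """Average model output over all binary inputs that agree with
    `entity` on the features in `fixed` (uniform distribution assumed)."""
    n = len(entity)
    free = [i for i in range(n) if i not in fixed]
    total = 0
    for mask in range(2 ** len(free)):
        x = list(entity)
        for bit, i in enumerate(free):
            x[i] = (mask >> bit) & 1
        total += model(x)
    return total / (2 ** len(free))


def shap_score(model, entity, feature):
    """Shapley value of `feature`, where the value of a feature set S is
    the expected model output with the features in S fixed as in `entity`."""
    n = len(entity)
    others = [i for i in range(n) if i != feature]
    score = 0.0
    for size in range(n):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for subset in combinations(others, size):
            s = set(subset)
            gain = (expected_value(model, entity, s | {feature})
                    - expected_value(model, entity, s))
            score += weight * gain
    return score


def model(x):
    # Hypothetical model for illustration: x0 AND (x1 OR x2)
    return int(x[0] and (x[1] or x[2]))


entity = (1, 1, 0)
scores = [shap_score(model, entity, i) for i in range(len(entity))]
print(scores)
```

By the efficiency property of Shapley values, the scores sum to the model's output on the entity minus its average output over all inputs, which is a convenient sanity check for any implementation.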
Evidence is emerging that older adults share more disinformation online and are more politically polarized. Based on the Facebook Privacy-Protected Full URLs Data Set, a vast Facebook database with demographic information of those who saw and shared links on Facebook in 46 countries, we investigated the impact of age on link-sharing activity. We found that in 45 of 46 countries, the average age of people who shared links was considerably higher than the age of those who saw the links. In a more detailed study, with 82% of Facebook users residing in South America, we found that the average age increases consecutively in the sharing of non-political content, in the sharing of political content and in the sharing of partisan sites.
Body-mass index (BMI) is a well-known marker of adiposity across all ages. The genetic architecture of BMI has been thoroughly studied among adults. In contrast, there are only a few genome-wide association studies (GWAS) on children. Further, GWAS on children have been performed almost exclusively in Europeans at single ages. We aimed to better understand the genetic architecture of BMI trajectory across ages and how BMI is affected by Native American genetic ancestry. We performed cross-sectional and longitudinal GWAS for BMI-related traits on 904 admixed Chilean children with mostly European and Mapuche Native American genetic ancestry. We focused on BMI and two traits that occur at the minimum of the childhood BMI growth trajectory, namely, age at adiposity rebound (Age-AR) and BMI at adiposity rebound (BMI-AR). We found several variants in the immune gene HLA-DQB3 that are strongly associated with BMI at ages 1.5-2.5 years old, but not at other ages. We also identified a variant in the sex-determining gene DMRT1 significantly associated with Age-AR (P = 9.8 × 10⁻⁹). Further, BMI was significantly higher in Mapuche than in European children at all ages between 5.5 and 16.5 years old, but not before. Finally, Age-AR was significantly lower (P = 0.013) by 1.64 years in the Mapuche children compared with Europeans.
Summarization has usually relied on gold standard summaries to train extractive or abstractive models. Social media brings a hurdle to summarization techniques since it requires addressing a multi-document multi-author approach. We address this challenging task by introducing a novel method that generates abstractive summaries of online news discussions. Our method extends a BERT-based architecture, including an attention encoding that is fed comments’ likes during the training stage. To train our model, we define a task which consists of reconstructing high impact comments based on popularity (likes). Accordingly, our model learns to summarize online discussions based on their most relevant comments. Our novel approach provides a summary that represents the most relevant aspects of a news item that users comment on, incorporating the social context as a source of information to summarize texts in online social networks. Our model is evaluated using ROUGE scores between the generated summary and each comment on the thread. Our model, including the social attention encoding, significantly outperforms both extractive and abstractive summarization methods based on such evaluation.
Book: «Las condiciones sociohistóricas de América Latina: Un abordaje desde el desarrollo, la autoridad, la política y la historia»
This chapter examines the state of investigative journalism in Latin America. It focuses on four major trends: the rise of collaborative forms of regional and global reporting, the consolidation of vibrant digital news sites that scrutinise political and economic power in relation to a range of social issues, the use of data journalism techniques, and the rise of fact checkers to debunk misinformation in the region. Investigative journalism was hamstrung by powerful political and economic forces, both at the level of the state and media corporations more interested in pursuing narrow industrial interests than in truth-telling. Collaborative journalism is “a cooperative arrangement between two or more news and information organisations, which aims to supplement each organisation’s resources and maximize the impact of the content produced”.
Innovación social en ciudades portuarias de Chile is a transdisciplinary work on the city-port relationship at the Latin American level. It is essential reading for understanding the complexities associated with port logistics on this continent. It offers an exhaustive diagnosis of the relationship between ports and metropolitan cities. Through an in-depth analysis of the case of the Concepción Metropolitan Area, in the Biobío Region, the research addresses the problems and opportunities in the coexistence of urban communities with complex port systems.
The sudden loss of smell is among the earliest and most prevalent symptoms of COVID-19 when measured with a clinical psychophysical test. Research has shown the potential impact of frequent screening for olfactory dysfunction, but existing tests are expensive and time consuming. We developed a low-cost ($0.50/test) rapid psychophysical olfactory test (KOR) for frequent testing and a model-based COVID-19 screening framework using a Bayes Network symptoms model. We trained and validated the model on two samples: suspected COVID-19 cases in five healthcare centers (n = 926; 33% prevalence, 309 RT-PCR confirmed) and healthy miners (n = 1,365; 1.1% prevalence, 15 RT-PCR confirmed). The model predicted COVID-19 status with 76% and 96% accuracy in the healthcare and miners samples, respectively (healthcare: AUC = 0.79 [0.75–0.82], sensitivity: 59%, specificity: 87%; miners: AUC = 0.71 [0.63–0.79], sensitivity: 40%, specificity: 97%, at 0.50 infection probability threshold). Our results highlight the potential for low-cost, frequent, accessible, routine COVID-19 testing to support society’s reopening.
Various recent proposals increase the distinguishing power of Graph Neural Networks (GNNs) by propagating features between k-tuples of vertices. The distinguishing power of these «higher-order» GNNs is known to be bounded by the k-dimensional Weisfeiler-Leman (WL) test, yet their O(n^k) memory requirements limit their applicability. Other proposals infuse GNNs with local higher-order graph structural information from the start, thereby inheriting the desirable O(n) memory requirement from GNNs at the cost of a one-time, possibly non-linear, preprocessing step. We propose local graph parameter enabled GNNs as a framework for studying the latter kind of approaches and precisely characterize their distinguishing power, in terms of a variant of the WL test, and in terms of the graph structural properties that they can take into account. Local graph parameters can be added to any GNN architecture, and are cheap to compute. In terms of expressive power, our proposal lies between GNNs and their higher-order counterparts. Further, we propose several techniques to aid in choosing the right local graph parameters. Our results connect GNNs with deep results in finite model theory and finite variable logics. Our experimental evaluation shows that adding local graph parameters often has a positive effect for a variety of GNNs, datasets and graph learning tasks.
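As a rough illustration of the idea in the abstract above, the sketch below runs plain colour refinement (the 1-dimensional WL test) on two small graphs, first with uniform initial colours and then with per-vertex triangle counts as initial colours. The choice of triangle counts as the local graph parameter, the toy graphs, and all function names are assumptions made for this example; it is not the authors' implementation.

```python
from collections import Counter

def triangle_counts(adj):
    """Local graph parameter (assumed for illustration): triangles per vertex."""
    return {
        v: sum(1 for u in adj[v] for w in adj[v] if u < w and w in adj[u])
        for v in adj
    }

def wl_histogram(adj, initial, palette, rounds):
    """Colour refinement (1-WL); returns the histogram of final colours.

    `palette` maps colour signatures to integer ids and is shared between
    graphs so that the resulting colour ids are comparable.
    """
    colors = {v: palette.setdefault(("init", initial[v]), len(palette)) for v in adj}
    for _ in range(rounds):
        colors = {
            v: palette.setdefault(
                (colors[v], tuple(sorted(colors[u] for u in adj[v]))), len(palette)
            )
            for v in adj
        }
    return Counter(colors.values())

def distinguishable(g, h, init_g, init_h):
    """True if colour refinement yields different colour histograms."""
    palette = {}
    rounds = max(len(g), len(h))
    return wl_histogram(g, init_g, palette, rounds) != wl_histogram(h, init_h, palette, rounds)

# Two 2-regular graphs on six vertices: one 6-cycle versus two disjoint triangles.
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}

uniform = {v: 0 for v in range(6)}
plain = distinguishable(hexagon, two_triangles, uniform, uniform)
enriched = distinguishable(
    hexagon, two_triangles, triangle_counts(hexagon), triangle_counts(two_triangles)
)
```

In this toy case `plain` is `False` and `enriched` is `True`: colour refinement alone cannot separate the two graphs, while the triangle counts computed in a one-time preprocessing step can.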
Several queries and scores have recently been proposed to explain individual predictions over ML models. Given the need for flexible, reliable, and easy-to-apply interpretability methods for ML models, we foresee the need for developing declarative languages to naturally specify different explainability queries. We do this in a principled way by rooting such a language in a logic, called FOIL, that allows for expressing many simple but important explainability queries, and might serve as a core for more expressive interpretability languages. We study the computational complexity of FOIL queries over two classes of ML models often deemed to be easily interpretable: decision trees and OBDDs. Since the number of possible inputs for an ML model is exponential in its dimension, the tractability of the FOIL evaluation problem is delicate but can be achieved by either restricting the structure of the models or the fragment of FOIL being evaluated. We also present a prototype implementation of FOIL wrapped in a high-level declarative language and perform experiments showing that such a language can be used in practice.
Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be mainly made by retrieving the most relevant descriptions from a corpus or generating a new one given a context video. These two ways represent essential tasks for Computer Vision and Natural Language Processing communities, called text retrieval from video task and video captioning/description task. These two tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths for the problem requirements. We also show the progress that researchers have made on each dataset, we cover the challenges in the field, and we discuss future research directions.
A novel first-order moving-average model for analyzing time series observed at irregularly spaced intervals is introduced. Two definitions are presented, which are equivalent under Gaussianity. The first one relies on normally distributed data and the specification of second-order moments. The second definition provided is more flexible in the sense that it allows for considering other distributional assumptions. The statistical properties are investigated along with the one-step linear predictors and their mean squared errors. It is established that the process is strictly stationary under normality and weakly stationary in the general case. Maximum likelihood and bootstrap estimation procedures are discussed and the finite-sample behavior of these estimates is assessed through Monte Carlo experiments. In these simulations, both methods perform well in terms of estimation bias and standard errors, even with relatively small sample sizes. Moreover, we show that for non-Gaussian data, specifically Student’s t and generalized error distributions, the parameters of the model can be estimated precisely by maximum likelihood. The proposed IMA model is compared to the continuous autoregressive moving average (CARMA) models, exhibiting good performance. Finally, the practical application and usefulness of the proposed model are illustrated with two real-life data examples.
This chapter summarizes contributions made by Ricardo Baeza-Yates, Francesco Bonchi, Kate Crawford, Laurence Devillers and Eric Salobir in the session chaired by Françoise Fogelman-Soulié on AI & Human values at the Global Forum on AI for Humanity. It provides an overview of key concepts and definitions relevant for the study of inequalities and Artificial Intelligence. It then presents and discusses concrete examples of inequalities produced by AI systems, highlighting their variety and potential harmfulness. Finally, we conclude by discussing how putting human values at the core of AI requires answering many questions, still open for further research.
This trend study describes changes and continuities in the stratification of usage of Facebook, Twitter, Instagram, and WhatsApp in Chile between 2009 and 2019, the decade that witnessed the rise of social media. Using the Youth, Media and Participation Study, a probabilistic survey conducted on an annual basis among 1,000 individuals aged 18 to 29 living in the three largest urban areas in Chile (N = 10,518), we analyze how frequency of use and type of activities conducted on social media have varied over time along socioeconomic status, gender, and age cohort. Instead of a uniform trend towards less (or greater) inequality, the results show that each platform exhibits a unique dynamic. For instance, whereas SES-based inequality in frequency of use has decreased on Facebook over time, it has remained stable on WhatsApp and increased on Twitter and Instagram. In addition, significant differences in the likelihood of conducting different activities (e.g., chatting, commenting on news, sharing links) remained across groups, even on platforms such as Facebook where frequency of use has equalized over time.
In several disciplines it is common to find time series measured at irregular observational times. In particular, in astronomy there are a large number of surveys that gather information over irregular time gaps and in more than one passband. Some examples are Pan-STARRS, ZTF and also the LSST. However, current commonly used time series models that estimate the time dependency in astronomical light curves consider the information of each band separately (e.g., CIAR, IAR and CARMA models), disregarding the dependency that might exist between different passbands. In this paper we propose a novel bivariate model for irregularly sampled time series, called the bivariate irregular autoregressive (BIAR) model. The BIAR model assumes an autoregressive structure on each time series, is stationary, and allows estimating the autocorrelation, the cross-correlation and the contemporary correlation between two unequally spaced time series. We implemented the BIAR model on light curves, in the g and r bands, obtained from the ZTF alerts processed by the ALeRCE broker. We show that if the light curves of the two bands are highly correlated, the bivariate model yields more accurate forecasts and predictions than a similar method that uses only univariate information. Further, the estimated parameters of the BIAR are useful to characterize Long Period Variable Stars and to distinguish between classes of stochastic objects, providing promising features that can be used for classification purposes.
Medical images are an essential input for the timely diagnosis of pathologies. Despite their wide use in the area, searching for images that can reveal valuable information to support decision-making is difficult and expensive. However, the possibilities that open when making large repositories of images available for search by content are unsuspected. We designed a content-based image retrieval system for medical imaging, which reduces the gap between access to information and the availability of useful repositories to meet these needs. The system operates on the principle of query-by-example, in which users provide medical images, and the system displays a set of related images. Unlike metadata match-driven searches, our system drives content-based search. This allows the system to conduct searches on repositories of medical images that do not necessarily have complete and curated metadata. We explore our system’s feasibility on computed tomography (CT) slices for SARS-CoV-2 infection (COVID-19), showing that our proposal obtains promising results and compares favorably with other search methods.
Word embeddings are vital descriptors of words in unigram representations of documents for many tasks in natural language processing and information retrieval. The representation of queries has been one of the most critical challenges in this area because queries consist of only a few terms and have little descriptive capacity. Strategies such as average word embeddings can enrich the queries’ descriptive capacity since they favor the identification of related terms from the continuous vector representations that characterize these approaches. We propose a data-driven strategy to combine word embeddings. We use IDF combinations of embeddings to represent queries, showing that these representations outperform the average word embeddings recently proposed in the literature. Experimental results on benchmark data show that our proposal performs well, suggesting that data-driven combinations of word embeddings are a promising line of research in ad-hoc information retrieval.
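The difference between averaging and an IDF-based combination of query term embeddings can be sketched as follows. The abstract does not spell out the exact weighting scheme, so the smoothed IDF formula, the toy corpus, the two-dimensional vectors, and the function names below are all assumptions made for illustration.

```python
import math

def smoothed_idf(term, documents):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + df)) + 1.0

def combine(query_terms, embeddings, weight):
    """Weighted mean of the embeddings of the query terms that have a vector."""
    terms = [t for t in query_terms if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    total = sum(weight(t) for t in terms)
    if not terms or total == 0:
        return [0.0] * dim
    return [
        sum(weight(t) * embeddings[t][i] for t in terms) / total
        for i in range(dim)
    ]

# Hypothetical toy corpus (sets of terms) and toy 2-d embeddings.
documents = [
    {"the", "jaguar", "runs"},
    {"the", "car", "runs"},
    {"the", "market", "falls"},
]
embeddings = {"the": [1.0, 0.0], "jaguar": [0.0, 1.0]}
query = ["the", "jaguar"]

average = combine(query, embeddings, lambda t: 1.0)
idf_weighted = combine(query, embeddings, lambda t: smoothed_idf(t, documents))
```

With these toy values, `average` weights both terms equally, whereas `idf_weighted` leans towards the vector of the rarer term "jaguar", which is the intuition behind letting corpus statistics drive the combination.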
This study analyzes what “emergency sources” (authorities, emergency managers, and experts) expect from journalists during a disaster, using a mixed-method approach with six focus groups and a survey of 166 official Chilean sources. Based on the first three levels of the hierarchy of influences model, we explore how they perceive journalists’ roles and performance when covering disasters. The results suggest that emergency sources’ evaluations, while affected by a combination of individual, routine, and organizational variables, are mostly shaped by sources’ direct and mediated experience with journalists. Thus, a more fluid relationship between journalists and emergency sources, as well as more communication experience by sources, could lead to a better understanding between both groups, which, ultimately, may lead to delivering more accurate and timely information.
The design and registration of Pre-analysis Plans (PAP) represents a significant improvement in social science research transparency. This tool is commonly used in experimental research. In this research note, we suggest extending the use of PAP to qualitative research. In recent decades, researchers have produced several methodological innovations, which have improved the quality of qualitative analysis. New tools have also been developed, and researchers have taken important steps to improve data collection and transparency in the analysis of qualitative data. The development of the Pre-analysis Plan-Qualitative (PAP-Q) aims to synthesize these advances into a guide for researchers, in order to improve transparency and better specify the role of induction in the construction of causal arguments.
The recent incidents involving Dr. Timnit Gebru, Dr. Margaret Mitchell, and Google have triggered an important discussion emblematic of issues arising from the practice of AI Ethics research. We offer this paper and its bibliography as a resource to the global community of AI Ethics Researchers who argue for the protection and freedom of this research community. Corporate, as well as academic research settings, involve responsibility, duties, dissent, and conflicts of interest. This article is meant to provide a reference point at the beginning of this decade regarding matters of consensus and disagreement on how to enact AI Ethics for the good of our institutions, society, and individuals. We have herein identified issues that arise at the intersection of information technology, socially encoded behaviors, and biases, and individual researchers’ work and responsibilities. We revisit some of the most pressing problems with AI decision-making and examine the difficult relationships between corporate interests and the early years of AI Ethics research. We propose several possible actions we can take collectively to support researchers throughout the field of AI Ethics, especially those from marginalized groups who may experience even more barriers in speaking out and having their research amplified. We promote the global community of AI Ethics researchers and the evolution of standards accepted in our profession guiding a technological future that makes life better for all.
Ad hoc information retrieval (ad hoc IR) is a challenging task consisting of ranking text documents for bag-of-words (BOW) queries. Classic approaches based on query and document text vectors use term-weighting functions to rank the documents. One limitation of these methods is their inability to handle polysemic concepts. In addition, these methods introduce fake orthogonalities between semantically related words. To address these limitations, model-based IR approaches based on topics have been explored. Specifically, topic models based on Latent Dirichlet Allocation (LDA) allow building representations of text documents in the latent space of topics, better modeling polysemy and avoiding the generation of orthogonal representations between related terms. We extend LDA-based IR strategies using different ensemble strategies. Model selection obeys the ensemble learning paradigm, for which we test two successful approaches widely used in supervised learning. We study Boosting and Bagging techniques for topic models, using each model as a weak IR expert. Then, we merge the ranking lists obtained from each model using a simple but effective top-k list fusion approach. We show that our proposal strengthens the results in precision and recall, outperforming classic IR models and strong baselines based on topic models.
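The abstract does not say which list fusion rule is used, so the sketch below substitutes reciprocal rank fusion, a well-known simple rule, purely to illustrate how the rankings of several weak IR experts might be merged into one top-k list. The function name, the `depth` and `smoothing` parameters, and the document ids are hypothetical.

```python
from collections import defaultdict

def fuse_rankings(rankings, depth=10, smoothing=60):
    """Merge ranked lists with reciprocal rank fusion (a stand-in technique).

    Only the first `depth` entries of each list are considered, and the fused
    list is truncated to `depth` entries as well.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:depth], start=1):
            scores[doc_id] += 1.0 / (smoothing + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:depth]

# Hypothetical rankings produced by three topic-model "experts".
expert_rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = fuse_rankings(expert_rankings, depth=3)
```

Here `fused` places "d2" first because it is ranked at or near the top by every expert, while "d4", which only one expert retrieves, drops out of the top three.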
Recently, few-shot video classification has received an increasing interest. Current approaches mostly focus on effectively exploiting the temporal dimension in videos to improve learning under low data regimes. However, most works have largely ignored that videos are often accompanied by rich textual descriptions that can also be an essential source of information to handle few-shot recognition cases. In this paper, we propose to leverage these human-provided textual descriptions as privileged information when training a few-shot video classification model. Specifically, we formulate a text-based task conditioner to adapt video features to the few-shot learning task. Furthermore, our model follows a transductive setting to improve the task-adaptation ability of the model by using the support textual descriptions and query instances to update a set of class prototypes. Our model achieves state-of-the-art performance on four challenging benchmarks commonly used to evaluate few-shot video action classification models.
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It was established in the 1950s, and is theoretically justified by being the only wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers, and for the hard cases we present approximation algorithms.
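To make the notion concrete, the following brute-force sketch computes the Shapley value of each fact for a Boolean conjunctive query on a tiny database. It assumes a simplified game in which every fact is a player and a coalition is worth 1 when the query holds on it and 0 otherwise; the relation names, the query, and the exponential enumeration are illustrative choices, not the algorithms of the paper.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def shapley(facts, query, target):
    """Exact Shapley value of `target` by enumerating all coalitions."""
    others = [f for f in facts if f != target]
    n = len(facts)
    value = Fraction(0)
    for size in range(len(others) + 1):
        # Probability weight of a coalition of this size preceding `target`.
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            coalition = set(subset)
            value += weight * (query(coalition | {target}) - query(coalition))
    return value

def q(facts):
    """Boolean query: there exist x, y with R(x) and S(x, y)."""
    return any(
        rel == "R" and any(rel2 == "S" and args2[0] == args[0] for rel2, args2 in facts)
        for rel, args in facts
    )

# Hypothetical database with three facts.
r1 = ("R", (1,))
s12 = ("S", (1, 2))
s13 = ("S", (1, 3))
database = [r1, s12, s13]

contributions = {fact: shapley(database, q, fact) for fact in database}
```

In this example `R(1)` receives 2/3 and each `S` fact receives 1/6: the query cannot hold without `R(1)`, whereas the two `S` facts are interchangeable. The contributions sum to 1, the value of the query on the full database.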
Mobile instant messaging services (MIMs) are important gateways to news exposure and political conversations. Nevertheless, we still know little about the specific uses and consequences of using messaging apps on other aspects of democratic citizenship. This is especially true in Latin American countries, where usage of MIMs is more widespread than any other social media. Using a two-wave panel survey conducted in the context of the 2017 Chilean elections, this study examines the information sharing practices of WhatsApp users, comparing the antecedents and effects of the spread of personal (e.g., family, work) and public affairs content (e.g., news, political messages). Findings show that sharing on WhatsApp was rather equal across social groups, and that it could exert a significant influence on learning about politics and issues in the news as well as on protesting and other political behaviors. We discuss possible explanations, limitations, and significance of these results for digital journalism research and practice.
Reasoning modulo equivalences is natural for everyone, including mathematicians. Unfortunately, in proof assistants based on type theory, which are frequently used to mechanize mathematical results and carry out program verification efforts, equality is appallingly syntactic, and as a result, exploiting equivalences is cumbersome at best. Parametricity and univalence are two major concepts that have been explored in the literature to transport programs and proofs across type equivalences, but they fall short of achieving seamless, automatic transport. This work first clarifies the limitations of these two concepts when considered in isolation and then devises a fruitful marriage between both. The resulting concept, called univalent parametricity, is an extension of parametricity strengthened with univalence that fully realizes programming and proving modulo equivalences. Our approach handles both type and term dependency, as well as type-level computation. In addition to the theory of univalent parametricity, we present a lightweight framework implemented in the Coq proof assistant that allows the user to transparently transfer definitions and theorems for a type to an equivalent one, as if they were equal. For instance, this makes it possible to conveniently switch between an easy-to-reason-about representation and a computationally efficient representation as soon as they are proven equivalent. The combination of parametricity and univalence supports transport à la carte: basic univalent transport, which stems from a type equivalence, can be complemented with additional proofs of equivalences between functions over these types, in order to be able to transport more programs and proofs, as well as to yield more efficient terms. We illustrate the use of univalent parametricity on several examples, including a recent integration of native integers in Coq. 
This work paves the way to easier-to-use proof assistants by supporting seamless programming and proving modulo equivalences.
Graphs are, by nature, ‘unifying abstractions’ that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query q, we consider the following two problems: Given as input an incomplete database D, (a) return the number of completions of D that satisfy q; or (b) return the number of valuations of the nulls of D yielding a completion that satisfies q. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when q is a self-join–free conjunctive query and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in D (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations: For instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption. Moreover, we find that both (1) and (2) can reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial-time randomized approximation scheme (FPRAS), in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes.
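The distinction between the two counting problems can be seen on a very small instance. The brute-force sketch below enumerates every valuation of the nulls, builds the corresponding completion, and counts both the satisfying valuations and the distinct satisfying completions. The representation of nulls as strings starting with "?", the toy table, and the query are assumptions for illustration only.

```python
from itertools import product

def count_valuations_and_completions(db, domains, query):
    """Return (#valuations, #completions) that satisfy a Boolean query.

    `db` is a list of (relation, args) facts whose arguments may be nulls,
    and `domains` maps each null to its finite list of possible values.
    """
    nulls = sorted(domains)
    satisfying_valuations = 0
    satisfying_completions = set()
    for values in product(*(domains[n] for n in nulls)):
        assignment = dict(zip(nulls, values))
        # Replace nulls by their assigned values; a completion is a set of facts.
        completion = frozenset(
            (rel, tuple(assignment.get(x, x) for x in args)) for rel, args in db
        )
        if query(completion):
            satisfying_valuations += 1
            satisfying_completions.add(completion)
    return satisfying_valuations, len(satisfying_completions)

# A Codd table: unary relation R with two distinct nulls over the same domain.
db = [("R", ("?1",)), ("R", ("?2",))]
domains = {"?1": ["a", "b"], "?2": ["a", "b"]}

def contains_a(facts):
    """Boolean query: R(a) holds."""
    return ("R", ("a",)) in facts

n_valuations, n_completions = count_valuations_and_completions(db, domains, contains_a)
```

Three of the four valuations satisfy the query, but they yield only two distinct completions, `{R(a)}` and `{R(a), R(b)}`, because different valuations can collapse to the same database.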
Based on a geocoded registry of more than four million residents of Santiago, Chile, we build two surname-based networks that reveal the city’s population structure. The first network is formed from paternal and maternal surname pairs. The second network is formed from the isonymic distances between the city’s neighborhoods. These networks uncover the city’s main ethnic groups and their spatial distribution. We match the networks to a socioeconomic index, and find that surnames of high socioeconomic status tend to cluster, be more diverse, and occupy a well-defined quarter of the city. The results are suggestive of a high degree of urban segregation in Santiago.
Background
In Chile, a patient needing a specialty consultation or surgery has to first be referred by a general practitioner, then placed on a waiting list. The Explicit Health Guarantees (GES in Spanish) ensures, by law, the maximum time to solve 85 health problems. Usually, a health professional manually verifies if each referral, written in natural language, corresponds or not to a GES-covered disease. An error in this classification is catastrophic for patients, as it puts them on a non-prioritized waiting list, characterized by prolonged waiting times.
Methods
To support the manual process, we developed and deployed a system that automatically classifies referrals as GES-covered or not using historical data. Our system is based on word embeddings specially trained for clinical text produced in Chile. We used a vector representation of the reason for referral and patient’s age as features for training machine learning models using human-labeled historical data. We constructed a ground truth dataset combining classifications made by three healthcare experts, which was used to validate our results.
Results
The best performing model over ground truth reached an AUC score of 0.94, with a weighted F1-score of 0.85 (0.87 in precision and 0.86 in recall). During seven months of continuous and voluntary use, the system has amended 87 patient misclassifications.
Conclusion
This system is a result of a collaboration between technical and clinical experts, and the design of the classifier was custom-tailored for a hospital’s clinical workflow, which encouraged the voluntary use of the platform. Our solution can be easily expanded across other hospitals since the registry is uniform in Chile.
We consider the problem of designing a succinct data structure for representing the connectivity of planar triangulations. The main result is a new succinct encoding achieving the information-theory optimal bound of 3.24 bits per vertex, while allowing efficient navigation. Our representation is based on the bijection of Poulalhon and Schaeffer (Algorithmica, 46(3):505–527, 2006) that defines a mapping between planar triangulations and a special class of spanning trees, called PS-trees. The proposed solution differs from previous approaches in that operations in planar triangulations are reduced to operations in particular parentheses sequences encoding PS-trees. Existing methods to handle balanced parentheses sequences have to be combined and extended to operate on such specific sequences, essentially for retrieving matching elements. The new encoding supports extracting the d neighbors of a query vertex in O(d) time and testing adjacency between two vertices in O(1) time. Additionally, we provide an implementation of our proposed data structure. In the experimental evaluation, our representation reaches up to 7.35 bits per vertex, improving the space usage of state-of-the-art implementations for planar embeddings.
Internet, social media, and app shutdowns have become frequent, not only in authoritarian states but also in emerging and fragile democracies. As Russian authorities enforced a legal blockage to Instant Messenger Telegram during the past 2 years, many users kept using the app seamlessly thanks to what we call a subversive affordance: a built-in proxy functionality that allows users to seamlessly circumvent the blockage. We claim it is subversive because it allows users to overcome the blockage as the consequence of the app’s development, with a significant fraction of users who did not have to take action to bypass the blockage. By conducting an online survey and performing a meta-cluster analysis, we found a group we labeled the undeprived: people who, despite presenting traits frequently associated with digital divides (such as gender, age, and low levels of digital skills), were able to keep using the app.
Despite the growing scholarship on investigative journalism in Latin America, very few studies have addressed collaboration across newsrooms in the region. By analyzing the responses of 251 journalists who work for investigative units in Latin American news outlets, this study explores a) the reasons why Latin American journalists are increasingly seeking to participate in national and transnational collaborative enterprises, b) the challenges they identify, and c) the role digital technologies are playing in this trend of transnational collaboration. Using mixed methods, we found that collaborations occur to enhance the impact of investigative projects, to reach larger audiences, and to achieve a big picture coverage. We also found that safety is an important motivation to work in conjunction with other newsrooms—by collaborating, journalists are able to strengthen security measures and challenge censorship. Yet, coordinating teams—especially at the transnational level—remains the biggest challenge to overcome. Digital technologies are significantly related to reporters’ likelihood of collaborating, but these technologies require other reporting skills to be useful for investigative journalism. Implications for research and practice are discussed.
There is a recently established correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints. In this work, answer-set programs that specify database repairs are used as a basis for solving computational and reasoning problems around causality in databases, including causal responsibility. Furthermore, causes are introduced also at the attribute level by appealing to an attribute-based repair semantics that uses null values. Corresponding repair-programs are introduced, and used as a basis for computation and reasoning about attribute-level causes. The answer-set programs are extended in order to capture causality under integrity constraints.
Raster time series, a.k.a. temporal rasters, are collections of rasters covering the same region at consecutive timestamps. These data have been used in many different applications ranging from weather forecast systems to monitoring of forest degradation or soil contamination. Many different sensors are generating this type of data, which makes such analyses possible, but also challenges the technological capacity to store and retrieve the data. In this work, we propose a space-efficient representation of raster time series that is based on Compact Data Structures (CDS). Our method uses a strategy of snapshots and logs to represent the data, in which both components are represented using CDS. We study two variants of this strategy, one with regular sampling and another one based on a heuristic that determines at which timestamps the snapshots should be created to reduce the space redundancy. We perform a comprehensive experimental evaluation using real datasets. The results show that the proposed strategy is competitive in space with alternatives based on pure data compression, while providing much more efficient query times for different types of queries.
This paper presents the methods and results of the SHREC’21 track on a dataset of cultural heritage (CH) objects. We present a dataset of 938 scanned models that have varied geometry and artistic styles. For the competition, we propose two challenges: the retrieval-by-shape challenge and the retrieval-by-culture challenge. The former aims at evaluating the ability of retrieval methods to discriminate cultural heritage objects by overall shape. The latter focuses on assessing the effectiveness of retrieving objects from the same culture. Both challenges constitute a suitable scenario to evaluate modern shape retrieval methods in a CH domain. Ten groups participated in the challenges: thirty runs were submitted for the retrieval-by-shape task, and twenty-six runs were submitted for the retrieval-by-culture task. The results show a predominance of learning methods on image-based multi-view representations to characterize 3D objects. Nevertheless, the problem presented in our challenges is far from being solved. We also identify the potential paths for further improvements and give insights into the future directions of research.
Despite early promise, scholarship has shown little empirical evidence of learning from the news on social media. At the same time, scholars have documented the problem of information ‘snacking’ and information quality on these platforms. These parallel trends in the literature challenge long-held assumptions about the pro-social effects of news consumption and political participation. We argue that reliance on social media for news does not contribute to people’s real level of political knowledge (objective knowledge), but instead only influences people’s impression of being informed (subjective knowledge). Subjective knowledge is just as important for driving political participation, a potentially troubling trend given the nature of news consumption on social media. We test this expectation with panel survey data from the 2018 U.S. midterm elections. Two path model specifications (fixed effects and autoregressive) support our theoretical model. Implications for the study of the ‘dark side’ of social media and democracy are discussed.
We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and we want to find feature queries that linearly separate the two sets of examples. We focus on conjunctive feature queries, and explore two problems: (a) deciding if separating feature queries exist (separability), and (b) generating such queries when they exist. To restrict the complexity of the generated classifiers, we explore various ways of regularizing them by limiting their dimension, the number of joins in feature queries, and their generalized hypertreewidth (ghw). We show that the separability problem is tractable for bounded ghw; yet, the generation problem is not because feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without explicitly generating such queries.
The need for recursive queries in the Semantic Web setting is becoming more and more apparent with the emergence of datasets where different pieces of information are connected by complicated patterns. This was acknowledged by the W3C committee by the inclusion of property paths in the SPARQL standard. However, as more data becomes available, it is becoming clear that property paths alone are not enough to capture all recursive queries that the users are interested in, and the literature has already proposed several extensions to allow searching for more complex patterns.
We propose a rather different, but simpler approach: add a general purpose recursion operator directly to SPARQL. In this paper we provide a formal syntax and semantics for this proposal, study its theoretical properties, and develop algorithms for evaluating it in practical scenarios. We also show how to implement this extension as a plug-in on top of existing systems, and test its performance on several synthetic and real world datasets, ranging from small graphs, up to the entire Wikidata database.
The problem of parameterized range majority asks us to preprocess a string of length n such that, given the endpoints of a range, one can quickly find all the distinct elements whose relative frequencies in that range are more than a threshold 𝜏. This is a more tractable version of the classical problem of finding the range mode, which is unlikely to be solvable in polylogarithmic time and linear space. In this paper we give the first linear-space solution with optimal 𝒪(1/𝜏) query time, even when 𝜏 can be specified with the query. We then consider data structures whose space is bounded by the entropy of the distribution of the symbols in the sequence. For the case when the alphabet size 𝜎 is polynomial in the computer word size, we retain the optimal time within optimally compressed space (i.e., with sublinear redundancy). Otherwise, either the compressed space is increased by an arbitrarily small constant factor or the time rises to any function in (1/𝜏)⋅𝜔(1). We obtain the same results on the complementary problem of parameterized range minority.
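To make the query itself concrete, the following is a naive baseline that answers a range 𝜏-majority query by scanning the range. It is only a specification of what the query returns, not the linear-space data structure of the paper; the function name and the 0-based inclusive range convention are assumptions made for illustration.

```python
from collections import Counter


def range_majorities(seq, i, j, tau):
    """Return the distinct elements whose relative frequency in seq[i..j]
    (0-based, inclusive) is strictly greater than tau.

    Runs in time linear in the range length, with no preprocessing.
    """
    if not 0 <= i <= j < len(seq):
        raise ValueError("invalid range")
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    length = j - i + 1
    counts = Counter(seq[i:j + 1])
    return sorted(x for x, c in counts.items() if c > tau * length)


text = list("abracadabra")
print(range_majorities(text, 0, 10, 0.25))
print(range_majorities(text, 0, 10, 0.15))
```

Because each reported element occupies more than a 𝜏 fraction of the range, fewer than 1/𝜏 elements can be returned, which is why 𝒪(1/𝜏) is the natural target for query time.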
The recent surge of global populism has led many intellectuals to call for new forms of democratic elitism. Yet research into the sources of support for political organizations and regimes predicts that suppressing opportunities for public participation will likely exacerbate antisystem political tendencies. We cite the recent protests in Chile, a nation that has employed democratic elitism more effectively than perhaps any other, as illustrative of the eventual consequences of suppressing voice. Our research indicates that empowering citizens through vibrant parties and continuous democracy is the best way to avoid populist impulses and waves of contentious politics.
The goal of Question Answering over Knowledge Graphs (KGQA) is to find answers for natural language questions over a knowledge graph. Recent KGQA approaches adopt a neural machine translation (NMT) approach, where the natural language question is translated into a structured query language. However, NMT suffers from the out-of-vocabulary problem, where terms in a question may not have been seen during training, impeding their translation. This issue is particularly problematic for the millions of entities that large knowledge graphs describe. We rather propose a KGQA approach that delegates the processing of entities to entity linking (EL) systems. NMT is then used to create a query template with placeholders that are filled by entities identified in an EL phase. Slot filling is used to decide which entity fills which placeholder. Experiments for QA over Wikidata show that our approach outperforms pure NMT: while there remains a strong dependence on having seen similar query templates during training, errors relating to entities are greatly reduced.
Database tuples can be seen as players in the game of jointly realizing the answer to a query. Some tuples may contribute more than others to the outcome, which can be a binary value in the case of a Boolean query, a number for a numerical aggregate query, and so on. To quantify the contributions of tuples, we use the Shapley value that was introduced in cooperative game theory and has found applications in a plethora of domains. Specifically, the Shapley value of an individual tuple quantifies its contribution to the query. We investigate the applicability of the Shapley value in this setting, as well as the computational aspects of its calculation in terms of complexity, algorithms, and approximation.
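As a small illustration of the definition, the sketch below computes exact Shapley values of tuples for a Boolean query by enumerating all subsets. It assumes that every tuple is a player (the sketch does not model facts that are fixed in advance), it takes exponential time, and the function names and the toy query are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley_values(facts, query):
    """Exact Shapley value of each fact for a numeric or 0/1 query.

    `query` maps a frozenset of facts to a number. Enumerates every
    subset, so it is only usable on tiny databases.
    """
    facts = list(facts)
    n = len(facts)
    values = {}
    for fact in facts:
        others = [g for g in facts if g != fact]
        total = Fraction(0)
        for k in range(n):
            # Probability weight of a coalition of size k preceding `fact`.
            weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (query(s | {fact}) - query(s))
        values[fact] = total
    return values


def exists_join(db):
    """Boolean query: is there an x with R(x) and S(x, y) for some y?"""
    r_keys = {t[1] for t in db if t[0] == "R"}
    s_keys = {t[1] for t in db if t[0] == "S"}
    return int(bool(r_keys & s_keys))


database = [("R", 1), ("R", 2), ("S", 1, "a")]
values = shapley_values(database, exists_join)
print(values)
```

In this toy database the query holds only when both R(1) and S(1, a) are present, so those two tuples share the contribution equally and R(2) contributes nothing.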
We present pumping lemmas for five classes of functions definable by fragments of weighted automata over the min-plus semiring, the max-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus and max-plus semirings.
This paper describes the submission of the IALab group of the Pontifical Catholic University of Chile to the Medical Domain Visual Question Answering (VQA-Med) task. Our participation was rather simple: we approached the problem as image classification. We took a DenseNet121 with its weights pre-trained on ImageNet and fine-tuned it with the VQA-Med 2020 dataset labels to predict the answer. Different answers were treated as different classes, and the questions were disregarded for simplicity since essentially they all ask for abnormalities. With this very simple approach we ranked 7th among 11 teams, with a test set accuracy of 0.236.
This article describes the participation and results of the PUC Chile team in the Tuberculosis task in the context of the ImageCLEFmedical challenge 2021. We were ranked 7th based on the kappa metric and 4th in terms of accuracy. We describe three approaches we tried in order to address the task. Our best approach used 2D images visually encoded with a DenseNet neural network, whose representations were concatenated to finally output the classification with a softmax layer. We describe this approach and the other two in detail, and we conclude by discussing some ideas for future work.
This article describes the PUC Chile team’s participation in the Concept Detection task of the ImageCLEFmedical challenge 2021, in which the team earned fourth place. We made two submissions: the first was based on a naive approach and resulted in an F-1 score of 0.141, and the second was an improved version that leveraged the Perceptual Similarity among images and obtained a final F-1 score of 0.360. We describe in detail our data analysis and our different approaches, and conclude by discussing some ideas for future work.
This article describes PUC Chile team’s participation in the Caption Prediction task of ImageCLEFmedical challenge 2021, which resulted in the team winning this task. We first show how a very simple approach based on statistical analysis of captions, without relying on images, results in a competitive baseline score. Then, we describe how to improve the performance of this preliminary submission by encoding the medical images with a ResNet CNN, pre-trained on ImageNet and later fine-tuned with the challenge dataset. Afterwards, we use this visual encoding as the input for a multi-label classification approach for caption prediction. We describe in detail our final approach, and we conclude by discussing some ideas for future work.
From administrative registers of last names in Santiago, Chile, we create a surname affinity network that encodes socioeconomic data. This network is a multi-relational graph with nodes representing surnames and edges representing the prevalence of interactions between surnames by socioeconomic decile. We model the prediction of links as a knowledge base completion problem, and find that sharing neighbors is highly predictive of the formation of new links. Importantly, we distinguish between grounded neighbors and neighbors in the embedding space, and find that the latter is more predictive of tie formation. The paper discusses the implications of this finding in explaining the high levels of elite endogamy in Santiago.
A graph generator is a tool that creates graph-like data whose structural properties are very similar to those found in real-world networks. This paper presents two methods to generate graphs with power-law edge distribution based on the MapReduce processing model that can be easily implemented to run on top of Apache Hadoop. The proposed methods allow the generation of directed and undirected power-law distributed graphs without repeated edges. Our experimental evaluation shows that our methods are efficient and scalable in terms of both graph size and cluster capacity.
To avoid the «meaning conflation deficiency» of word embeddings, a number of models have aimed to embed individual word senses. These methods at one time performed well on tasks such as word sense induction (WSI), but they have since been overtaken by task-specific techniques which exploit contextualized embeddings. However, sense embeddings and contextualization need not be mutually exclusive. We introduce PolyLM, a method which formulates the task of learning sense embeddings as a language modeling problem, allowing contextualization techniques to be applied. PolyLM is based on two underlying assumptions about word senses: firstly, that the probability of a word occurring in a given context is equal to the sum of the probabilities of its individual senses occurring; and secondly, that for a given occurrence of a word, one of its senses tends to be much more plausible in the context than the others. We evaluate PolyLM on WSI, showing that it performs considerably better than previous sense embedding techniques, and matches the current state-of-the-art specialized WSI method despite having six times fewer parameters.
There is a resurgence of interest in political parties. This resurgent interest embraces a minimalist definition of political parties, according to which any group that competes in elections and receives a handful of votes qualifies as a party. Parties, however, are expected to contribute to democratic representation, and the party politics literature has extensively shown that many “parties” do not fulfill this expectation. These entities that possess some but not all defining features of political parties can be considered diminished subtypes of the category. A thorough conceptualization of diminished subtypes could improve the analytical value of the study of political parties and of other forms of electoral political organizations. In this article, therefore, we put forth a new typology of diminished subtypes of political parties based on the presence or absence of two primary attributes: horizontal coordination of ambitious politicians during electoral campaigns and while in office and vertical aggregation to electorally mobilize collective interests and to intermediate and channel collective demands.
This study places the “cognitive elaboration model” on news gathering and political behavior within the dual-processing “elaboration likelihood model” to derive hypotheses about the effects of incidental news exposure and tests them using two-wave panel data. Results indicate incidental news exposure predicts online participation but not offline participation – underlining the importance of differentiating between political behaviors in the two environments. The key finding, however, is that news elaboration mediates the positive relationship between incidental exposure and political participation, which is theorized as taking place through the peripheral route of elaboration – as opposed to intentional exposure, which engages the central route.
Continuous learning occurs naturally in human beings. However, Deep Learning methods suffer from a problem known as Catastrophic Forgetting (CF) that consists of a model drastically decreasing its performance on previously learned tasks when it is sequentially trained on new tasks. This situation, known as task interference, occurs when a network modifies relevant weight values as it learns a new task. In this work, we propose two main strategies to face the problem of task interference in convolutional neural networks. First, we use a sparse coding technique to adaptively allocate model capacity to different tasks avoiding interference between them. Specifically, we use a strategy based on group sparse regularization to specialize groups of parameters to learn each task. Afterward, by adding binary masks, we can freeze these groups of parameters, using the rest of the network to learn new tasks. Second, we use a meta learning technique to foster knowledge transfer among tasks, encouraging weight reusability instead of overwriting. Specifically, we use an optimization strategy based on episodic training to foster learning weights that are expected to be useful to solve future tasks. Together, these two strategies help us to avoid interference by preserving compatibility with previous and future weight values. Using this approach, we achieve state-of-the-art results on popular benchmarks used to test techniques to avoid CF. In particular, we conduct an ablation study to identify the contribution of each component of the proposed method, demonstrating its ability to avoid retroactive interference with previous tasks and to promote knowledge transfer to future tasks.
When learning tasks over time, artificial neural networks suffer from a problem known as Catastrophic Forgetting (CF). This happens when the weights of a network are overwritten during the training of a new task causing forgetting of old information. To address this issue, we propose MetA Reusable Knowledge or MARK, a new method that fosters weight reusability instead of overwriting when learning a new task. Specifically, MARK keeps a set of shared weights among tasks. We envision these shared weights as a common Knowledge Base (KB) that is not only used to learn new tasks, but also enriched with new knowledge as the model learns new tasks. Key components behind MARK are two-fold. On the one hand, a metalearning approach provides the key mechanism to incrementally enrich the KB with new knowledge and to foster weight reusability among tasks. On the other hand, a set of trainable masks provides the key mechanism to selectively choose from the KB relevant weights to solve each task. By using MARK, we achieve state of the art results in several popular benchmarks, surpassing the best performing methods in terms of average accuracy by over 10% on the 20-Split-MiniImageNet dataset, while achieving almost zero forgetfulness using 55% of the number of parameters. Furthermore, an ablation study provides evidence that, indeed, MARK is learning reusable knowledge that is selectively used by each task.
Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of z with respect to b. In this paper we prove that z = O(b log(n/b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a text family where z = Ω(b log n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We continue by observing that Lempel-Ziv is just one particular case of greedy parses (meaning that it obtains the smallest parse by scanning the text and maximizing the phrase length at each step) and of ordered parses (meaning that phrases are larger than their sources under some order). As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size v of the optimal lexicographical parse is also obtained greedily in O(n) time, that v = O(b log(n/b)), and that there exists a text family where v = Ω(b log n). Interestingly, we also show that v = O(r) because r also induces a lexicographical parse…
Entrepreneurial learning experiences have become one of the key aspects of the state of the art in engineering education. As such, technology-focused entrepreneurship courses have been incorporated into engineering curricula, both in developed and developing countries. Following this trend, the Engineering School at Pontificia Universidad Católica de Chile (PUC-Engineering) designed a third-year compulsory course on research, entrepreneurship, and innovation, whose objective is to provide students with entrepreneurial skills that transcend time. To continuously improve this course, the Engineering Education Unit at PUC-Engineering has been conducting pre- and post-surveys, assessing self-efficacy and learning benefits related to various course methods. This paper describes the main lessons learned as a result of using this data-centric approach throughout the last six academic periods. We found that the course is perceived as beneficial by most of its students, and that project feedback sessions and project presentations report the highest perceived learning benefits. In addition, we describe some of the improvements to the course that have been driven by assessment data, showing the importance of using a data-driven approach for engineering entrepreneurship education.
Puberty is a complex developmental process that varies considerably among individuals and populations. Genetic factors explain a large proportion of the variability of several pubertal traits. Recent genome-wide association studies (GWAS) have identified hundreds of variants involved in traits that result from body growth, like adult height. However, they do not capture many genetic loci involved in growth changes over distinct growth phases. Further, such GWAS have been mostly performed in Europeans, but it is unknown how these findings relate to other continental populations. In this study, we analyzed the genetic basis of three pubertal traits, namely peak height velocity (PV), age at PV (APV) and height at APV (HAPV). We analyzed a cohort of 904 admixed Chilean children and adolescents with European and Mapuche Native American ancestries. Height was measured on roughly a 6-month basis from childhood to adolescence between 2006 and 2019. We predict that, on average, HAPV is 4.3 cm higher in European than in Mapuche adolescents (P = 0.042), and APV is 0.73 years later in European compared with Mapuche adolescents (P = 0.023). Further, by performing a GWAS on 774,433 single-nucleotide polymorphisms, we identified a genetic signal harboring 3 linked variants significantly associated with PV in boys (P < 5 × 10⁻⁸). This signal has never been associated with growth-related trait
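The two measures z and r can be computed on small inputs with the naive sketch below. It assumes the Lempel-Ziv variant in which each phrase is either the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed) or a single fresh symbol, and it builds the Burrows-Wheeler transform by sorting rotations with a `$` terminator assumed smaller than every text symbol. The function names are hypothetical, and the running times are far from the linear-time algorithms the abstract refers to.

```python
def lz_parse(text):
    """Greedy left-to-right Lempel-Ziv parse; returns the list of phrases."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for src in range(i):  # candidate sources start strictly to the left
            length = 0
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a symbol with no earlier occurrence is explicit
        phrases.append(text[i:i + step])
        i += step
    return phrases


def bwt_runs(text, sentinel="$"):
    """Number of equal-letter runs in the BWT of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    return sum(1 for k, c in enumerate(bwt) if k == 0 or c != bwt[k - 1])


phrases = lz_parse("abab")
print(phrases, len(phrases))
print(bwt_runs("banana"))
```

On highly repetitive inputs such as a long run of a single letter, the parse collapses to two phrases, which is the kind of behavior that makes z attractive as a repetitiveness measure.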
How do political candidates combine social media campaign tools with on-the-ground political campaigns to pursue segmented electoral strategies? We argue that online campaigns can reproduce and reinforce segmented electoral appeals. Furthermore, our study suggests that electoral segmentation remains a broader phenomenon that includes social media as but one of many instruments by which to appeal to voters. To test our argument, we analyze the case of the 2017 legislative elections in Chile. We combine an analysis of Facebook and online electoral campaign data from 80 congressional campaigns that competed in three districts with ethnographic sources (i.e., campaigns observed on the ground and in-depth interviews with candidates). The results of this novel study suggest that intensive online campaigning mirrors offline segmentation.
The recent incidents involving Dr. Timnit Gebru, Dr. Margaret Mitchell, and Google have triggered an important discussion emblematic of issues arising from the practice of AI Ethics research. We offer this paper and its bibliography as a resource to the global community of AI Ethics Researchers who argue for the protection and freedom of this research community. Corporate, as well as academic research settings, involve responsibility, duties, dissent, and conflicts of interest. This article is meant to provide a reference point at the beginning of this decade regarding matters of consensus and disagreement on how to enact AI Ethics for the good of our institutions, society, and individuals. We have herein identified issues that arise at the intersection of information technology, socially encoded behaviors, and biases, and individual researchers’ work and responsibilities. We revisit some of the most pressing problems with AI decision-making and examine the difficult relationships between corporate interests and the early years of AI Ethics research. We propose several possible actions we can take collectively to support researchers throughout the field of AI Ethics, especially those from marginalized groups who may experience even more barriers in speaking out and having their research amplified. We promote the global community of AI Ethics researchers and the evolution of standards accepted in our profession guiding a technological future that makes life better for all.
Background: In Chile, a patient needing a specialty consultation or surgery has to first be referred by a general practitioner, then placed on a waiting list. The Explicit Health Guarantees (GES in Spanish) ensures, by law, the maximum time to solve 85 health problems. Usually, a health professional manually verifies if each referral, written in natural language, corresponds or not to a GES-covered disease. An error in this classification is catastrophic for patients, as it puts them on a non-prioritized waiting list, characterized by prolonged waiting times.
Methods: To support the manual process, we developed and deployed a system that automatically classifies referrals as GES-covered or not using historical data. Our system is based on word embeddings specially trained for clinical text produced in Chile. We used a vector representation of the reason for referral and patient’s age as features for training machine learning models using human-labeled historical data. We constructed a ground truth dataset combining classifications made by three healthcare experts, which was used to validate our results.
Results: The best performing model over ground truth reached an AUC score of 0.94, with a weighted F1-score of 0.85 (0.87 in precision and 0.86 in recall). During seven months of continuous and voluntary use, the system has amended 87 patient misclassifications.
Conclusion: This system is a result of a collaboration between technical and clinical experts, and the design of the classifier was custom-tailored for a hospital’s clinical workflow, which encouraged the voluntary use of the platform. Our solution can be easily expanded across other hospitals since the registry is uniform in Chile.
Medical imaging is essential nowadays throughout medical education, research, and care. Accordingly, international efforts have been made to set large-scale image repositories for these purposes. Yet, to date, browsing of large-scale medical image repositories has been troublesome, time-consuming, and generally limited by text search engines. A paradigm shift, by means of a query-by-example search engine, would alleviate these constraints and beneficially impact several practical demands throughout the medical field. The current project aims to address this gap in medical imaging consumption by developing a content-based image retrieval (CBIR) system, which combines two image processing architectures based on deep learning. Furthermore, a first-of-its-kind intelligent visual browser was designed that interactively displays a set of imaging examinations with similar visual content on a similarity map, making it possible to search for and efficiently navigate through a large-scale medical imaging repository, even if it has been set with incomplete and curated metadata. Users may, likewise, provide text keywords, in which case the system performs a content- and metadata-based search. The system was fashioned with an anonymizer service and designed to be fully interoperable according to international standards, to stimulate its integration within electronic healthcare systems and its adoption for medical education, research and care. Professionals of the healthcare sector, by means of a self-administered questionnaire, underscored that this CBIR system and intelligent interactive visual browser would be highly useful for these purposes. Further studies are warranted to complete a comprehensive assessment of the performance of the system through case description and protocolized evaluations by medical imaging specialists.
Motivated by the analysis of range queries in databases, we introduce the computation of the depth distribution of a set of n d-dimensional boxes (i.e., axis-aligned d-dimensional hyperrectangles), which generalizes the computation of Klee’s measure and of the maximum depth of such a set. We present an algorithm to compute the depth distribution running in time within (…), using space within (…), and refine these upper bounds for various measures of difficulty of the input instances. Moreover, we introduce conditional lower bounds for this problem which not only provide insights on how fast the depth distribution can be computed, but also clarify the relation between the Depth Distribution problem and other fundamental problems in computer science.
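For intuition, the sketch below computes the depth distribution in the one-dimensional case, where boxes are intervals, with a simple sweep over sorted endpoints. It assumes the depth distribution maps each depth k ≥ 1 to the total length covered by exactly k intervals; the function name is hypothetical, and the sketch says nothing about the d-dimensional algorithm of the paper.

```python
from collections import defaultdict


def depth_distribution_1d(intervals):
    """Map each depth k >= 1 to the total length covered by exactly k intervals.

    `intervals` is an iterable of (lo, hi) pairs with lo < hi.
    """
    events = []
    for lo, hi in intervals:
        if lo >= hi:
            raise ValueError("each interval must satisfy lo < hi")
        events.append((lo, +1))  # interval opens
        events.append((hi, -1))  # interval closes
    events.sort()

    distribution = defaultdict(int)
    depth = 0
    previous = None
    for x, delta in events:
        # Charge the stretch since the previous endpoint to the current depth.
        if previous is not None and depth > 0 and x > previous:
            distribution[depth] += x - previous
        depth += delta
        previous = x
    return dict(distribution)


dist = depth_distribution_1d([(0, 4), (2, 6), (3, 5)])
klee_measure = sum(dist.values())  # length of the union
max_depth = max(dist)              # largest number of overlapping intervals
print(dist, klee_measure, max_depth)
```

The last two lines show how both Klee's measure and the maximum depth fall out of the distribution, which is the sense in which the depth distribution generalizes them.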
Background: Eating disorders are psychological conditions characterized by unhealthy eating habits. Anorexia nervosa (AN) is defined as the belief of being overweight despite being dangerously underweight. The psychological signs involve emotional and behavioral issues. There is evidence that signs and symptoms can manifest on social media, wherein both harmful and beneficial content is shared daily.
Objective: This study aims to characterize Spanish-speaking users showing anorexia signs on Twitter through the extraction and inference of behavioral, demographical, relational, and multimodal data. By using the transtheoretical model of health behavior change, we focus on characterizing and comparing users at the different stages of the model for overcoming AN, including treatment and full recovery periods.
Methods: We analyzed the writings, posting patterns, social relationships, and images shared by Twitter users who underwent different stages of anorexia nervosa and compared the differences among users going through each stage of the illness and users in the control group (ie, users without AN). We also analyzed the topics of interest of their followees (ie, users followed by study participants). We used a clustering approach to distinguish users at an early phase of the illness (precontemplation) from those that recognize that their behavior is problematic (contemplation) and generated models for the detection of tweets and images related to AN. We considered two types of control users—focused control users, which are those that use terms related to anorexia, and random control users.
Results: We found significant differences between users at each stage of the recovery process (P<.001) and control groups. Users with AN tweeted more frequently at night, with a median sleep time tweets ratio (STTR) of 0.05, than random control users (STTR=0.04) and focused control users (STTR=0.03). Pictures were relevant for the characterization of users. Focused and random control users were characterized by the use of text in their profile pictures. We also found a strong polarization between focused control users and users in the first stages of the disorder. There was a strong correlation among the shared interests between users with AN and their followees (ρ=0.96). In addition, the interests of recovered users and users in treatment were more highly correlated to those corresponding to the focused control group (ρ=0.87 for both) than those of AN users (ρ=0.67), suggesting a shift in users’ interest during the recovery process.
Conclusions: We mapped the signs of AN to social media context. These results support the findings of previous studies that focused on other languages and involved a deep analysis of the topics of interest of users at each phase of the disorder. The features and patterns identified provide a basis for the development of detection tools and recommender systems.
Parties are central agents of democratic representation. The literature assumes that this function is an automatic consequence of social structure and/or a product of incentives derived from electoral competition. However, representation is contingent upon the organizational structure of parties. The connection between a party and an organized constituency is not limited to electoral strategy; it includes an organic connection through permanent formal or informal linkages that bind party programmatic positions to social groups’ preferences, regardless of the electoral returns. This article analyzes how the Movimiento al Socialismo (Movement toward Socialism, MAS) in Bolivia and the Frente Amplio (Broad Front, FA) in Uruguay developed two different forms of relationship with social organizations that result from the interplay of historical factors traceable to the parties’ formative phases and party organizational attributes. Party organizational features that grant voice to grassroots activists serve as crucial mechanisms for bottom-up incorporation of societal interests and demands.
Inorganic arsenic (As) is a toxic xenobiotic and carcinogen associated with severe health conditions. The urban population from the Atacama Desert in northern Chile was exposed to extremely high As levels (up to 600 µg/l) in drinking water between 1958 and 1971, leading to increased incidence of urinary bladder cancer (BC), skin cancer, kidney cancer, and coronary thrombosis decades later. In addition, the Andean Native-American ancestors of the Atacama population were exposed to elevated As levels in water (∼120 µg/l) for at least 5,000 years, suggesting adaptation to this selective pressure. Here, we performed two genome-wide selection tests (PBSn1 and an ancestry-enrichment test) in an admixed population from Atacama to identify adaptation signatures to As exposure acquired before and after admixture with Europeans, respectively. The second-highest-ranked variant selected by PBSn1 was associated with LCE4A-C1orf68, a gene that may be involved in the immune barrier of the epithelium during BC. We performed association tests between the top PBSn1 hits and BC occurrence in our population. The strongest association (P = 0.012) was achieved by the LCE4A-C1orf68 variant. The ancestry-enrichment test detected highly significant signals (P = 1.3 × 10⁻⁹) mapping to MAK16, a gene with important roles in ribosome biogenesis during the G1 phase of the cell cycle. Our results contribute to a better understanding of the genetic factors involved in adaptation to the pathophysiological consequences of As exposure.
Higher education institutions are increasingly considering the use of a form of blended learning, commonly called the flipped classroom (FC), in which students watch video lectures drawn from a massive open online course (MOOC) before a face-to-face lecture. This methodology is attractive, as it allows institutions to reuse high-quality material developed for MOOCs, while increasing learning flexibility and students’ autonomy. However, the adoption of this methodology is low in general, especially in Engineering courses, as its implementation poses a number of challenges for students. The most salient challenge is the lack of student self-regulatory skills, which may result in frustration and low performance. In this paper, we study how a self-regulatory learning technological scaffold, which provides students with feedback about their activity in the MOOC, affects the engagement and performance of students in an Engineering course following a MOOC-based FC approach. To this end, we designed an observational study with the participation of 242 students: 133 students in the experimental group (EG), who used a technological scaffold, and 109 in the control group (CG), who did not. We did not find a statistically significant difference between the academic achievements of the two groups. However, the EG exhibited significantly greater engagement with the course and more accurate strategic planning than the CG. The main implications of these results for scaffolding self-regulated learning in FC are discussed.
The power of app-driven mobile phones was first unleashed in 2011 when they were used to mobilize protesters and gain support for political movements in the United States and abroad. Mobile devices have since become the bedrock of political activism. To examine the influence of app reliance on offline and online political participation, this study builds on the Orientation-Stimulus-Reasoning-Orientation-Response (O-S-R-O-R) model by (a) applying the model to mobile apps, (b) testing whether trust in, and reliance on, political discussion are mediators between reliance on apps and political participation, and (c) using trust in both offline and online discussion as measures of cognitive elaboration. This study’s path model suggests that app reliance is related to online political discussion, which, in turn, is related to online political participation, but not offline participation. Although both offline and online discussion are linked to offline and online trust in political discussion, trust in political discussion does not influence either offline or online political participation.
Despite widespread concern, research on the consequences of misinformation on people’s attitudes is surprisingly scant. To fill in this gap, the current study examines the long-term relationship between misinformation and trust in the news media. Based on the reinforcing spirals model, we analyzed data from a three-wave panel survey collected in Chile between 2017 and 2019. We found a weak, over-time relationship between misinformation and media skepticism. Specifically, initial beliefs on factually dubious information were negatively correlated with subsequent levels of trust in the news media. Lower trust in the media, in turn, was related over time to higher levels of misinformation. However, we found no evidence of a reverse, parallel process where media trust shielded users against misinformation, further reinforcing trust in the news media. The lack of evidence of a downward spiral suggests that the corrosive effects of misinformation on attitudes toward the news media are less serious than originally suggested. We close with a discussion of directions for future research.
Multi-agent pathfinding (MAPF) is the problem of finding k non-colliding paths connecting k given initial positions with k given goal positions on a given map. In its sum-of-costs variant, the total number of moves and wait actions performed by agents before they definitively reach the goal is minimized. Not surprisingly, since MAPF is combinatorial, a number of compilations to Boolean Satisfiability (SAT) and Answer Set Programming (ASP) exist. In this article, we describe in detail the first family of compilations to ASP that solve sum-of-costs MAPF over 4-connected grids. Compared to existing ASP compilations, a distinguishing feature of our compilation is that the total number of clauses (after grounding) grows linearly with the number of agents, while in existing compilations it grows quadratically. In addition, the optimization objective is such that its size after grounding does not depend on the size of the grid. In our experimental evaluation, we show that our approach outperforms search-based sum-of-costs MAPF solvers when grids are congested with agents. We also show that our approach is competitive with a SAT-based approach when follow conflicts are taken into account. We also explore the potential of our solver when finding makespan-optimal solutions, in which makespan is minimized first and then cost is minimized. Our results show that makespan-optimal solutions are slightly suboptimal in most benchmarks. Moreover, our MAPF solver, when run in that mode, is faster and scales better.
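To make the objectives in the abstract above concrete, the following is a minimal Python sketch of how sum-of-costs and makespan can be computed for a candidate MAPF solution, together with a check for vertex and swap conflicts. It is not the ASP compilation described in the article; the function names and the path representation (one list of grid cells per agent, one cell per time step) are assumptions made for illustration, and follow conflicts are not checked.

```python
from typing import List, Tuple

Cell = Tuple[int, int]
Path = List[Cell]


def position_at(path: Path, t: int) -> Cell:
    """Agents are assumed to stay at their last cell after the path ends."""
    return path[min(t, len(path) - 1)]


def cost(path: Path) -> int:
    """Moves and waits performed before the agent reaches its goal for good."""
    goal = path[-1]
    t = len(path) - 1
    # Trailing waits at the goal are not counted.
    while t > 0 and path[t - 1] == goal:
        t -= 1
    return t


def sum_of_costs(paths: List[Path]) -> int:
    return sum(cost(p) for p in paths)


def makespan(paths: List[Path]) -> int:
    return max(cost(p) for p in paths)


def find_conflicts(paths: List[Path]) -> List[Tuple[str, int, int, int]]:
    """Return (kind, agent_i, agent_j, time) for vertex and swap conflicts."""
    conflicts = []
    horizon = max(len(p) for p in paths) - 1
    for t in range(horizon + 1):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                a_now, b_now = position_at(paths[i], t), position_at(paths[j], t)
                if a_now == b_now:
                    conflicts.append(("vertex", i, j, t))
                if t < horizon:
                    a_next = position_at(paths[i], t + 1)
                    b_next = position_at(paths[j], t + 1)
                    # Two agents exchange cells across the same edge.
                    if a_next == b_now and b_next == a_now and a_now != a_next:
                        conflicts.append(("swap", i, j, t))
    return conflicts


# Agent 1 waits one step so that it never meets agent 0.
valid = [[(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 0), (0, 0)]]
# Two agents trying to exchange adjacent cells.
swapping = [[(0, 0), (0, 1)], [(0, 1), (0, 0)]]
```

In this sketch, a wait action counts toward the cost unless it happens after the agent has arrived at its goal for the last time, which matches the sum-of-costs definition quoted in the abstract.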
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are processed on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and regular complex event processing queries and adds ranking to them. The main technical result of the paper is an algorithm for efficiently enumerating all the solutions of formulae in increasing order of cost, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in its use of functional data structures, in particular, in extending functional Brodal queues to suit the ranked enumeration of MSO on words.
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. The most successful methods for video captioning have a strong dependency on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation that depends only on the video and a syntactic representation that depends only on Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested our proposed architecture, obtaining state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
Urban Complex Systems. A Workshop Satellite of the Conference on Complex Systems 2021. October 27, 2021
In response to the COVID-19 health crisis, many higher education institutions quickly moved to online education. As a result of that sudden switch, students faced unexpected difficulties, such as the lack of a good-quality internet connection, adequate equipment, and a good study environment. Additionally, several of them dealt with the effects of health and emotional situations faced by themselves or family members. Aware of those additional difficulties, some institutions promoted a flexible approach, suggesting that teachers increase communication with their students and make the necessary modifications to course evaluations and deadlines.
Teachers willing to approach their teaching in a more flexible manner need to make themselves aware of the needs of their students. In massive engineering courses, where student-teacher communication is usually difficult, gaining such awareness is particularly hard, as it requires students to initiate communication. Unfortunately, in remote online settings, which may exacerbate social isolation, students may be less inclined to communicate with their teachers.
This work-in-progress paper presents a case study in which we describe and evaluate a protocol designed to actively engage in communication with students either with lower-than-average academic performance or with missing/late assignments. Using soothing language, a member of the teaching staff contacts students (or replies to a request from a student), attempts to establish the causes of the low academic performance, and proposes specific actions to be taken in response to students’ needs. The protocol was implemented in an advanced programming course during the second term (Fall) of 2020, at a large school of engineering in Latin America. To evaluate the students’ perceptions of this approach, we collected data from several sources, including general-purpose student evaluations and questionnaires designed specifically to evaluate the perceptions of this approach. By analyzing different sources of data, we aimed to identify advantages and opportunities for improving and scaling this approach at a school level. A key contribution is that, even though our protocol was designed and implemented during the pandemic, it could also be implemented face-to-face or with online systems.
Counting the number of words of a certain length accepted by a non-deterministic finite automaton (NFA) is a fundamental problem, which has many applications in different areas such as graph databases, knowledge compilation, and information extraction. Along with this, generating such words uniformly at random is also a relevant problem, particularly in scenarios where returning varied outputs is a desirable feature.
The previous problems are formalized as follows. The input of #NFA is an NFA N and a length k given in unary (that is, given as a string 0^k), and then the task is to compute the number of strings of length k accepted by N. The input of GEN-NFA is the same as #NFA, but now the task is to generate uniformly, at random, a string accepted by N of length k.
It is known that #NFA is #P-complete, so an efficient algorithm to compute this function exactly is not expected to exist. However, this does not preclude the existence of an efficient approximation algorithm for it. In this talk, we will show that #NFA admits a fully polynomial-time randomized approximation scheme (FPRAS). Prior to our work, it was open whether #NFA admits an FPRAS; in fact, the best randomized approximation scheme known for #NFA ran in time n^O(log(n)).
Besides, we will mention some consequences and applications of our results. In particular, from well-known results on counting and uniform generation, we obtain that GEN-NFA admits a fully polynomial-time almost uniform generator. Moreover, as #NFA is SpanL-complete under polynomial-time parsimonious reductions, we obtain that every function in the complexity class SpanL admits an FPRAS.
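As an illustration of the #NFA problem defined above, the following Python sketch counts the strings of length k accepted by an NFA exactly, by tracking sets of reachable states. This is a baseline that can take exponential time in the worst case; it is not the FPRAS presented in the talk. The function name and the dictionary-based encoding of the transition relation are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def count_accepted(
    transitions: Dict[Tuple[str, str], Set[str]],
    initial: Set[str],
    accepting: Set[str],
    alphabet: Iterable[str],
    k: int,
) -> int:
    """Exact number of strings of length k accepted by the NFA."""
    symbols = list(alphabet)
    # counts[S] = number of strings read so far whose set of reachable states is S.
    counts: Dict[FrozenSet[str], int] = {frozenset(initial): 1}
    for _ in range(k):
        next_counts: Dict[FrozenSet[str], int] = {}
        for subset, n in counts.items():
            for sym in symbols:
                target = frozenset(
                    q2 for q in subset for q2 in transitions.get((q, sym), ())
                )
                if target:  # strings with no run left can never be accepted
                    next_counts[target] = next_counts.get(target, 0) + n
        counts = next_counts
    return sum(n for subset, n in counts.items() if subset & accepting)


# Example NFA over {0, 1}: accepts strings whose second-to-last symbol is 1.
delta = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},
    ("q1", "0"): {"q2"},
    ("q1", "1"): {"q2"},
}
n_len3 = count_accepted(delta, {"q0"}, {"q2"}, "01", 3)
```

Each string is counted once because every string determines a single set of reachable states, even when the NFA has many accepting runs for it. Summing over runs instead would overcount, which is one reason the problem is harder for NFAs than for deterministic automata.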
Multi-agent pathfinding (MAPF) is an NP-hard problem. As such, dense maps may be very hard to solve optimally. In such scenarios, compilation-based approaches, via Boolean satisfiability (SAT) and answer set programming (ASP), have proven to be most effective. In this paper, we propose a new encoding for MAPF, which we implement and solve using both ASP and MaxSAT solvers. Our encoding builds on a recent ASP encoding for MAPF but changes the way agent moves are encoded. This allows swap and follow conflicts to be represented with binary clauses, which are known to work well along with conflict-based clause learning. For MaxSAT, we study different ways in which we may combine the MSU3 and LSU algorithms for maximum performance. Our results, over grid and warehouse maps, show that the ASP solver scales better when the number of agents is increased on grids with few obstacles, while the MaxSAT solver performs better in scenarios with more obstacles and fewer agents.
We present a representation for trajectories of objects that move through space without any constraint. It combines an in-memory cached index based on compact data structures and a classic disk-based strategy. The first structure allows some loss of precision, which is refined with the second component. This approach reduces the number of accesses to disk. Compared with a classical index like the MVR-tree, this structure obtains competitive times in queries like time slice and kNN, and sharply outperforms it in time interval queries. In addition, it can solve other queries not supported by the MVR-tree. The space usage of our structure is 24 times smaller than that of the classical spatio-temporal index.
This tutorial serves as an introduction to deep learning approaches to build visual recommendation systems. Deep learning models can be used as feature extractors, and perform extremely well in visual recommender systems to create representations of visual items. This tutorial covers the foundations of convolutional neural networks and then how to use them to build state-of-the-art personalized recommendation systems. The tutorial is designed as a hands-on experience, focused on providing both theoretical knowledge as well as practical experience on the topics of the course.
The quality of the annotated data directly influences the success of supervised NLP models. However, creating annotated datasets is often time-consuming and expensive. Although the annotation tool plays an important role, we know little about how it influences annotation quality. We compare the quality of annotations for the task of chat-untangling made by non-expert annotators using two different tools. The first is SLATE, an existing command-line based tool, and the second is Parlay, a new tool we developed that integrates mouse interaction and visual links. Our experimental results indicate that, while both tools perform similarly in terms of annotation quality, Parlay offers a significantly better user experience.
The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe «stress» scenarios. In this work, we systematically evaluate three language models with adversarial examples — automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the biomedical named entity recognition (NER) task, one inspired by spelling errors and another based on the use of synonyms for medical terms. Our experiments with three benchmarks show that the performance of the original models decreases considerably, in addition to revealing their weaknesses and strengths. Finally, we show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases.
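The abstract above does not specify how the spelling-error stress test is built, so the following Python sketch is only one plausible way to generate such adversarial inputs for NER: it swaps two adjacent inner characters of each entity token and leaves the labels untouched. The function names, the BIO tag names, and the choice of perturbation are assumptions for illustration, not the authors' procedure.

```python
import random
from typing import List, Tuple


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters, keeping the first and last fixed."""
    if len(word) < 4:
        return word  # too short to perturb without touching the ends
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb_entities(
    tokens: List[str], tags: List[str], rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Misspell tokens that belong to an entity; tags stay aligned."""
    noisy = [
        swap_adjacent(tok, rng) if tag != "O" else tok
        for tok, tag in zip(tokens, tags)
    ]
    return noisy, list(tags)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
tokens = ["patient", "has", "diabetes", "mellitus"]
tags = ["O", "O", "B-Disease", "I-Disease"]
noisy_tokens, noisy_tags = perturb_entities(tokens, tags, rng)
```

A model can then be evaluated on the perturbed tokens against the unchanged tags, and the drop relative to the clean test set gives a rough robustness measure.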
Given historical versions of an RDF graph, we propose and compare several methods to predict whether or not the results of a SPARQL query will change for the next version. Unsurprisingly, we find that the best results for this task are achievable by considering the full history of results for the query over previous versions of the graph. However, given a previously unseen query, producing historical results requires costly offline maintenance of previous versions of the data, and costly online computation of the query results over these previous versions. This prompts us to explore more lightweight alternatives that rely on features computed from the query and statistical summaries of historical versions of the graph. We evaluate the quality of the predictions produced over weekly snapshots of Wikidata and daily snapshots of DBpedia. Our results provide insights into the trade-offs for predicting SPARQL query dynamics, where we find that a detailed history of changes for a query’s results enables much more accurate predictions, but has higher overhead versus more lightweight alternatives.
Although academic search engines are important tools for researchers, they typically support limited (if any) geographical features. In this demo we present a system that allows researchers to search for a specific Computer Science research topic and visualize on a map which affiliations have publications matching their search. The dataset is based on DBLP, using Entity Linking (OpenTapioca) over author affiliations to find geographic metadata for publications from Wikidata.
We present a new task in educational NLP: recommending the best interventions to help special needs education professionals work with students with different disabilities. We use the professionals’ observations of the students, together with the students’ diagnoses and other chosen interventions, to predict the best interventions for Chilean special needs students.
Dynamically-typed languages offer easy interaction with ad hoc data such as JSON and S-expressions; statically-typed languages offer powerful tools for working with structured data, notably algebraic datatypes, which are a core feature of typed languages both functional and otherwise. Gradual typing aims to reconcile dynamic and static typing smoothly. The gradual typing literature has extensively focused on the computational aspect of types, such as type safety, effects, noninterference, or parametricity, but the application of graduality to data structuring mechanisms has been much less explored. While row polymorphism and set-theoretic types have been studied in the context of gradual typing, algebraic datatypes in particular have not, which is surprising considering their wide use in practice. We develop, formalize, and prototype a novel approach to gradually structured data with algebraic datatypes. Gradually structured data bridges the gap between traditional algebraic datatypes and flexible data management mechanisms such as tagged data in dynamic languages, or polymorphic variants in OCaml. We illustrate the key ideas of gradual algebraic datatypes through the evolution of a small server application from dynamic to progressively more static checking, formalize a core functional language with gradually structured data, and establish its metatheory, including the gradual guarantees.
Static analysis tools typically address the problem of excessive false positives by requiring programmers to explicitly annotate their code. However, when faced with incomplete annotations, many analysis tools are either too conservative, yielding false positives, or too optimistic, resulting in unsound analysis results. In order to flexibly and soundly deal with partially-annotated programs, we propose to build upon and adapt the gradual typing approach to abstract-interpretation-based program analyses. Specifically, we focus on null-pointer analysis and demonstrate that a gradual null-pointer analysis hits a sweet spot, by gracefully applying static analysis where possible and relying on dynamic checks where necessary for soundness. In addition to formalizing a gradual null-pointer analysis for a core imperative language, we build a prototype using the Infer static analysis framework, and present preliminary evidence that the gradual null-pointer analysis reduces false positives compared to two existing null-pointer checkers for Infer. Further, we discuss ways in which the gradualization approach used to derive the gradual analysis from its static counterpart can be extended to support more domains. This work thus provides a basis for future analysis tools that can smoothly navigate the tradeoff between human effort and run-time overhead to reduce the number of reported false positives.
This is the fourth edition of the Workshop on Exploratory Search and Interactive Data Analytics (ESIDA). This series of workshops emerged as a response to the growing interest in developing new methods and systems that allow users to interactively explore large volumes of data, such as documents, multimedia, or specialized collections, such as biomedical datasets. There are various approaches to supporting users in this interactive environment, ranging from developing new algorithms through visualization methods to analyzing users’ search patterns. The overarching goal of ESIDA is to bring together researchers working in areas that span across multiple facets of exploratory search and data analytics to discuss and outline research challenges for this novel area.
A number of interfaces have been proposed in recent years to help users build SPARQL queries, including textual editors with syntax highlighting and error correction, and visual editors that allow for drawing graph patterns using node and edge components. A common feature supported by such systems is autocompletion, which offers users suggestions for terms to insert into a query, potentially restricted by a keyword prefix. However, current systems either return irrelevant terms that will generate empty results, or return relevant terms but may time out while generating suggestions for complex queries. We propose an autocompletion technique based on a graph summary that aims to strike a balance by over-approximating relevant results in an efficient manner.
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
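The kind of recursion described in the abstract above, a loop whose number of steps is bounded by the matrix dimension, can be illustrated with transitive closure. The sketch below is plain Python over 0/1 adjacency matrices, not MATLANG syntax; the function name and the list-of-lists representation are assumptions for this example.

```python
from typing import List

Matrix = List[List[int]]


def transitive_closure(adj: Matrix) -> Matrix:
    """Reachability by paths of length >= 1, using at most n iterations."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for _ in range(n):  # the loop bound is the matrix dimension
        # Boolean matrix product: extend every known path by one more edge.
        step = [
            [1 if any(reach[i][k] and adj[k][j] for k in range(n)) else 0
             for j in range(n)]
            for i in range(n)
        ]
        # Keep what was already reachable and add the new pairs.
        reach = [
            [1 if (reach[i][j] or step[i][j]) else 0 for j in range(n)]
            for i in range(n)
        ]
    return reach


# A directed chain 0 -> 1 -> 2 -> 3.
chain = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
closure = transitive_closure(chain)
```

Since any simple path in a graph with n vertices has fewer than n edges, n iterations are enough for the result to stabilize, which is why a dimension-bounded loop suffices here.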
In recent years, instructional design has become even more challenging for teaching staff members in higher education institutions. If instructional design causes student overload, it could lead to superficial learning and decreased student well-being. A strategy to avoid overload is reflecting upon the effectiveness of teaching practices in terms of time-on-task. This article presents work in progress aimed at providing teachers with a dashboard to visualize student self-reports of time-on-task regarding subject activities. A questionnaire was administered to 15 instructors during a set trial period to evaluate the perceived usability and usefulness of the dashboard. Preliminary findings reveal that the dashboard helped instructors become aware of the number of hours students spent outside of class time. Furthermore, data visualizations of time-on-task evidence enabled them to redesign subject activities. Currently, the dashboard has been adopted by 106 engineering instructors. Future work involves the development of a framework to incorporate user-based improvements.
Valuable and timely information about crisis situations, such as natural disasters, can be rapidly obtained from user-generated content in social media. This has created an emergent research field that has focused mostly on the problem of filtering and classifying potentially relevant messages during emergency situations. However, we believe important insight can be gained from studying online communications during disasters at a more comprehensive level. In this sense, a higher-level analysis could allow us to understand if there are collective patterns associated with certain characteristics of events. Following this motivation, we present a novel comparative analysis of 41 real-world crisis events. This analysis is based on textual and linguistic features of social media messages shared during these crises. For our comparison we considered hazard categories (i.e., human-induced and natural crises) as well as subcategories (i.e., intentional, accidental and so forth). Among other things, our results show that using only a small set of textual features, we can differentiate among types of events with 75% accuracy. This indicates that there are clear patterns in how people react to different extreme situations, depending on, for example, whether the event was triggered by natural causes or by human action. These findings have implications from a crisis response perspective, as they will allow experts to foresee patterns in emerging situations, even if there is no prior experience with an event of such characteristics.
We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation. We use attention models to connect information from both the user instructions and a topological representation of the environment. We evaluate our model’s performance on a new dataset containing 10,050 pairs of navigation instructions. Our model significantly outperforms baseline approaches. Furthermore, our results suggest that it is possible to leverage the environment map as a relevant knowledge base to facilitate the translation of free-form navigational instruction.
In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
Many of the interactions between users on social networks are controversial, especially in polarized environments. In effect, rather than producing a space for deliberation, these environments foster the emergence of users that disqualify the position of others. On news sites, comments on the news are characterized by such interactions. This is detrimental to the construction of a deliberative and democratic climate, stressing the need for automatic tools that can provide an early detection of polarization and controversy. We introduce GENE (graph generation conditioned on named entities), a representation of user networks conditioned on the named entities (personalities, brands, organizations) which users comment upon. GENE models the leaning that each user has concerning entities mentioned in the news. GENE graphs are able to segment the user network according to polarity. Using the segmented network, we study the performance of two controversy indices, the existing Random Walks Controversy (RWC) and another one we introduce, Relative Closeness Controversy (RCC). These indices measure the interaction between the network’s poles, providing a metric to quantify the emergence of controversy. To evaluate the performance of GENE, we model the network of users of a popular news site in Chile, collecting data in an observation window of more than three years. A large-scale evaluation using GENE, on thousands of news articles, allows us to conclude that over 60% of user comments have a predictable polarity. This predictability of the user interaction scenario allows both controversy indices to detect a controversy successfully. In particular, our introduced RCC index shows satisfactory performance in the early detection of controversies using partial information collected during the first hours of the news event, with a sensitivity to the target class exceeding 90%.
Works on knowledge graphs and graph-based data management often focus either on graph query languages or on frameworks for graph analytics, where there has been little work in trying to combine both approaches. However, many real-world tasks conceptually involve combinations of these approaches: a graph query can be used to select the appropriate data, which is then enriched with analytics, and then possibly filtered or combined again with other data by means of a query language. In this paper we propose a language that is well-suited for both graph querying and analytical tasks. We propose a minimalistic extension of SPARQL to allow for expressing analytical tasks over existing SPARQL infrastructure; in particular, we propose to extend SPARQL with recursive features, and provide a formal syntax and semantics for our language. We show that this language can express key analytical tasks on graphs (in fact, it is Turing complete). Moreover, queries in this language can also be compiled into sequences of iterations of SPARQL update statements. We show how procedures in our language can be implemented over off-the-shelf SPARQL engines, with a specialised client that can leverage database operations to improve the performance of queries. Results for our implementation show that procedures for popular analytics currently run in seconds or minutes for selective sub-graphs (our target use-case).
Abstract. We investigate global measures of vertex similarity for knowledge graphs. While vertex similarity has been explored in the context of directed, unlabelled graphs, measures based on recursive algorithms or learning frameworks can be costly to compute, assume labelled data, and/or provide poorly-interpretable results. Knowledge graphs further imply unique challenges for vertex similarity in terms of scale and diversity. We thus propose and explore global measures of vertex similarity for knowledge graphs that (i) are unsupervised; (ii) offer explanations of similarity results; (iii) take into consideration edge labels; and (iv) are robust in terms of redundant or interdependent information. Given that these measures can still be costly to compute precisely, we propose an approximation strategy that enables computation at scale. We compare our measures with a recursive measure (SimRank) for computing vertex similarity over subsets of Wikidata.
Abstract. We explore solutions for representing archives of versioned RDF data using the SPARQL standard and off-the-shelf engines. We consider six representations of RDF archives based on named graphs, and describe how input queries can be automatically rewritten to return solutions for a particular version, or solutions that change between versions. We evaluate these alternatives over an archive of 8 weekly versions of Wikidata and 146 queries using Virtuoso as the SPARQL engine.
Abstract. Given a Wikidata claim, we explore automated methods for locating references that support that claim. Our goal is to assist human editors in referencing claims, and thus increase the ratio of referenced claims in Wikidata. As an initial approach, we mine links from the references section of English Wikipedia articles, download and index their content, and use standard relevance-based measures to find supporting documents. We consider various forms of search phrasings, as well as different scopes of search. We evaluate our methods in terms of the coverage of reference documents collected from Wikipedia. We also develop a gold standard of sample items for evaluating the relevance of suggestions. Our results in general reveal that the coverage of Wikipedia reference documents for claims is quite low, but where a reference document is available, we can often suggest it within the first few results.
We propose laconic classification as a novel way to understand and compare the performance of diverse image classifiers. The goal in this setting is to minimise the amount of information (aka. entropy) required in individual test images to maintain correct classification. Given a classifier and a test image, we compute an approximate minimal-entropy positive image for which the classifier provides a correct classification, becoming incorrect upon any further reduction. The notion of entropy offers a unifying metric that allows us to combine and compare the effects of various types of reductions (e.g., crop, colour reduction, resolution reduction) on classification performance, in turn generalising similar methods explored in previous works. Proposing two complementary frameworks for computing the minimal-entropy positive images of both human and machine classifiers, in experiments over the ILSVRC test-set, we find that machine classifiers are more sensitive entropy-wise to reduced resolution (versus cropping or reduced colour for machines, as well as reduced resolution for humans), supporting recent results suggesting a texture bias in the ILSVRC-trained models used. We also find, in the evaluated setting, that humans classify the minimal-entropy positive images of machine models with higher precision than machines classify those of humans.
In these lecture notes, we provide an overview of some of the high-level research directions and open questions relating to knowledge graphs. We discuss six high-level concepts relating to knowledge graphs: data models, queries, ontologies, rules, embeddings and graph neural networks. While traditionally these concepts have been explored by different communities in the context of graphs, more recent works have begun to look at how they relate to one another, and how they can be unified. In fact, at a more foundational level, we can find some surprising relations between the different concepts. The research questions we explore mostly involve combinations of these concepts.
In this chapter, we provide a detailed primer on the second version of the Web Ontology Language (OWL 2) standard. We first motivate the need for such a standard, discussing the role and importance of ontologies on the Web. We then describe how ontology languages, which themselves can be formally defined through model theory, can subsequently be used to formally define ontologies. Thereafter we discuss the OWL vocabulary used to define the semantics of classes, properties, individuals, and datatypes within ontologies. We cover some of the main reasoning tasks for ontologies and the applications in which they are used. We discuss how these core reasoning tasks are undecidable for the full OWL (2) language and outline the sub-languages (aka. profiles) proposed by the standard that allow for more efficient reasoning procedures. We conclude by reflecting on the importance of having expressive ontologies on the Web of Data, and discuss open challenges.
This book concisely brings together the key standards and best practices relating to modelling, querying, validating and linking machine-readable data and semantics on the Web. Alongside practical examples and formal definitions, the book shows how these standards contribute to – and have been used thus far on – the “Web of Data”: a machine-readable evolution of the Web marked by increased automation, enabling powerful Web applications capable of discovering, cross-referencing, and organising data from numerous websites in a matter of seconds. The book is divided into nine chapters, the first of which highlights the fundamental shortcomings of the current Web that illustrate the need for increased machine readability. The next chapter outlines the core concepts of the “Web of Data”, discussing use-cases on the Web where they have already been deployed. “Resource Description Framework (RDF)” describes the graph-structured data model proposed by the Semantic Web community as a common data model for the Web. The chapter on “RDF Schema (RDFS) and Semantics” presents a lightweight ontology language used to define an initial semantics for RDF graphs. In turn, the chapter “Web Ontology Language (OWL)” elaborates on a much more expressive ontology language built upon RDFS. In “SPARQL Query Language” a language for querying and updating RDF graphs is described. “Shape Constraints and Expressions (SHACL/ShEx)” introduces two languages for describing the expected structure of – and expressing constraints over – RDF graphs for the purposes of validation. “Linked Data” discusses the principles and best practices by which interlinked (RDF) data can be published on the Web, and how they have been adopted. The final chapter highlights open problems and concludes with a general discussion on the future of the Web of Data.
The book is intended for students, researchers and advanced practitioners interested in learning more about the Web of Data, and about closely related topics such as the Semantic Web, Knowledge Graphs, Linked Data, Graph Databases, Ontologies, etc. Offering a range of accessible examples and exercises, it can be used as a textbook for students and other newcomers to the field. It can also serve as a reference handbook for researchers and developers, as it offers up-to-date details on key standards (RDF, RDFS, OWL, SPARQL, SHACL, ShEx, RDB2RDF, LDP), along with formal definitions and references to further literature. The associated website webofdatabook.org offers a wealth of complementary material, including solutions to the exercises, slides for classes, interactive examples, and a section for comments and questions.
This chapter provides a detailed introduction to the SPARQL Protocol and RDF Query Language (SPARQL 1.1): the standard query language for RDF. After some initial motivation, we delve into the features of the query language, illustrated with concrete examples. We then formally define the semantics of these query features. We next discuss how federated queries can be used to evaluate queries over multiple remote sources on the Web. We detail the SPARQL Update language, which allows for modifying the data indexed by a SPARQL query service. We introduce SPARQL Entailment Profiles, which allow for query results to consider entailments, including support for RDF, RDFS and OWL semantics. We further discuss the HTTP-based protocol by which requests can be issued to a SPARQL service over the Web, as well as the SPARQL Service Description vocabulary, which can be used to describe and advertise the features supported by such services. We conclude by discussing the importance of SPARQL for the Web of Data, the key research directions that are currently being explored, as well as open challenges.
In this chapter, we introduce two languages for describing shapes and constraints for RDF graphs, namely the Shapes Constraint Language (SHACL) and the Shape Expressions Language (ShEx 2.1). Both languages allow for defining constraints over RDF graphs in terms of what data are expected, what data are obligatory, what data are allowed, and what data are disallowed. This in turn allows RDF graphs to be validated with respect to the specified constraints. We first look at SHACL, describing the SHACL-Core fragment and the constraints it allows. We then discuss how SHACL-SPARQL allows for further constraints to be expressed using SPARQL query syntax. Turning to ShEx, we describe its syntaxes, and how it differs from SHACL. We outline and provide a semantics for an abstract shapes syntax that generalises SHACL and ShEx. We conclude with a general discussion of the role of shapes languages on the Web of Data, as well as open challenges.
This chapter provides a detailed primer for the Resource Description Framework (RDF 1.1) standard, proposed as a common data model for publishing and exchanging structured data on the Web. We first motivate the need for a data model like RDF. We then describe the types of terms used in RDF: the basic building blocks of the framework. We discuss how these terms can be combined to make coherent statements in the form of RDF triples, and how triples form graphs and datasets. Thereafter we discuss the RDF vocabulary: a built-in set of terms used for modeling more complex data, such as complex relations and ordered lists. Finally, we give an overview of the different syntaxes by which RDF can be serialized and communicated.
This chapter presents an in-depth primer for the RDF Schema (RDFS 1.1) standard, which is primarily used to define a lightweight semantics for the classes and properties used in RDF graphs. After an initial motivation and overview, we discuss the RDFS vocabulary, and how it can be used to define sub-classes, sub-properties, domain and ranges, amongst other types of definitions. We then describe in detail how the semantics of RDF(S) can be formalised in a model-theoretic way, discussing key concepts such as interpretations, models, satisfiability and entailment. We introduce various semantics for RDF(S), including the simple semantics, D semantics, RDF semantics, and the RDFS semantics. We conclude the chapter by discussing how rules can be used to support entailment under such semantics.
This chapter motivates, introduces and describes Linked Data, which centres around a concise set of principles by which data can be published and interlinked on the Web, and by which a Web of Data can ultimately be formed. We first discuss the core Linked Data principles, which espouse the use of HTTP IRIs to identify the entities described in data, returning a machine-readable description of the entity (typically RDF) when its corresponding IRI is looked up on the Web. We then discuss some further best practices for publishing data conformant with the Linked Data principles in a way that enhances interoperability. We discuss the Linking Open Data (LOD) project founded on the idea of publishing Open Data on the Web in a standard, machine-readable fashion using Linked Data; we describe the most prominent datasets and vocabularies that have resulted from this initiative. We then discuss tools and techniques for converting legacy data to RDF, discovering links, and hosting Linked Data. We subsequently discuss the Linked Data Platform: a standard that outlines the protocols and resources needed to build a new generation of read–write Linked Data applications. We conclude the chapter with a discussion of open challenges yet to be addressed in the context of Linked Data.
Abstract: In this work, we focus on the Visual Question Answering (VQA) task, where a model must answer a question based on an image, and the VQA-Explanations task, where an explanation is produced to support the answer. We introduce an interpretable model capable of pointing out and consuming information from a novel Knowledge Base (KB) composed of real-world relationships between objects, along with labels mined from available region descriptions and object annotations. Furthermore, this model provides visual and textual explanations to complement the KB visualization. The use of a KB brings two important consequences: enhanced predictions and improved interpretability. We achieve this by introducing a mechanism that can extract relevant information from this KB, and can point out the relations better suited for predicting the answer. A supervised attention map is generated over the KB to select the relevant relationships from it for each question-image pair. Moreover, we add image attention supervision on the explanations module to generate better visual and textual explanations. We quantitatively show that the predicted answers improve when using the KB; similarly, explanations improve with this and when adding image attention supervision. Also, we qualitatively show that the KB attention helps to improve interpretability and enhance explanations. Overall, the results support the benefits of having multiple tasks to enhance the interpretability and performance of the model.
The Entity Linking (EL) task involves linking mentions of entities in a text with their identifier in a Knowledge Base (KB) such as Wikipedia, BabelNet, DBpedia, Freebase, Wikidata, YAGO, etc. Numerous techniques have been proposed to address this task down through the years. However, not all works adopt the same convention regarding the entities that the EL task should target; for example, while some EL works target common entities like “interview” appearing in the KB, others only target named entities like “Michael Jackson”. The lack of consensus on this issue (and others) complicates research on the EL task; for example, how can the performance of EL systems be evaluated and compared when systems may target different types of entities? In this work, we first design a questionnaire to understand what kinds of mentions and links the EL research community believes should be targeted by the task. Based on these results we propose a fine-grained categorization scheme for EL that distinguishes different types of mentions and links. We propose a vocabulary extension that allows such categories to be expressed in EL benchmark datasets. We then relabel (subsets of) three popular EL datasets according to our novel categorization scheme, where we additionally discuss a tool used to semi-automate the labeling process. We next present the performance results of five EL systems for individual categories. We further extend EL systems with Word Sense Disambiguation and Coreference Resolution components, creating initial versions of what we call Fine-Grained Entity Linking (FEL) systems, measuring the impact on performance per category.
Finally, we propose a configurable performance measure based on fuzzy sets that can be adapted for different application scenarios. Our results highlight a lack of consensus on the goals of the EL task, show that the evaluated systems do indeed target different entities, and further reveal some open challenges for the (F)EL task regarding more complex forms of reference for entities.
Abstract: This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable. Our method can be used in conjunction with many networks; in particular, we study its application to the widely known MAC architecture, obtaining a significant reduction in the number of recurrent steps needed to achieve similar accuracies, therefore improving its performance-to-computation ratio. Furthermore, we show that by increasing the maximum number of steps used, we surpass the accuracy of even our best non-adaptive MAC in the CLEVR dataset, demonstrating that our approach is able to control the number of steps without significant loss of performance. Additional advantages provided by our approach include considerably improving interpretability by discarding useless steps and providing more insights into the underlying reasoning process. Finally, we present adaptive computation as an equivalent to an ensemble of models, similar to a mixture-of-experts formulation. Both the code and the configuration files for our experiments are made available to support further research in this area.
This volume is dedicated to invited papers from the 22nd edition of the International Conference on Database Theory (ICDT 2019), held in Lisbon, Portugal, on March 26–29, 2019. The ICDT conference is one of the leading venues in database theory and foundations of data management. This is an exciting area at the core of efficient and effective data management, with further connections to knowledge representation, relational statistical learning, logical aspects of computation, verification, and, of course, practical aspects of data processing.
Based on the results of the reviewing process and oral presentations given at the conference, a group of program committee members from ICDT 2019 selected five articles from the conference to be invited to this special issue. These are among the finest contributions from ICDT 2019. The invited papers were further reviewed according to the journal’s rigorous peer-review standards.
The five selected papers deal with different, yet timely problems in the area of foundations of data management. In particular:
The paper “Characterizing Tractability of Simple Well-designed Pattern Trees with Projection”, by Stefan Mengel and Sebastian Skritek, deals with the challenging problem of characterizing which well-designed pattern trees can be evaluated efficiently. Such well-designed pattern trees lie at the core of modern languages for semantic web and graph databases.
The paper “Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees”, by Yuliang Li, Jianguo Wang, Benjamin Pullman, Nuno Bandeira, and Yannis Papakonstantinou, investigates algorithmic issues related to cosine similarity queries over vector databases. This problem is of practical importance, arising in important applications such as document retrieval, recommender systems, and mass spectrometry. The paper deals with the efficient evaluation of such queries and provides novel optimality guarantees.
The paper “Semi-Oblivious Chase Termination: The Sticky Case”, by Marco Calautti and Andreas Pieris, studies termination of the chase, a fundamental algorithmic tool in database theory with several applications. The paper studies when the chase terminates regardless of the input data. While in general this problem is undecidable, the authors provide an elegant analysis showing that it can be solved in elementary time for the class of sticky rules, a prominent paradigm for obtaining decidability of rule-based reasoning tasks.
The paper “Consistent Query Answering for Primary Keys in Datalog”, by Paraschos Koutris and Jef Wijsen, considers the classical problem of evaluating consistent answers to conjunctive queries over databases that may violate primary key constraints. The authors show the surprising result that for any self-join free conjunctive query for which this problem can be solved in polynomial time, it can also be solved in LOGSPACE.
Finally, the paper “On the expressive power of linear algebra on graphs”, by Floris Geerts, studies the expressive power of Matlang, a recently proposed query language for specifying both relational and linear algebra properties over matrices. This beautiful paper characterizes when Matlang can distinguish between two graphs represented by their adjacency matrices, in terms of both spectral and combinatorial properties.
Abstract: We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.
Abstract: CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models, consisting of using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function that induces, on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude fewer parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches, in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.
GPS-enabled devices and social media popularity have created an unprecedented opportunity for researchers to collect, explore, and analyze text data with fine-grained spatial and temporal metadata. In this sense, text, time and space are different domains with their own representation scales and methods. This poses a challenge on how to detect relevant patterns that may only arise from the combination of text with spatio-temporal elements. In particular, spatio-temporal textual data representation has relied on feature embedding techniques. This can limit a model’s expressiveness for representing certain patterns extracted from the sequence structure of textual data. To deal with the aforementioned problems, we propose an Acceptor recurrent neural network model that jointly models spatio-temporal textual data. Our goal is to focus on representing the mutual influence and relationships that can exist between written language and the time-and-place where it was produced. We represent space, time, and text as tuples, and use pairs of elements to predict a third one. This results in three predictive tasks that are trained simultaneously. We conduct experiments on two social media datasets and on a crime dataset; we use Mean Reciprocal Rank as the evaluation metric. Our experiments show that our model outperforms state-of-the-art methods, with improvements ranging from 5.5% to 24.7% for location and time prediction.
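For readers unfamiliar with the metric, Mean Reciprocal Rank averages, over all test cases, the reciprocal of the position at which the correct answer appears in a ranked list of candidates. The sketch below shows the standard definition only; the function name `mean_reciprocal_rank` and the convention of scoring 0 when the target is absent from the list are assumptions for illustration, not details taken from the paper.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct item over all queries.

    `rankings[i]` is the ranked candidate list for query i (best first) and
    `targets[i]` is its correct answer. A query whose target does not appear
    in its ranking contributes 0 (an assumed convention).
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")

    total = 0.0
    for ranked, target in zip(rankings, targets):
        for position, candidate in enumerate(ranked, start=1):
            if candidate == target:
                total += 1.0 / position
                break  # only the first occurrence counts
    return total / len(rankings)


if __name__ == "__main__":
    # Targets found at rank 1, at rank 2, and not found at all.
    rankings = [["x", "y", "z"], ["y", "x", "z"], ["z", "y", "x"]]
    targets = ["x", "x", "q"]
    print(mean_reciprocal_rank(rankings, targets))
```

In the usage example the three queries contribute 1, 1/2 and 0 respectively, so the mean is 0.5.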
Societies and industries are rapidly changing due to the adoption of artificial intelligence (AI) and will face deep transformations in upcoming years. In this scenario, it becomes critical for under-represented communities in technology, in particular developing regions like Latin America, to foster initiatives that are committed to developing tools for the local adoption of AI. Latin America, like many non-English speaking regions, faces several problems for the adoption of AI technology, including the lack of diverse and representative resources for automated learning tasks. A highly problematic area in this regard is natural language processing (NLP), which is strongly dependent on labeled datasets for learning. However, most state-of-the-art NLP resources are allocated to English. Therefore, creating efficient NLP tools for diverse languages requires an important investment of time and financial resources. To deal with such issues, our group has worked toward creating language-agnostic approaches as well as adapting and improving existing NLP techniques to local problems. In addition, we have focused on producing new, publicly available, state-of-the-art NLP data and models in Spanish. Next, we briefly present some of them.
Abstract: More than two decades have passed since the establishment of the initial cornerstones of the Semantic Web. Since its inception, opinions have remained divided regarding the past, present and potential future impact of the Semantic Web. In this paper – and in light of the results of over two decades of development on both the Semantic Web and related technologies – we reflect on the current status of the Semantic Web, the impact it has had thus far, and future challenges. We first review some of the external criticism of this vision that has been put forward by various authors; we draw together the individual critiques, arguing both for and against each point based on the current state of adoption. We then present the results of a questionnaire that we have posed to the Semantic Web mailing list in order to understand respondents’ perspective(s) regarding the degree to which the original Semantic Web vision has been realised, the impact it can potentially have on the Web (and other settings), its success stories thus far, as well as the degree to which they agree with the aforementioned critiques of the Semantic Web in terms of both its current state and future feasibility. We conclude by reflecting on future challenges and opportunities in the area.
The intricate details of how proteins bind to proteins, DNA, and RNA are crucial for the understanding of almost all biological processes. Disease-causing sequence variants often affect binding residues. Here, we described a new, comprehensive system of in silico methods that take only protein sequence as input to predict binding of protein to DNA, RNA, and other proteins. Firstly, we needed to develop several new methods to predict whether or not proteins bind (per-protein prediction). Secondly, we developed independent methods that predict which residues bind (per-residue). Not requiring three-dimensional information, the system can predict the actual binding residue. The system combined homology-based inference with machine learning and motif-based profile-kernel approaches with word-based (ProtVec) solutions to machine learning protein level predictions. This achieved an overall non-exclusive three-state accuracy of 77% ± 1% (±one standard error) corresponding to a 1.8 fold improvement over random (best classification for protein–protein with F1 = 91 ± 0.8%). Standard neural networks for per-residue binding residue predictions appeared best for DNA-binding (Q2 = 81 ± 0.9%) followed by RNA-binding (Q2 = 80 ± 1%) and worst for protein–protein binding (Q2 = 69 ± 0.8%).
We introduce the Automatic Learning for the Rapid Classification of Events (ALeRCE) broker, an astronomical alert broker designed to provide a rapid and self-consistent classification of large etendue telescope alert streams, such as that provided by the Zwicky Transient Facility (ZTF) and, in the future, the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST). ALeRCE is a Chilean-led broker run by an interdisciplinary team of astronomers and engineers, working to become intermediaries between survey and follow-up facilities. ALeRCE uses a pipeline which includes the real-time ingestion, aggregation, cross-matching, machine learning (ML) classification, and visualization of the ZTF alert stream. We use two classifiers: a stamp-based classifier, designed for rapid classification, and a light-curve-based classifier, which uses the multi-band flux evolution to achieve a more refined classification. We describe in detail our pipeline, data products, tools and services, which are made public for the community (see https://alerce.science). Since we began operating our real-time ML classification of the ZTF alert stream in early 2019, we have grown a large community of active users around the globe. We describe our results to date, including the real-time processing of 9.7×10^7 alerts, the stamp classification of 1.9×10^7 objects, the light curve classification of 8.5×10^5 objects, the report of 3088 supernova candidates, and different experiments using LSST-like alert streams. Finally, we discuss the challenges ahead to go from a single stream of alerts such as ZTF to a multi-stream ecosystem dominated by LSST.
Voice break, as a landmark of advanced male puberty in genome-wide association studies (GWAS), has revealed that pubertal timing is a highly polygenic trait. Although voice break is easily recorded in large cohorts, it holds quite low precision as a marker of puberty. In contrast, gonadarche and pubarche are early and clinically well-defined measures of puberty onset.
This study examines the articulation of public opinion about so-called fake news using a national survey (N = 510) of U.S. adults conducted in 2018. We coded respondents’ open-ended answers about what is “fake news” and found that while some respondents adopted a politically neutral, descriptive definition, others provided a partisan, accusatory answer. Specifically, the weaponization of fake news was evident in the way respondents used the term to blame adversarial political and media targets. Perceptions of fake news prevalence, partisanship strength, and political interest were associated with a higher likelihood of providing a politicized and accusatory response about fake news. Accusations were polarized as a function of partisan identity and positively correlated with affective polarization. Results are discussed in light of the linguistic distinction of the term and what it means in the context of news media distrust and polarization.
In light of concerns about decreasing news use, a decline in interest in political news or even active avoidance or resistance of news in general, the idea of ‘incidental news’ has been seen as a possible remedy. Generally, ‘incidental news’ refers to the ways in which people encounter information about current events through media when they were not actively seeking the news. However, scholars studying incidental news through different theoretical and methodological perspectives have been arriving at differing evaluations of the significance and implications of this phenomenon – to the extent of downright contradictory findings. This introductory piece posits the aim of this special issue on Studying Incidental News: a conceptual clarification of incidental news exposure. In this issue, scholars coming from different approaches, ranging from cognitive processing, ecological models, emergent practices and a focus on platform affordances, show how different theoretical perspectives help account for various dimensions of incidental news consumption, and thus help explain the often conflicting findings that have been suggested so far.
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
RDF databases and graph databases are two approaches to data management which are based on modeling, storing and querying data following a graph structure. RDF databases are based on a single graph data model which allows describing Web resources in terms of their relations and attributes. On the other hand, most graph databases are based on the property graph data model, a type of graph where nodes and edges can contain properties represented as key-value pairs. This paper presents two methods for transforming RDF data into property graphs. The first method defines schema and data transformations, as it assumes the existence of an RDF schema. The second method is schema-independent, so it can transform any kind of RDF dataset. Both methods are useful for storing RDF data in a Graph Data Management System.
RDF triplestores and property graph databases are two approaches for data management which are based on modeling, storing and querying graph-like data. In spite of such common principle, they present special features that complicate the task of database interoperability. While there exist some methods to transform RDF graphs into property graphs, and vice versa, they lack compatibility and a solid formal foundation. This paper presents three direct mappings (schema-dependent and schema-independent) for transforming an RDF database into a property graph database, including data and schema. We show that two of the proposed mappings satisfy the properties of semantics preservation and information preservation. The existence of both mappings allows us to conclude that the property graph data model subsumes the information capacity of the RDF data model.
RDF and Property Graphs are data models that are being used to represent Knowledge Graphs. The definition of methods to transform RDF data into Property graph data is fundamental to allow interoperability among the systems using these models. Although both models are based on a graph structure, they have special features that complicate the definition of data transformation methods. This article presents an ontology-based approach to transform (automatically) property graphs into RDF graphs. The ontology, called PGO, defines a set of terms that allows describing the elements of a property graph. The algorithm corresponding to the transformation method is described, and some properties of the method are discussed (complexity, data preservation, and monotonicity). The results of an experimental evaluation are also presented.
In the field of protein engineering and biotechnology, the discovery and characterization of structural patterns is highly relevant as these patterns can give fundamental insights into protein-ligand interaction and protein function. This paper presents GSP4PDB, a bioinformatics web tool that enables the user to visualize, search and explore protein-ligand structural patterns within the entire Protein Data Bank.
In this paper we survey our recent results characterizing various graph neural network (GNN) architectures in terms of their ability to classify nodes over graphs, for classifiers based on unary logical formulas, or queries. We focus on the language FOC2, a well-studied fragment of FO. This choice is motivated by the fact that FOC2 is related to the Weisfeiler-Lehman (WL) test for checking graph isomorphism, which has the same ability as GNNs for distinguishing nodes on graphs. We unveil the exact relationship between FOC2 and GNNs in terms of node classification. To tackle this problem, we start by studying a popular basic class of GNNs, which we call AC-GNNs, in which the features of each node in a graph are updated, in successive layers, according only to the features of its neighbors. We prove that the unary FOC2 formulas that can be captured by an AC-GNN are exactly those that can be expressed in its guarded fragment, which in turn corresponds to graded modal logic. This result implies in particular that AC-GNNs are too weak to capture all FOC2 formulas. We then ask what needs to be added to AC-GNNs to capture all of FOC2. We show that it suffices to add readout layers, which allow updating the features of each node not only in terms of its neighbors, but also in terms of a global attribute vector. We call GNNs with readouts ACR-GNNs. We also describe experiments that validate our findings by showing that, on synthetic data conforming to FOC2 but not to graded modal logic, AC-GNNs struggle to fit the data while ACR-GNNs can generalise even to graphs of sizes not seen during training.
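The difference between updating a node from its neighbors only and updating it with an additional global readout can be illustrated with a minimal sketch. It assumes scalar node features, sum aggregation, and a truncated ReLU activation; the function `layer` and the weights `a`, `b`, `r` and `bias` are hypothetical names chosen for illustration, not the definitions used in the surveyed papers.

```python
def trunc_relu(x):
    """Clamp x to the interval [0, 1]."""
    return min(max(x, 0.0), 1.0)


def layer(features, adj, a, b, r=0.0, bias=0.0):
    """One message-passing layer over scalar node features.

    features: dict node -> float
    adj:      dict node -> list of neighbor nodes
    a, b:     weights for the node's own feature and the sum over its neighbors
    r:        weight for the global readout (sum over all nodes);
              r = 0 gives a neighbors-only (AC-style) update,
              r != 0 adds a readout (ACR-style) term
    """
    readout = sum(features.values())
    return {
        v: trunc_relu(
            a * features[v]
            + b * sum(features[u] for u in adj[v])
            + r * readout
            + bias
        )
        for v in features
    }


# Nodes 0 and 1 are connected; node 2 is isolated and is the only "red" node.
adj = {0: [1], 1: [0], 2: []}
red = {0: 0.0, 1: 0.0, 2: 1.0}

ac_out = layer(red, adj, a=1.0, b=1.0)          # neighbors only
acr_out = layer(red, adj, a=1.0, b=1.0, r=1.0)  # neighbors plus readout
```

In this toy graph the neighbors-only update leaves node 0 at 0, because no information from the isolated red node can reach it, whereas the update with a readout sets node 0 to 1. This mirrors, in a very simplified form, why a global component is needed for properties such as "some red node exists somewhere in the graph".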
We study the expressive power of the LARA language (a recently proposed unified model for expressing relational and linear algebra operations) both in terms of traditional database query languages and some analytic tasks often performed in machine learning pipelines. We start by showing LARA to be expressively complete with respect to first-order logic with aggregation. Since LARA is parameterized by a set of user-defined functions which allow transforming values in tables, the exact expressive power of the language depends on how these functions are defined. We distinguish two main cases depending on the level of genericity that queries are required to satisfy. Under strong genericity assumptions the language cannot express matrix convolution, a very important operation in current machine learning pipelines. This language is also local, and thus cannot express operations such as matrix inverse that exhibit a recursive behavior. For expressing convolution, one can relax the genericity requirement by adding an underlying linear order on the domain. This, however, destroys locality and makes the expressive power of the language much more difficult to understand. In particular, although under complexity assumptions the resulting language still cannot express matrix inverse, a proof of this fact without such assumptions seems challenging to obtain.
We consider the problem of exact probabilistic inference for Union of Conjunctive Queries (UCQs) on tuple-independent databases. For this problem, two approaches currently coexist. In the extensional method, query evaluation is performed by exploiting the structure of the query, and relies heavily on the use of the inclusion–exclusion principle. In the intensional method, one first builds a representation of the lineage of the query in a tractable formalism of knowledge compilation. The chosen formalism should then ensure that the probability can be efficiently computed using simple disjointness and independence assumptions, without the need of performing inclusion–exclusion. The extensional approach has long been thought to be strictly more powerful than the intensional approach, the reason being that for some queries, the use of inclusion–exclusion seemed unavoidable. In this paper we introduce a new technique to construct lineage representations as deterministic decomposable circuits in polynomial time. We prove that this technique applies to a class of UCQs that had been conjectured to separate the complexity of the two approaches. In essence, we show that relying on the inclusion–exclusion formula can be avoided by using negation. This result brings back hope to prove that the intensional approach can handle all tractable UCQs.
Gradual typing is an effective approach to integrate static and dynamic typing, which supports the smooth transition between both extremes via the imprecision of type annotations. Gradual typing has been applied in many scenarios such as objects, subtyping, effects, ownership, typestates, information-flow typing, parametric polymorphism, etc. In particular, the combination of gradual typing and mutable references has been explored by different authors, giving rise to four different semantics—invariant, guarded, monotonic and permissive references. These semantics were specially crafted to reflect different design decisions with respect to precision and efficiency tradeoffs. Since then, progress has been made in the formulation of methodologies to systematically derive gradual counterparts of statically-typed languages, but these have not been applied to study mutable references.
In this article, we explore how the Abstracting Gradual Typing (AGT) methodology, which has been shown to be effective in a variety of settings, applies to mutable references. Starting from a standard statically-typed language with references, we systematically derive with AGT a novel gradual language. We establish the properties of this language; in particular, it is the first gradual language with mutable references that is proven to satisfy the gradual guarantee. We then compare it with the four main existing approaches to gradual references, and show that the application of AGT does justify one of the proposed semantics: we formally prove that the treatment of references in the derived language corresponds to the guarded semantics, by presenting a bisimulation with the coercion semantics of Herman et al. In the process, we uncover that any direct application of AGT yields a gradual language that is not space-efficient. We consequently adjust the dynamic semantics of the language to recover space efficiency. We then show how to extend it to support both monotonic and permissive references as well. Finally, we provide the first proof of the dynamic gradual guarantee for monotonic references. As a result, this paper sheds further light on the design space of gradual languages with mutable references and contributes to deepening the understanding of the AGT methodology.
In spite of several claims stating that some models are more interpretable than others (e.g., «linear models are more interpretable than deep neural networks») we still lack a principled notion of interpretability to formally compare different classes of models. We make a step towards such a notion by studying whether folklore interpretability claims have a correlate in terms of computational complexity theory. We focus on local post-hoc explainability queries that, intuitively, attempt to answer why individual inputs are classified in a certain way by a given model. In a nutshell, we say that a class C1 of models is more interpretable than another class C2 if the computational complexity of answering post-hoc queries for models in C2 is higher than for those in C1. We prove that this notion provides a good theoretical counterpart to current beliefs on the interpretability of models; in particular, we show that under our definition and assuming standard complexity-theoretical assumptions (such as P≠NP), both linear and tree-based models are strictly more interpretable than neural networks. Our complexity analysis, however, does not provide a clear-cut difference between linear and tree-based models, as we obtain different results depending on the particular post-hoc explanations considered. Finally, by applying a finer complexity analysis based on parameterized complexity, we are able to prove a theoretical result suggesting that shallow neural networks are more interpretable than deeper ones.
This paper describes our submission to the SIMAH challenge (SocIaL Media And Harassment). The proposed competition addresses the challenge of harassment detection on Twitter posts as well as the identification of a harassment category. Automatically detecting content containing harassment could be the basis for removing it. Accordingly, the task is considered to be an essential step. Distinguishing different types of harassment provides the means to control such a mechanism in a fine-grained way. In this work, we classify a set of Twitter posts into non-harassment or harassment tweets, where the latter are classified as indirect harassment, sexual harassment, or physical harassment. We explore how to use self-attention models for harassment classification in order to combine different baselines’ outputs. For a given post, we use the transformer architecture to encode each baseline output, exploiting relationships between baselines and posts. Then, the transformer learns how to combine the outputs of these methods with a BERT representation of the post, reaching a macro-averaged F-score of 0.481 on the SIMAH test set.
This paper presents an analytic study showing that it is entirely possible to analyze the sentiment of an Arabic dialect without constructing any resources. The idea of this work is to use the resources dedicated to a given dialect X for analyzing the sentiment of another dialect Y. The only condition is that X and Y belong to the same category of dialects. We apply this idea to the Algerian dialect, which is a Maghrebi Arabic dialect that suffers from limited available tools and other handling resources required for automatic sentiment analysis. To do this analysis, we rely on Maghrebi dialect resources and two manually annotated sentiment corpora for the Tunisian and Moroccan dialects, respectively. We also use a large corpus for the Maghrebi dialect. We use a state-of-the-art system and propose a new deep learning architecture for automatically classifying the sentiment of an Arabic dialect (the Algerian dialect). Experimental results show that the F1-score is up to 83%, and it is achieved by a Multilayer Perceptron (MLP) with the Tunisian corpus and by a Long Short-Term Memory (LSTM) network with the combination of the Tunisian and Moroccan corpora. An improvement of 15% over its closest competitor was observed through this study. Ongoing work is aimed at manually constructing an annotated sentiment corpus for the Algerian dialect and comparing the results.
Data discovery within large archives is a key issue for modern astronomy: multi-source, multi-wavelength, multi-instrument and large-scale verifications need proper data discovery tools for filtering the very large datasets of observations available nowadays. The Virtual Observatory and file format standards have contributed to allowing data discovery at the metadata level, where the filtering is circumscribed to what was explicitly annotated at the observation, calibration or data reduction stages. The next step is to perform data discovery at the content level, where content descriptors are automatically gathered from the observations to perform content-aware search. In a very general sense, this corresponds to automatically generating catalogs from large and diverse datasets. In this work, we consider the public spectroscopic data products from ALMA (FITS cubes), and we apply the fast Region of Interest Seek and Extraction algorithm (RoiSE) to obtain content descriptors of the spatial forms, positions, intensities and wavelengths of the source emissions. Despite the efficiency of the algorithm, it is impractical to process all the data in a batch/sequential manner. The problem was then to decide which tools and architecture to use for distributing the tasks across the datacenter. Among the several distributed/parallel computing alternatives, we selected the Dask packages to build the distributed pipeline that we outline in this paper, mainly because the current RoiSE implementation is written in Python. The main challenge of this pipeline is the diversity of data products: different resolutions, signal-to-noise ratios, densities, morphologies, imaging parameters, etc. Therefore, we include an adaptive parameter tuning mechanism to cope with this diversity. Finally, we present an example of content-aware data discovery over the obtained database.
The rise of bots and their influence on social networks is a hot topic that has aroused the interest of many researchers. Despite the efforts to detect social bots, it is still difficult to distinguish them from legitimate users. Here, we propose a simple yet effective semi-supervised method that allows distinguishing between bots and legitimate users with high accuracy. The method learns a joint representation of social connections and interactions between users by leveraging graph-based representation learning. Then, on the proximity graph derived from user embeddings, a sample of bots is used as seeds for a label propagation algorithm. We demonstrate that when the label propagation is done according to pairwise account proximity, our method achieves F1 = 0.93, whereas other state-of-the-art techniques achieve F1 ≤ 0.87. By applying our method to a large dataset of retweets, we uncover the presence of different clusters of bots in the network of Twitter interactions. Interestingly, such clusters feature different degrees of integration with legitimate users. By analyzing the interactions produced by the different clusters of bots, our results suggest that a significant group of users was systematically exposed to content produced by bots and to interactions with bots, indicating the presence of a selective exposure phenomenon.
In the consensus protocols used in most cryptocurrencies, participants called miners must find valid blocks of transactions and append them to a shared tree-like data structure. Ideally, the rules of the protocol should ensure that miners maximize their gains if they follow a default strategy, which consists of appending blocks only to the longest branch of the tree, called the blockchain. Our goal is to understand under which circumstances miners are encouraged to follow the default strategy. Unfortunately, most of the existing models work with simplified payoff functions, without considering the possibility that rewards decrease over time because of the game rules (like in Bitcoin), nor integrating the fact that a miner naturally prefers to be paid earlier rather than later (the economic concept of discount). In order to integrate these factors, we consider a more general model where issues such as economic discount and decreasing rewards can be set as parameters of an infinite stochastic game. In this model, we study the limit situation in which a miner does not receive a full reward for a block if it stops being in the blockchain. We show that if rewards are not decreasing, then miners do not have incentives to create new branches, no matter how high their computational power is. On the other hand, when working with decreasing rewards similar to those in Bitcoin, we show that miners have an incentive to create such branches. Nevertheless, this incentive only occurs when a miner controls a proportion of the computational power which is close to half of the computational power of the entire network.
Conjunctive queries are one of the most common classes of queries used in database systems, and the best studied in the literature. A seminal result of Grohe, Schwentick, and Segoufin (STOC 2001) demonstrates that for every class G of graphs, the evaluation of all conjunctive queries whose underlying graph is in G is tractable if, and only if, G has bounded treewidth. In this work, we extend this characterization to the counting problem for conjunctive queries. Specifically, for every class C of conjunctive queries with bounded treewidth, we introduce the first fully polynomial-time randomized approximation scheme (FPRAS) for counting answers to a query in C, and the first polynomial-time algorithm for sampling answers uniformly from a query in C. As a corollary, it follows that for every class G of graphs, the counting problem for conjunctive queries whose underlying graph is in G admits an FPRAS if, and only if, G has bounded treewidth (unless BPP≠P). In fact, our FPRAS is more general, and also applies to conjunctive queries with bounded hypertree width, as well as unions of such queries.
The key ingredient in our proof is the resolution of a fundamental counting problem from automata theory. Specifically, we demonstrate the first FPRAS and polynomial time sampler for the set of trees of size n accepted by a tree automaton, which improves the prior quasi-polynomial time randomized approximation scheme (QPRAS) and sampling algorithm of Gore, Jerrum, Kannan, Sweedyk, and Mahaney ’97. We demonstrate how this algorithm can be used to obtain an FPRAS for many hitherto open problems, such as counting solutions to constraint satisfaction problems (CSP) with bounded hypertree-width, counting the number of error threads in programs with nested call subroutines, and counting valid assignments to structured DNNF circuits.
We study two simple yet general complexity classes, which provide a unifying framework for efficient query evaluation in areas like graph databases and information extraction, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect.
Both complexity classes are defined in terms of nondeterministic logarithmic-space transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm (with preprocessing) for uniform generation. Remarkably, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n (given in unary) accepted by a nondeterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether this problem admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
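To make the #NFA problem concrete, the following is a minimal sketch of an exact counter that tracks, for every reachable set of NFA states, how many distinct strings lead to exactly that set. It is not the FPRAS described above: it is a simple subset-based dynamic program whose running time can be exponential in the number of states, shown only to pin down what is being counted. The function name `count_accepted` and the dictionary encoding of the transition relation are assumptions made for illustration.

```python
def count_accepted(n, alphabet, delta, initial, finals):
    """Count the distinct strings of length n accepted by an NFA.

    delta:   dict (state, symbol) -> iterable of successor states
    initial: iterable of initial states
    finals:  set of accepting states
    """
    # Each string reaches exactly one set of states, so grouping strings
    # by that set counts every string once, even if it has many runs.
    layer = {frozenset(initial): 1}
    for _ in range(n):
        nxt = {}
        for subset, count in layer.items():
            for symbol in alphabet:
                target = frozenset(
                    q2 for q in subset for q2 in delta.get((q, symbol), ())
                )
                if target:  # drop strings with no surviving run
                    nxt[target] = nxt.get(target, 0) + count
        layer = nxt
    return sum(count for subset, count in layer.items() if subset & finals)


# NFA over {a, b} accepting the strings that contain at least one 'a'.
# State 0 guesses when an 'a' is "the" witness; state 1 is accepting.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {1},
    (1, "b"): {1},
}
total = count_accepted(3, "ab", delta, initial={0}, finals={1})
```

The example automaton is ambiguous (the string `aaa` has several accepting runs), so counting accepting runs would overcount; counting by reachable state sets gives 7, that is, all 8 strings of length 3 except `bbb`.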
We describe some approaches to explanations for observed outcomes in data management and machine learning. They are based on the assignment of numerical scores to predefined and potentially relevant inputs. More specifically, we consider explanations for query answers in databases, and for results from classification models. The described approaches are mostly of a causal and counterfactual nature. We argue for the need to bring domain and semantic knowledge into score computations; and suggest some ways to do this.
We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It was established in the 1950s, and is theoretically justified by being the only wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers; and for the hard cases we present approximation algorithms.
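The general idea can be sketched with a brute-force computation of the standard Shapley formula. The sketch makes a simplifying assumption: every tuple is treated as a player, and the value of a coalition is 1 when a Boolean query holds on it and 0 otherwise. The names `shapley` and `query_holds`, and the toy relations R and S, are hypothetical; the loop enumerates all coalitions, so it is exponential and only suitable for tiny examples.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def shapley(players, value, p):
    """Shapley value of player p in the coalition game (players, value)."""
    others = [x for x in players if x != p]
    n = len(players)
    total = Fraction(0)
    for k in range(len(others) + 1):
        # Weight of a coalition of size k: k! (n - k - 1)! / n!
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            total += weight * (value(s | {p}) - value(s))
    return total


def query_holds(facts):
    """Boolean conjunctive query: is there an x with R(x) and S(x)?"""
    r_values = {x for (rel, x) in facts if rel == "R"}
    s_values = {x for (rel, x) in facts if rel == "S"}
    return 1 if r_values & s_values else 0


facts = [("R", 1), ("S", 1), ("R", 2)]
contributions = {f: shapley(facts, query_holds, f) for f in facts}
```

Here R(1) and S(1) are both needed to satisfy the query and each receives 1/2, while R(2) never changes the answer and receives 0; the contributions sum to the value of the full database, as the efficiency axiom requires.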
There is a recently established correspondence between database tuples as causes for query answers in databases and tuple-based repairs of inconsistent databases with respect to denial constraints. In this work, answer-set programs that specify database repairs are used as a basis for solving computational and reasoning problems around causality in databases, including causal responsibility. Furthermore, causes are introduced also at the attribute level by appealing to an attribute-based repair semantics that uses null values. Corresponding repair-programs are introduced, and used as a basis for computation and reasoning about attribute-level causes. The answer-set programs are extended in order to capture causality under integrity constraints.
SHACL (SHape Constraint Language) is a W3C recommendation for validating graph-based data against a set of constraints (called shapes). Importantly, SHACL allows defining recursive shapes, i.e. a shape may refer to itself, directly or indirectly. The recommendation left open the semantics of recursive shapes, but proposals have emerged recently to extend the official semantics to support recursion. These proposals are based on the principle of possibility (or non-contradiction): a graph is considered valid against a schema if one can assign shapes to nodes in such a way that all constraints are satisfied. This semantics is not constructive, as it does not provide guidelines about how to obtain such an assignment, and it may lead to unfounded assignments, where the only reason to assign a shape to a node is that it allows validating the graph.
In contrast, we propose in this paper a stricter, more constructive semantics for SHACL, based on stable models, which are well-known in Answer Set Programming (ASP). This semantics additionally requires a shape assignment to be properly justified by the input constraints. We further exploit the connection to logic programming, and show that SHACL constraints can be naturally represented as logic programs, and that the validation problem for a graph and a SHACL schema can be encoded as an ASP reasoning task. The proposed semantics also enjoys computationally tractable validation in the presence of constraints with stratified negation (as opposed to the previous semantics). We also extend our semantics to 3-valued stable models, which yields a more relaxed notion of validation, tolerant to certain faults in the schema or data. By exploiting a connection between 3-valued stable model semantics and the well-founded semantics for logic programs, we can use our translation into ASP to show another tractability result. Finally, we provide a preliminary evaluation of the approach, which leverages an ASP solver to perform graph validation.
As graph databases grow in popularity, decades of work in graph query languages and models are materialising in industry standards and in the construction of new graph database systems. However, this surge in graph systems has in turn opened up a series of new, interesting research problems related to graph databases.
Our first set of problems has to do with more efficient ways of computing the answers of graph queries, specifically graph patterns, path queries, and combinations of them. Traditionally, researchers in graph databases have pointed out that relational systems are ill-equipped to process these types of queries, and if one looks at the performance of native graph database systems, there is clearly a lot of room for improvement. The talk focuses on two possible directions for improving the state of the art in graph query processing. The first is implementing worst-case optimal algorithms for processing graph patterns that translate into relational queries with several joins. Some advances are already in development (see e.g. Nguyen, Dung, et al. «Join processing for graph patterns: An old dog with new tricks.» GRADES’15. or Hogan, Aidan, et al. «A Worst-Case Optimal Join Algorithm for SPARQL.» ISWC’19.), but we are still far from a full-fledged solution: most algorithms require complex data structures, or need further support in terms of heuristics to select an order in which joins are processed. Second, we need to understand the best way of evaluating path queries (that is, finding all pairs of nodes connected by a path), in such a way that these results can be further integrated with other query results in a graph system pipeline. We already have complexity results regarding path computation and enumeration for different semantics of path queries (see e.g. Martens, Wim, and Tina Trautner. «Evaluation and enumeration problems for regular path queries.» ICDT’18. or Bagan, Guillaume, Angela Bonifati, and Benoit Groz. «A trichotomy for regular simple path queries on graphs.» PODS’13.), but still very little is known in terms of optimal processing of path queries when inside a tractable fragment.
Our second set of problems is related to graph analytics, one of the current selling points of graph databases. Systems should be able to run more complex analytical queries involving tasks such as advanced path finding, centrality or clustering. It is also important to be able to run these algorithms not over native graphs, but perhaps over a certain set of nodes or edges previously selected by a graph query, and one may also want to pose further queries over the result of the analytics task. Finally, all of this should be done in an efficient way, especially given the prospect that graph databases may contain a huge number of nodes. In this talk I will discuss possible approaches to perform these operations, covering aspects from the design of languages for graph analytics to efficient ways of processing them, and also comparing the expressive power of graph analytics solutions with other forms of graph computation.
Language support for differentially-private programming is both crucial and delicate. While elaborate program logics can be very expressive, type-system based approaches using linear types tend to be more lightweight and amenable to automatic checking and inference, and in particular in the presence of higher-order programming. Since the seminal design of Fuzz, which is restricted to ϵ-differential privacy, a lot of effort has been made to support more advanced variants of differential privacy, like (ϵ,δ)-differential privacy. However, supporting these advanced privacy variants while also supporting higher-order programming in full has been proven to be challenging. We present Jazz, a language and type system which uses linear types and latent contextual effects to support both advanced variants of differential privacy and higher-order programming. Even when avoiding advanced variants and higher-order programming, our system achieves higher precision than prior work for a large class of programming patterns. We formalize the core of the Jazz language, prove it sound for privacy via a logical relation for metric preservation, and illustrate its expressive power through a number of case studies drawn from the recent differential privacy literature.
Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web still resides in other formats, and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate this information into SPARQL query answers, obtaining a language that combines data from the “traditional” Web with the Semantic Web. Our proposal is based on an extension of the SERVICE operator with the ability to connect to JSON APIs. With the aim of evaluating these queries as efficiently as possible, we show that the main bottleneck is the number of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We note that this algorithm is analyzed in terms of an algorithm for evaluating relational queries with access methods using the minimal number of access queries, which is of independent interest. We show the superiority of the worst-case optimal approach in a series of experiments that take existing SPARQL benchmarks, and augment them with the ability to connect to JSON APIs in order to obtain additional information.
The automatic detection of rumors in social networks is an important problem that would allow counteracting the effects that the propagation of false information produces. We study the performance of deep learning architectures in this problem, analyzing ten different machines on word2vec and BERT. Our results show that some architectures are more suitable for some particular classes, suggesting that the use of committee machines would offer advantages in this task.
There is a resurgence of interest in political parties. This resurgent interest embraces a minimalist definition of political parties, according to which any group that competes in elections and receives a handful of votes qualifies as a party. Parties, however, are expected to contribute to democratic representation, and the party politics literature has extensively shown that many “parties” do not fulfill this expectation. These entities that possess some but not all defining features of political parties can be considered diminished subtypes of the category. A thorough conceptualization of diminished subtypes could improve the analytical value of the study of political parties and of other forms of electoral political organizations. In this article, therefore, we put forth a new typology of diminished subtypes of political parties based on the presence or absence of two primary attributes: horizontal coordination of ambitious politicians during electoral campaigns and while in office and vertical aggregation to electorally mobilize collective interests and to intermediate and channel collective demands.
Representation learning has been a fruitful area in recent years, driven by the growing interest in deep learning methods. In particular, word representation learning, a.k.a. word embeddings, has triggered progress in different natural language processing (NLP) tasks. Despite the success of word embeddings in tasks such as named entity recognition or textual entailment, their use is still embryonic in query expansion. In this work, we examine the usefulness of word embeddings to represent queries and documents in query-document matching tasks. For this purpose, we use a re-ranking strategy. The re-ranking phase is conducted using representations of queries and documents based on word embeddings. We introduce IDF average word embeddings, a new text representation strategy based on word embeddings, which allows us to create a query vector representation that gives higher weight to informative terms. Experimental results on TREC benchmark datasets show that our proposal consistently achieves the best results in terms of MAP.
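The following is a minimal sketch of what an IDF-weighted average of word embeddings could look like. The abstract does not give the exact formula, so the weighting (log of the number of documents over the document frequency) and the names `idf_table` and `idf_average` are assumptions made only for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def idf_table(documents):
    """Map each term to log(N / df), where df is its document frequency.

    `documents` is a list of token lists. This is one common IDF variant,
    assumed here for illustration.
    """
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def idf_average(tokens, embeddings, idf):
    """Return the IDF-weighted mean of the vectors of the known tokens.

    Tokens missing from `embeddings` or `idf` are ignored. If the total
    weight is zero, a zero vector is returned.
    """
    dim = len(next(iter(embeddings.values())))
    known = [t for t in tokens if t in embeddings and t in idf]
    total_weight = sum(idf[t] for t in known)
    if total_weight == 0:
        return [0.0] * dim
    return [
        sum(idf[t] * embeddings[t][i] for t in known) / total_weight
        for i in range(dim)
    ]
```

In this sketch a term that occurs in every document gets weight zero, so a query such as "the cat" is represented almost entirely by the vector of the more informative term.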
The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data and the compilation of the Spanish benchmarks.
This article presents an innovation project that aims to contribute, from the ICT perspective, to the needs of the health sector, specifically regarding interoperability and the generation of information from distributed sources. To this end, the technological objective is to develop, at prototype level, a standardized interoperable repository and intelligent applications that can feasibly be deployed at scale, in order to contribute to the timely care of patients through information generated by processing radiological images and the reports associated with clinical records.
Many interesting search problems can be formulated as bi-objective search problems, that is, search problems where two kinds of costs have to be minimized, for example, travel distance and time for transportation problems. Bi-objective search algorithms have to maintain the set of undominated paths from the start state to each state to compute the set of paths from the start state to the goal state that are not dominated by some other path from the start state to the goal state (called the Pareto-optimal solution set). Each time they find a new path to a state s, they perform a dominance check to determine whether this path dominates any of the previously found paths to s or whether any of the previously found paths to s dominates this path. Existing algorithms do not perform these checks efficiently. On the other hand, our Bi-Objective A* (BOA*) algorithm requires only constant time per check. In our experimental evaluation, we show that BOA* can run an order of magnitude (or more) faster than state-of-the-art bi-objective search algorithms, such as NAMOA*, NAMOA*dr, Bi-Objective Dijkstra, and Bidirectional Bi-Objective Dijkstra.
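The constant-time dominance check mentioned above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes nodes are expanded in lexicographic order of (f1, f2) with a consistent heuristic (zero by default), so that a new path to a state is dominated exactly when its second cost is not smaller than the smallest second cost already expanded for that state. The function name `boa_star` and the graph encoding are hypothetical.

```python
import heapq

INF = float("inf")


def boa_star(graph, start, goal, h=lambda s: (0, 0)):
    """Return the Pareto-optimal (cost1, cost2) pairs from start to goal.

    graph maps a state to a list of (successor, cost1, cost2) triples.
    h maps a state to a pair of consistent heuristic estimates.
    """
    g2_min = {}      # smallest second cost of any expanded path, per state
    solutions = []   # found in increasing cost1, decreasing cost2
    open_list = [(h(start), (0, 0), start)]
    while open_list:
        (f1, f2), (g1, g2), s = heapq.heappop(open_list)
        # Constant-time dominance check: one comparison against the state
        # itself and one against the goal, instead of scanning a path set.
        if g2 >= g2_min.get(s, INF) or f2 >= g2_min.get(goal, INF):
            continue
        g2_min[s] = g2
        if s == goal:
            solutions.append((g1, g2))
            continue
        for t, c1, c2 in graph.get(s, []):
            ng1, ng2 = g1 + c1, g2 + c2
            th1, th2 = h(t)
            # Prune at generation time with the same two comparisons.
            if ng2 >= g2_min.get(t, INF) or ng2 + th2 >= g2_min.get(goal, INF):
                continue
            heapq.heappush(open_list, ((ng1 + th1, ng2 + th2), (ng1, ng2), t))
    return solutions
```

The check works because, under the assumed expansion order, every later path to the same state has a first cost at least as large as the earlier ones, so only the second cost needs to be compared.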
Grid path planning is an important problem in AI. Its understanding has been key for the development of autonomous navigation systems. An interesting and rather surprising fact about the vast literature on this problem is that only a few neighborhoods have been used when evaluating these algorithms. Indeed, only the 4- and 8-neighborhoods are usually considered, and rarely the 16-neighborhood. This paper describes three contributions that enable the construction of effective grid path planners for extended $2^k$-neighborhoods; that is, neighborhoods that admit $2^k$ neighbors per state, where k is a parameter. First, we provide a simple recursive definition of the $2^k$-neighborhood in terms of the $2^{k-1}$-neighborhood. Second, we derive distance functions, for any k ≥ 2, which allow us to propose admissible heuristics that are perfect for obstacle-free grids, which generalize the well-known Manhattan and Octile distances. Third, we define the notion of canonical path for the $2^k$-neighborhood; this allows us to incorporate our neighborhoods into two versions of A*, namely Canonical A* and Jump Point Search (JPS), whose performance, we show, scales well when increasing k. Our empirical evaluation shows that, when increasing k, the cost of the solution found improves substantially. Used with the $2^k$-neighborhood, Canonical A* and JPS, in many configurations, are also superior to the any-angle path planner Theta* both in terms of solution quality and runtime. Our planner is competitive with one implementation of the any-angle path planner, ANYA in some configurations. Our main practical conclusion is that standard, well-understood grid path planning technology may provide an effective approach to any-angle grid path planning.
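One way to read the recursive definition is sketched below: starting from the four axis-aligned moves, each step inserts, between every pair of consecutive moves, the move given by their vector sum. The abstract does not spell the construction out, so this reading and the function name `neighborhood` are assumptions used only for illustration.

```python
def neighborhood(k):
    """Return the moves of the 2^k-neighborhood in counter-clockwise order.

    Assumed construction: the 4-neighborhood is the base case (k = 2), and
    the 2^k-neighborhood adds, between each pair of consecutive moves of
    the 2^(k-1)-neighborhood, the sum of those two moves.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if k == 2:
        return [(1, 0), (0, 1), (-1, 0), (0, -1)]
    previous = neighborhood(k - 1)
    moves = []
    for i, (x, y) in enumerate(previous):
        next_x, next_y = previous[(i + 1) % len(previous)]
        moves.append((x, y))
        moves.append((x + next_x, y + next_y))
    return moves
```

With this construction, k = 3 yields the familiar 8-neighborhood, and k = 4 adds knight-like moves such as (2, 1) and (1, 2).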
This article presents sparql-gremlin, a tool to translate SPARQL queries to Gremlin pattern matching traversals. Currently, sparql-gremlin is a plugin of the Apache TinkerPop graph computing framework, thus the users can run queries expressed in the W3C SPARQL query language over a wide variety of graph data management systems, including both OLTP graph databases and OLAP graph processing frameworks. With sparql-gremlin, we perform the first step to bridge the query interoperability gap between the Semantic Web and Graph database communities. The plugin has received adoption from both academia and industry research in its short timespan.
Higher education institutions are increasingly considering the use of a form of blended learning, commonly named as flipped classroom (FC), in which students watch video lectures drawn from a massive online open course (MOOC) before a face-to-face lecture. This methodology is attractive, as it allows institutions to reuse high-quality material developed for MOOCs, while increasing learning flexibility and the students’ autonomy. However, the adoption of this methodology is low in general, especially in Engineering courses, as its implementation faces a number of challenges for students. The most salient challenge is the lack of student self-regulatory skills, which may result in frustration and low performance. In this paper, we study how a self-regulatory learning technological scaffold, which provides students with feedback about their activity in the MOOC, affects the engagement and performance of students in an Engineering course following a MOOC-based FC approach. To this end, we design an observational study with the participation of 242 students: 133 students in the experimental group (EG) who used a technological scaffold and 109 in the control group (CG) who did not. We did not find a statistically significant difference between the academic achievements of both groups. However, the EG exhibited a statistically significant greater engagement with the course and a more accurate strategic planning than the CG. The main implications for scaffolding self-regulated learning in FC derived from these results are discussed.
Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma$ lower bounds all the previous relevant ones (including $z$), yet length-$n$ strings can be represented and efficiently indexed within space $O(\gamma \log(n/\gamma))$, which also upper bounds most measures (including $z$). While $\gamma$ is certainly a better measure of repetitiveness than $z$, it is NP-complete to compute, and no $o(\gamma \log n)$-space representation of strings is known. In this paper, we study a smaller measure, $\delta \le \gamma$, which can be computed in linear time. We show that $\delta$ better captures the compressibility of repetitive strings. For every length $n$ and every value $\delta \ge 2$, we construct a string such that $\gamma = \Omega(\delta \log(n/\delta))$. Still, we show a representation of any string $S$ in $O(\delta \log(n/\delta))$ space that supports direct access to any character $S[i]$ in time $O(\log(n/\delta))$ and finds the $occ$ occurrences of any pattern $P[1..m]$ in time $O(m \log n + occ \log^{\varepsilon} n)$ for any constant $\varepsilon > 0$. Further, we prove that no $o(\delta \log n)$-space representation exists: for every length $n$ and every value $2 \le \delta \le n^{1-\varepsilon}$, we exhibit a string family whose elements can only be encoded in $\Omega(\delta \log(n/\delta))$ space. We complete our characterization of $\delta$ by showing that, although $\gamma$, $z$, and other repetitiveness measures are always $O(\delta \log(n/\delta))$, for strings of any length $n$, the smallest context-free grammar can be of size $\Omega(\delta \log^2 n / \log\log n)$. No such separation is known for $\gamma$.
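The abstract does not define δ. The sketch below assumes the definition used in the substring-complexity literature, δ = max over k of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S. It is a naive polynomial-time illustration for small inputs, not the linear-time algorithm the abstract refers to, and the function names are hypothetical.

```python
def substring_complexity(s, k):
    """Number of distinct substrings of s with length exactly k."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def delta(s):
    """Naive computation of delta = max_k d_k(s) / k for a non-empty string.

    Assumes the substring-complexity definition of delta; this brute-force
    version enumerates every substring length and is meant only to make
    the measure concrete.
    """
    if not s:
        raise ValueError("s must be non-empty")
    return max(substring_complexity(s, k) / k for k in range(1, len(s) + 1))
```

For a run of a single repeated character every d_k equals 1, so the measure stays at 1 regardless of length, which matches the intuition that such a string is highly repetitive.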
Consider a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess the collection into a data structure that, given a query consisting of a pattern P and a value k, can return the k documents of the collection most relevant to P. The relevance of a document d for a pattern P is given by a predefined ranking function. Linear space and optimal query time solutions already exist for this problem. In this paper we consider a novel problem, document selection, in which a query aims to report the kth document most relevant to P (instead of reporting all top-k documents). We present a data structure using space, for any constant , answering selection queries in time , and a linear-space data structure answering queries in time , given the locus node of P in a (generalized) suffix tree of the collection. We also prove that it is unlikely that a succinct-space solution for this problem exists with poly-logarithmic query time, and that is indeed optimal within space for most text families. Finally, we present some additional space-time trade-offs exploring the extremes of those lower bounds.
We address the problem of representing dynamic graphs using $k^2$-trees. The $k^2$-tree data structure is one of the succinct data structures proposed for representing static graphs, and binary relations in general. It relies on compact representations of bit vectors. Hence, by relying on compact representations of dynamic bit vectors, we can also represent dynamic graphs. In this paper we follow instead the ideas by Munro et al., and we present an alternative implementation for representing dynamic graphs using $k^2$-trees. Our experimental results show that this new implementation is competitive in practice.
We propose answer-set programs that specify and compute counterfactual interventions on entities that are input on a classification model. In relation to the outcome of the model, the resulting counterfactual entities serve as a basis for the definition and computation of causality-based explanation scores for the feature values in the entity under classification, namely «responsibility scores». The approach and the programs can be applied with black-box models, and also with models that can be specified as logic programs, such as rule-based classifiers. The main focus of this work is on the specification and computation of «best» counterfactual entities, i.e. those that lead to maximum responsibility scores. From them one can read off the explanations as maximum responsibility feature values in the original entity. We also extend the programs to bring into the picture semantic or domain knowledge. We show how the approach could be extended by means of probabilistic methods, and how the underlying probability distributions could be modified through the use of constraints. Several examples of programs written in the syntax of the DLV ASP-solver, and run with it, are shown.
Work on knowledge graphs and graph-based data management often focuses either on declarative graph query languages or on frameworks for graph analytics, and there has been little work on combining both approaches. However, many real-world tasks conceptually involve combinations of these approaches: a graph query can be used to select the appropriate data, which is then enriched with analytics, and then possibly filtered or combined again with other data by means of a query language. In this paper we propose a declarative language that is well suited to perform graph querying and analytical tasks. We do this by proposing a minimalistic extension of SPARQL to allow for expressing analytical tasks; in particular, we propose to extend SPARQL with recursive features, and provide a formal syntax and semantics for our language. We show that this language can express key analytical tasks on graphs (in fact, it is Turing complete), offering a more declarative alternative to existing frameworks and languages. We show how procedures in our language can be implemented over an off-the-shelf SPARQL engine with a specialised client that allows parallelisation and batch-based processing when memory is limited. Results show that with such an implementation, procedures for popular analytics currently run in seconds or minutes for selective sub-graphs (our target use-case) but struggle at larger scales.
The predecessor problem is a key component of the fundamental sorting-and-searching core of algorithmic problems. While binary search is the optimal solution in the comparison model, more realistic machine models on integer sets open the door to a rich universe of data structures, algorithms, and lower bounds. In this article, we review the evolution of the solutions to the predecessor problem, focusing on the important algorithmic ideas, from the famous data structure of van Emde Boas to the optimal results of Patrascu and Thorup. We also consider lower bounds, variants, and special cases, as well as the remaining open questions.
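As a point of reference for the comparison-model baseline mentioned above, here is a minimal sketch of a predecessor query answered by binary search over a sorted list. It assumes the usual definition of the predecessor of x as the largest stored key not exceeding x; the function name `predecessor` is chosen for illustration.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x in sorted_keys, or None if there is none.

    sorted_keys must be sorted in non-decreasing order. The search takes
    O(log n) comparisons.
    """
    position = bisect_right(sorted_keys, x)
    return sorted_keys[position - 1] if position > 0 else None
```

The structures surveyed in the article improve on this baseline by exploiting the fact that keys are integers in a bounded universe rather than opaque comparable items.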
This article presents a reflection on global phenomena related to the current crisis of political representation in Latin America, which is framed by both structural and conjunctural changes in the region. To this end, we analyze six phenomena that form the context of the current crisis and that refer to changes in the social structure, and then turn to the main consequences they generate.
The raster model is commonly used for the representation of images in many domains, and is especially useful in Geographic Information Systems (GIS) to store information about continuous variables of the space (elevation, temperature, etc.). Current representations of raster data are usually designed for external memory or, when stored in main memory, lack efficient query capabilities. In this paper we propose compact representations to efficiently store and query raster datasets in main memory. We present different representations for binary raster data, general raster data and time-evolving raster data. We experimentally compare our proposals with traditional storage mechanisms such as linear quadtrees or compressed GeoTIFF files. Results show that our structures are up to 10 times smaller than classical linear quadtrees, and even comparable in space to non-queryable representations of raster data, while efficiently answering a number of typical queries.
Conditional Cash Transfers (CCTs) have spread through the developing world in the past two decades. It is often assumed that CCTs enjoy political support in the population precisely because they impose conditions on beneficiaries. This article employs survey experiments in Brazil and Turkey to determine whether, and in what contexts, making government transfers conditional on behaviour of beneficiaries increases political support for the programmes. Results show that conditional transfers are only marginally more popular than similar unconditional transfers in nationally representative samples, but that this difference is substantially larger among the better-off and among those primed to think of themselves as different from beneficiaries. These findings imply that conditionalities per se are not as strong a determinant of support for transfers as the literature suggests, but that they can still be helpful in building support for transfers among subsets of the population that are least likely to support them.
Hate speech is an important problem that is seriously affecting the dynamics and usefulness of online social communities. Large scale social platforms are currently investing important resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance but only within specific datasets, most of them in English. In this work, we analyze this apparent contradiction between existing literature and actual applications. We study closely the experimental methodology used in prior work and its generalizability to other datasets. Our findings reveal methodological issues, as well as an important dataset bias. As a consequence, performance claims of the current state-of-the-art have become significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-conduct experiments to give a more accurate picture of the current state-of-the-art methods. Moreover, we design some baseline approaches to perform cross-lingual experiments, using English and Spanish datasets.
The Uruguayan political system underwent major changes during 2019. The Frente Amplio (Broad Front, FA), a center-left party that had been in office since 2005, lost the national elections against a coalition of center-right parties. This coalition was named Coalición Multicolor (Multicolor Coalition) and was led by Luis Lacalle Pou from the Partido Nacional (National Party, PN). The 2019 national election results yielded not only a government turnover but also a restructuring of the parties on the political right. A new rightist party, Cabildo Abierto (CA), had a meteoric rise. Led by a former Army Commander-in-Chief, CA managed to win 10% of the votes and has become a key actor in the government coalition. This article illustrates the importance of economic variables in presidential elections in Uruguay.
Solving a Multi-Agent Pathfinding (MAPF) problem involves finding non-conflicting paths that lead a number of agents to their goal location. In the sum-of-costs variant of MAPF, one is also required to minimize the total number of moves performed by agents before stopping at the goal. Not surprisingly, since MAPF is combinatorial, a number of compilations to Satisfiability solving (SAT) and Answer Set Programming (ASP) exist. In this paper, we propose the first family of compilations to ASP that solve sum-of-costs MAPF over 4-connected grids. Unlike existing compilations to ASP that we are aware of, our encoding produces, after grounding, a number of clauses that is linear in the number of agents. In addition, the representation of the optimization objective is also carefully written, such that its size after grounding does not depend on the size of the grid. In our experimental evaluation, we show that our approach outperforms search- and SAT-based sum-of-costs MAPF solvers when grids are congested with agents.
Incremental heuristic search algorithms are a class of heuristic search algorithms applicable to the problem of goal-directed navigation. D* and D*Lite are among the most well-known algorithms for this problem. Recently, two new algorithms have been shown to outperform D*Lite in relevant benchmarks: Multi-Path Adaptive A* (MPAA*) and D*ExtraLite. Existing empirical evaluations, unfortunately, do not allow drawing meaningful conclusions regarding the strengths and weaknesses of these algorithms. Indeed, in the paper introducing D*ExtraLite, it is shown that D*Lite outperforms MPAA* in benchmarks in which the authors of MPAA* claim superiority over D*Lite. The existence of contradictory published data prevents practitioners from deciding which algorithm to use for a specific application. In this paper, we analyze two factors that significantly influence the performance of MPAA*, explaining why it is possible to obtain very different results depending on such factors. We identify a configuration of MPAA* which, in the majority of the benchmark problems we use, exhibits superior performance when compared to both D*Lite and D*ExtraLite. We conclude that MPAA* should be the algorithm of choice in goal-directed navigation scenarios in which the heuristic is accurate, whereas D*ExtraLite should be preferred when the heuristic is inaccurate.
Twitter constitutes a rich resource for investigating language contact phenomena. In this paper, we report findings from the analysis of a large-scale diachronic corpus of over one million tweets, containing loanwords from te reo Māori, the indigenous language spoken in New Zealand, into (primarily, New Zealand) English. Our analysis focuses on hashtags comprising mixed-language resources (which we term hybrid hashtags), bringing together descriptive linguistic tools (investigating length, word class, and semantic domains of the hashtags) and quantitative methods (Random Forests and regression analysis). Our work has implications for language change and the study of loanwords (we argue that hybrid hashtags can be linked to loanword entrenchment), and for the study of language on social media (we challenge proposals of hashtags as “words,” and show that hashtags have a dual discourse role: a micro-function within the immediate linguistic context in which they occur and a macro-function within the tweet as a whole).
Given a set of d-dimensional boxes (i.e., axis-aligned hyperrectangles), a minimum coverage kernel is a subset of minimum size covering the same region as the whole set. Computing it is NP-hard, but as for many similar NP-hard problems (e.g., Box Cover, and Orthogonal Polygon Covering), the problem becomes solvable in polynomial time under restrictions on the input set. We show that computing minimum coverage kernels remains NP-hard even when restricting the graph induced by the input to a highly constrained class of graphs. Alternatively, we present two polynomial-time approximation algorithms for this problem: one deterministic with an approximation ratio within , and one randomized with an improved approximation ratio within (with high probability).
We present the first solution to finding τ-majorities on tree paths. Given a tree of n nodes, each with a label from , and a fixed threshold τ, such a query gives two nodes u and v and asks for all the labels that appear more than τ times the number of nodes in the path from u to v. Note that the answer to any query is of size up to 1/τ. On a w-bit RAM, we obtain a linear-space data structure with query time, which is worst-case optimal for polylogarithmic-sized alphabets. We also describe two succinct-space solutions with query time ⁎. One uses bits, where is the entropy of the label distribution; the other uses bits. By using just extra bits, our succinct structures allow τ to be specified at query time. We obtain analogous results to find a τ-minority, that is, an element that appears at least once but is not a τ-majority in the path.
We present a compact data structure to represent both the duration and length of homogeneous segments of trajectories of moving objects in a way that, like a data warehouse, allows us to efficiently answer cumulative queries. The division of trajectories into relevant segments has been studied in the literature under the topic of Trajectory Segmentation. In this paper, we design a data structure to compactly represent them and the algorithms to answer the most relevant queries. We experimentally evaluate our proposal in the real context of an enterprise with mobile workers (truck drivers) where we aim at analyzing the time they spend in different activities. To test our proposal under higher stress conditions, we generated a large number of realistic synthetic trajectories and evaluated our system on those data to estimate its space needs and its efficiency when answering different types of queries.
Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now have several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible.
We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.
Lempel-Ziv is an easy-to-compute member of a wide family of so-called macro schemes; it restricts pointers to go in one direction only. Optimal bidirectional macro schemes are NP-complete to find, but they may provide much better compression on highly repetitive sequences. We consider the problem of approximating optimal bidirectional macro schemes. We describe a simulated annealing algorithm that usually converges quickly. Moreover, in some cases, we obtain bidirectional macro schemes that are provably a 2-approximation of the optimal. We test our algorithm on a number of artificial repetitive texts and verify that it is efficient in practice and outperforms Lempel-Ziv, sometimes by a wide margin.
There are many representations of planar graphs, but few are as elegant as Turán’s (1984): it is simple and practical, uses only 4 bits per edge, can handle self-loops and multi-edges, and can store any specified embedding. Its main disadvantage has been that “it does not allow efficient searching” (Jacobson, 1989). In this paper we show how to add a sublinear number of bits to Turán’s representation such that it supports fast navigation while retaining simplicity. As a consequence of the inherited simplicity, we offer the first efficient parallel construction of a compact encoding of a planar graph embedding. Our experimental results show that the resulting representation uses about 6 bits per edge in practice, supports basic navigation operations within a few microseconds, and can be built sequentially at a rate below 1 microsecond per edge, featuring a linear speedup with a parallel efficiency around 50% for large datasets.
In the range $\alpha$-majority query problem, we are given a sequence $S[1..n]$ and a fixed threshold $\alpha \in (0,1)$, and are asked to preprocess $S$ such that, given a query range $[i..j]$, we can efficiently report the symbols that occur more than $\alpha(j-i+1)$ times in $S[i..j]$, which are called the range $\alpha$-majorities. In this article we describe the first compressed dynamic data structure for range $\alpha$-majority queries. It represents $S$ in compressed space, namely $nH_k + o(n \lg \sigma)$ bits for any $k = o(\lg_\sigma n)$, where $\sigma$ is the alphabet size and $H_k \le H_0 \le \lg \sigma$ is the $k$th order empirical entropy of $S$, and answers queries in $O(\lg n / (\alpha \lg\lg n))$ time while supporting insertions and deletions in $S$ in $O(\lg n / \alpha)$ amortized time. We then show how to modify our data structure to receive some $\beta \ge \alpha$ at query time and report the range $\beta$-majorities in $O(\lg n / (\beta \lg\lg n))$ time, without increasing the asymptotic space or update-time bounds. The best previous dynamic solution has the same query and update times as ours, but it occupies $O(n)$ words and cannot take advantage of being given a larger threshold $\beta$ at query time. We also design the first dynamic data structure for range $\alpha$-minority (i.e., find a non-$\alpha$-majority that occurs in a range) and obtain space and time bounds similar to those for $\alpha$-majorities. We extend the structure to find $\Theta(1/\alpha)$ $\alpha$-minorities at the same space and time cost. By giving up updates, we obtain static data structures with query time $O((1/\alpha) \lg\lg_w \sigma)$ for both problems, on a RAM with word size $w = \Omega(\lg n)$ bits, without increasing our space bound. Static alternatives reach time $O(1/\alpha)$, but they compress $S$ only to zeroth order entropy ($H_0$) or they only handle small values of $\alpha$, that is, $\lg(1/\alpha) = o(\lg \sigma)$.
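To make the query concrete, the following brute-force sketch reports the range α-majorities by counting symbols in the queried window. It follows the definition stated in the abstract (symbols occurring more than α(j−i+1) times in S[i..j], with 1-based inclusive positions) but takes time linear in the range length; it is a reference baseline, not the compressed dynamic structure described there. The function name is hypothetical.

```python
from collections import Counter


def range_alpha_majorities(seq, i, j, alpha):
    """Return, in sorted order, the symbols that occur more than
    alpha * (j - i + 1) times in seq[i..j] (1-based, inclusive).
    """
    if not (1 <= i <= j <= len(seq)):
        raise ValueError("expected 1 <= i <= j <= len(seq)")
    if not (0 < alpha < 1):
        raise ValueError("alpha must lie strictly between 0 and 1")
    window = seq[i - 1:j]
    threshold = alpha * (j - i + 1)
    counts = Counter(window)
    return sorted(symbol for symbol, count in counts.items() if count > threshold)
```

Since each reported symbol occupies more than an α fraction of the range, fewer than 1/α symbols can ever be returned, which is why the bounds in the abstract scale with 1/α.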
Party system institutionalization (PSI) is a critical dimension of modern democracies. However, conventional approaches to institutionalization do not include party systems’ ability to adapt and respond to challenges that emanate from society, one of the crucial traits in Huntington’s definition of institutionalization. We discuss conventional approaches to the analysis of PSI. Building upon the idea of social orders put forth by North, Wallis, and Weingast, we argue that the analysis of institutionalization at the level of party systems must consider the system’s ability to provide open access and to include all sectors: that is, the system’s ability to incorporate demands that emanate from society. We propose a new conceptualization and operationalization of PSI, and we present a new data set of PSI indicators for 18 Latin American countries. Finally, we analyze the data to assess the level of PSI and type of party system in each Latin American country.
Word embeddings are known to exhibit stereotypical biases towards gender, race, and religion, among other criteria. Several fairness metrics have been proposed in order to automatically quantify these biases. Although all metrics have a similar objective, the relationship between them is by no means clear. Two issues that prevent a clean comparison are that they operate with different inputs, and that their outputs are incompatible with each other.
In this paper we propose WEFE, the word embeddings fairness evaluation framework, to encapsulate, evaluate and compare fairness metrics. Our framework needs a list of pre-trained embeddings and a set of fairness criteria, and it is based on checking correlations between fairness rankings induced by these criteria. We conduct a case study showing that rankings produced by existing fairness methods tend to correlate when measuring gender bias. This correlation is considerably lower for other biases like race or religion. We also compare the fairness rankings with an embedding benchmark showing that there is no clear correlation between fairness and good performance in downstream tasks.
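The idea of comparing metrics through the rankings they induce can be sketched as follows. This is a minimal illustration, assuming each metric assigns one score per embedding model and that there are no ties; it uses Spearman's rank correlation as one possible correlation measure and does not reproduce the WEFE library's API. The names `ranks` and `spearman` are hypothetical.

```python
def ranks(scores):
    """Rank positions (1 = smallest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    result = [0] * len(scores)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score lists over the same models."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Because only ranks are compared, two metrics with incompatible output scales can still be contrasted.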
We present a system for the task of unsupervised lexical change detection. Given a target word and two corpora spanning different periods of time, the system automatically detects whether the word has lost or gained senses from one corpus to another. Our system employs the temporal referencing method to obtain compatible representations of target words in different periods of time. This is done by concatenating corpora of different periods and performing a temporal referencing of target words, i.e., treating occurrences of target words in different periods as two independent tokens. Afterwards, we train word embeddings on the joint corpus and compare the referenced vectors of each target word using cosine similarity. Our submission was ranked 6th among 33 teams for subtask 1, obtaining an average accuracy of 0.637, only 0.050 points behind the first ranked system.
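The two steps described above, temporal referencing and vector comparison, can be sketched as follows. This is a minimal illustration, assuming tokenized corpora and a suffix convention such as `cell_t1`; the embedding training step is omitted, and the function names are hypothetical.

```python
import math


def temporally_reference(tokens, targets, period):
    """Replace each target word by a period-specific token, e.g. 'cell' -> 'cell_t1'.

    Non-target words are left untouched so that they are shared across periods.
    """
    return [f"{tok}_{period}" if tok in targets else tok for tok in tokens]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

After training embeddings on the concatenation of the referenced corpora, a low cosine similarity between the vectors of `word_t1` and `word_t2` would suggest a change in usage.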
GraphQL is a novel language for specifying and querying web APIs, allowing clients to flexibly and efficiently retrieve data of interest. The GraphQL language specification is unfortunately only available in prose, making it hard to develop robust formal results for this language. Recently, Hartig and Pérez proposed a formal semantics for GraphQL in order to study the complexity of GraphQL queries. The semantics is however not mechanized and leaves certain key aspects unverified. We present GraphCoQL, the first mechanized formalization of GraphQL, developed in the Coq proof assistant. GraphCoQL covers the schema definition DSL, query definitions, validation of both schema and queries, as well as the semantics of queries over a graph data model. We illustrate the application of GraphCoQL by formalizing the key query transformation and interpretation techniques of Hartig and Pérez, and proving them correct, after addressing some imprecisions and minor issues. We hope that GraphCoQL can serve as a solid formal baseline for both language design and verification efforts for GraphQL.
Compiler correctness is, in its simplest form, defined as the inclusion of the set of traces of the compiled program into the set of traces of the original program, which is equivalent to the preservation of all trace properties. Here traces collect, for instance, the externally observable events of each execution. This definition requires, however, the set of traces of the source and target languages to be exactly the same, which is not the case when the languages are far apart or when observations are fine-grained. To overcome this issue, we study a generalized compiler correctness definition, which uses source and target traces drawn from potentially different sets and connected by an arbitrary relation. We set out to understand what guarantees this generalized compiler correctness definition gives us when instantiated with a non-trivial relation on traces. When this trace relation is not equality, it is no longer possible to preserve the trace properties of the source program unchanged. Instead, we provide a generic characterization of the target trace property ensured by correctly compiling a program that satisfies a given source property, and dually, of the source trace property one is required to show in order to obtain a certain target property for the compiled code. We show that this view on compiler correctness can naturally account for undefined behavior, resource exhaustion, different source and target values, side-channels, and various abstraction mismatches. Finally, we show that the same generalization also applies to many secure compilation definitions, which characterize the protection of a compiled program against linked adversarial code.
Current static verification techniques do not provide good support for incrementality, making it difficult for developers to focus on specifying and verifying the properties and components that are most important. Dynamic verification approaches support incrementality, but cannot provide static guarantees. To bridge this gap, prior work proposed gradual verification, which supports incrementality by allowing every assertion to be complete, partial, or omitted, and provides sound verification that smoothly scales from dynamic to static checking. The prior approach to gradual verification, however, was limited to programs without recursive data structures. This paper extends gradual verification to programs that manipulate recursive, mutable data structures on the heap. We address several technical challenges, such as semantically connecting iso- and equi-recursive interpretations of abstract predicates, and supporting gradual verification of heap ownership. This work thus lays the foundation for future tools that work on realistic programs and support verification within an engineering process in which cost-benefit trade-offs can be made.
A vast amount of geo-referenced data is being generated by mobile devices and other sensors, increasing the importance of spatio-textual analyses on such data. Due to the large volume of data, the use of indexes to speed up the queries that facilitate such analyses is imperative. Many disk-resident indexes have been proposed for different types of spatial keyword queries, but their efficiency is harmed by their high I/O costs. In this work, we propose cBiK, the first spatio-textual index that uses compact data structures to reduce the size of the structure, hence facilitating its usage in main memory. Our experimental evaluation shows that this approach needs half the space and is more than one order of magnitude faster than a disk-resident state-of-the-art index. We also show that our approach is competitive even in a scenario where the disk-resident data structure is warmed up to fit in main memory.
Review of “The Little Prover” by Daniel P. Friedman and Carl Eastlund, MIT Press, 2015.
Many succinct tree encodings exist, but it is known that Fully Functional is, in general, the best solution in practice at the moment. Like other solutions, it encodes a tree as a balanced parentheses sequence, so that navigating the tree is equivalent to navigating the parentheses sequence. One of the key operations for this is to find the matching parenthesis of a given one. The Fully Functional representation supports this query in constant time, but in practice it is implemented in time logarithmic in the distance between the queried parenthesis and its matching one, which has proven to be fast enough. The distance between the parentheses of a node is usually larger for nodes closer to the tree’s root, as they tend to have more descendants. In other words, every node increases the parentheses distance of its ancestors, thus affecting their query time detrimentally. This leads to shallower trees having lower average query time, since nodes will have fewer ancestors.
We exploit this property in the encoding of Ferres et al. for planar embeddings. This encoding works by storing a spanning tree T of the graph G, a spanning tree T′ of the dual of G where the edges in T′ are dual to the edges in G − T, and a bitvector representing the interleaving between the two trees in a counter-clockwise DFS traversal of G. This encoding allows navigation queries on the primal and dual of G. Originally, T is obtained with a DFS traversal of G, but by instead obtaining T with a BFS traversal of G or its dual we can decrease the height of both trees, consequently decreasing the time of navigation operations on the embedding by about 20% to 50%. We also obtain further speedups on queries on G by performing the BFS on G, while further speedups on queries on its dual can be obtained by performing the BFS on the dual. As future work, it would be interesting to analyze how other succinct tree encodings, such as LOUDS, are affected by the tree topology. In addition, we could study the effect of using flatter trees, when possible, in other compact data structures using succinct trees.
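The relation between tree shape and matching distance can be illustrated with a small sketch. It assumes a tree given as a hypothetical `children` dictionary and finds matching parentheses by a plain linear scan, not by the Fully Functional structure itself.

```python
def to_balanced_parentheses(children, root):
    """Encode a rooted ordered tree as a balanced parentheses string (DFS order)."""
    out = []

    def visit(node):
        out.append("(")
        for child in children.get(node, []):
            visit(child)
        out.append(")")

    visit(root)
    return "".join(out)


def find_close(bp, i):
    """Index of the ')' matching the '(' at index i, found by a linear scan."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def average_match_distance(bp):
    """Mean distance between each '(' and its matching ')'."""
    opens = [i for i, c in enumerate(bp) if c == "("]
    return sum(find_close(bp, i) - i for i in opens) / len(opens)
```

For four nodes, a path encodes as `(((())))` with average distance 4.0, while a star encodes as `(()()())` with average distance 2.5, which is consistent with the observation that shallower trees yield shorter matching distances.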
In this paper we consider the problem of storing sequences of symbols in a compressed format, while supporting random access to the symbols without decompression. Although this is a well-studied problem when the data is textual, the kind of sequences we look at are not textual, and we argue that traditional compression methods used in the text algorithms community (such as compressors targeting k-th order empirical entropy) do not perform as well on these sequential data, and simpler methods such as Huffman-coding the deltas between sequence elements give better compression performance. We discuss data structures that allow random access to sequence elements that target such measures.
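As a rough illustration of why coding deltas can help on non-textual sequences, the sketch below compares the zeroth-order empirical entropy of a sequence with that of its deltas; a Huffman code uses less than one bit per symbol more than this entropy. The example data and function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def deltas(seq):
    """Keep the first element, then store successive differences."""
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]


def empirical_entropy_h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())
```

For a steadily increasing sequence such as `[100, 102, 104, ...]`, every raw value is distinct, whereas almost all deltas are equal, so the entropy of the deltas is much lower.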
The median string problem is NP-hard under several formulations, and the most competitive heuristics are those using perturbation-based iterative algorithms. The initial string and the policy to order possible edit operations are key to the efficiency of such approaches. In this work, we tackle both sub-problems. We hypothesize that a better starting point for the algorithm can reduce the number of edit distances computed to obtain the median string, improving time performance without degrading the quality of the solution. Regarding the starting point, we use the median of a few strings of the input, which are selected as the Half Space Proximal (HSP) neighbors of the median of the set. The HSP neighbors are simultaneously close to the center and diverse among themselves. To validate our approach, we present comparative experiments, attending mainly to the quality of the median obtained, the time to compute such median, and the number of edit distances computed.
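The objective being minimized can be sketched as follows. This is a minimal illustration using the Levenshtein distance with unit costs and the set median (the input string with the smallest total distance) as a simple starting point; it does not implement the HSP-based selection described in the abstract, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def total_distance(candidate, strings):
    """Sum of edit distances from candidate to every input string."""
    return sum(edit_distance(candidate, s) for s in strings)


def set_median(strings):
    """Input string minimizing the total distance to all the others."""
    return min(strings, key=lambda s: total_distance(s, strings))
```

Each call to `total_distance` costs one edit distance per input string, which is why reducing the number of candidates evaluated matters for running time.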
COVID-19 has brought about a significant challenge to the whole of humanity, with a special burden upon the medical community. Clinicians must keep continuously updated about symptoms, diagnoses, and the effectiveness of emergent treatments under a never-ending flood of scientific literature. In this context, the role of evidence-based medicine (EBM) in curating the most substantial evidence to support public health and clinical practice becomes essential, but it is being challenged as never before due to the high volume of research articles published and pre-prints posted daily. Artificial Intelligence can have a crucial role in this situation. In this article, we report the results of an applied research project to classify scientific articles to support Epistemonikos, one of the most active foundations worldwide conducting EBM. We test several methods, and the best one, based on the XLNet neural language model, improves the current approach by 93% on average F1-score, saving valuable time for physicians who volunteer to curate COVID-19 research articles manually.
Wikipedia is edited by volunteer editors around the world. Considering the large amount of existing content (e.g. over 5M articles in English Wikipedia), deciding what to edit next can be difficult, both for experienced users, who usually have a huge backlog of articles to prioritize, and for newcomers, who might need guidance in selecting the next article to contribute to. Therefore, helping editors to find relevant articles should improve their performance and help in the retention of new editors. In this paper, we address the problem of recommending relevant articles to editors. To do this, we develop a scalable system on top of Graph Convolutional Networks and Doc2Vec, learning how to represent Wikipedia articles and deliver personalized recommendations for editors. We test our model on editors’ histories, predicting their most recent edits based on their prior edits. We outperform competitive implicit-feedback collaborative-filtering methods such as WMRF based on ALS, as well as a traditional IR method such as content-based filtering based on BM25. All of the data used in this paper is publicly available, including graph embeddings for Wikipedia articles, and we release our code to support replication of our experiments. Moreover, we contribute a scalable implementation of a state-of-the-art graph embedding algorithm, as current ones cannot efficiently handle the sheer size of the Wikipedia graph.
Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time–space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how.
We show that there is a wide range of practical time–space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact on query processing time.
Although there are several visually-aware recommendation models in domains like fashion or even movies, the art domain lacks the same level of research attention, despite the recent growth of the online artwork market. To reduce this gap, in this article we introduce CuratorNet, a neural network architecture for visually-aware recommendation of art images. CuratorNet is designed at the core with the goal of maximizing generalization: the network has a fixed set of parameters that only need to be trained once, and thereafter the model is able to generalize to new users or items never seen before, without further training. This is achieved by leveraging visual content: items are mapped to item vectors through visual embeddings, and users are mapped to user vectors by aggregating the visual content of items they have consumed. Besides the model architecture, we also introduce novel triplet sampling strategies to build a training set for rank learning in the art domain, resulting in more effective learning than naive random sampling. With an evaluation over a real-world dataset of physical paintings, we show that CuratorNet achieves the best performance among several baselines, including the state-of-the-art model VBPR. CuratorNet is motivated and evaluated in the art domain, but its architecture and training scheme could be adapted to recommend images in other areas.
The success of pre-trained word embeddings has motivated their use in tasks in the biomedical domain. The BERT language model has shown remarkable results on standard performance metrics in tasks such as Named Entity Recognition (NER) and Semantic Textual Similarity (STS), which has brought significant progress in the field of NLP. However, it is unclear whether these systems work equally well in critical domains, such as legal or medical. For that reason, in this work, we propose an adversarial evaluation scheme on two well-known datasets for medical NER and STS. We propose two types of attacks inspired by natural spelling errors and typos made by humans. We also propose another type of attack that uses synonyms of medical terms. Under these adversarial settings, the accuracy of the models drops significantly, and we quantify the extent of this performance loss. We also show that we can significantly improve the robustness of the models by training them with adversarial examples. We hope our work will motivate the use of adversarial examples to evaluate and develop models with increased robustness for medical tasks.
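A typo-style perturbation of the kind mentioned above can be sketched as follows. This is a minimal illustration that swaps two adjacent inner characters of each sufficiently long word; it is an assumed example of a spelling-error attack, not the exact attacks used in the paper, and the function names and the `min_len` parameter are hypothetical.

```python
import random


def swap_adjacent(word, pos):
    """Swap the characters at positions pos and pos + 1."""
    chars = list(word)
    chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


def typo_attack(sentence, rng, min_len=4):
    """Perturb every word of at least min_len characters with one inner swap.

    The first and last characters are kept in place, which tends to keep the
    word recognizable to a human reader.
    """
    out = []
    for word in sentence.split():
        if len(word) >= min_len:
            pos = rng.randrange(1, len(word) - 2)
            word = swap_adjacent(word, pos)
        out.append(word)
    return " ".join(out)
```

Passing a seeded generator such as `random.Random(0)` makes the perturbed test set reproducible across evaluation runs.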
Explaining suggestions made by recommendation systems is key to making users trust and accept these systems. This is especially critical in areas such as art image recommendation. Traditionally, artworks are sold in galleries where people can see them physically, and artists have the chance to persuade people to buy them. On the other hand, online art stores only offer the user the action of navigating through the catalog, but nobody plays the persuading role of the artist. Moreover, few works in recommendation systems provide a perspective of the many variables involved in the user perception of several aspects of the system such as domain knowledge, relevance, explainability, and trust. In this article, we aim to fill this gap by studying several aspects of the user experience with a recommender system of artistic images, from algorithmic and HCI perspectives. We conducted two user studies in Amazon Mechanical Turk to evaluate different levels of explainability, combined with different algorithms. While in study 1 we focus only on a desktop interface, in study 2 we attempt to understand the effect of explanations in mobile devices.
In general, our experiments confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability and relevance. In the first study, our results show that the observed effects are dependent on the underlying recommendation algorithm used. In the second study, our results show that these effects are also dependent on the device used in the study, but with a smaller effect. Finally, using the framework by Knijnenburg et al., we provide a comprehensive model, for each study, which synthesizes the effects between different variables involved in the user experience with explainable visual recommender systems of artistic images.
Several deep learning architectures have been proposed over the last years to deal with the problem of generating a written report given an imaging exam as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g. BLEU, ROUGE), reporting significant progress. In this article, we contrast this progress by comparing state of the art (SOTA) models against weak baselines. We show that simple and even naive approaches yield near SOTA performance on most traditional NLP metrics. We conclude that evaluation methods in this task should be further studied towards correctly measuring clinical accuracy, ideally involving physicians to contribute to this end.
The success of pre-trained word embeddings of the BERT model has motivated its use in tasks in the biomedical domain. However, it is not clear if this model works correctly in real scenarios. In this work, we propose an adversarial evaluation scheme in a BioNER dataset, which consists of two types of attacks inspired by natural spelling errors and synonyms of medical terms. Our results indicate that under these adversarial settings, the performance of the models drops significantly. Despite the result, we show how the robustness of the models can be significantly improved by training them with adversarial examples.
Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches have tried to reduce physicians’ workload of screening and labeling vast amounts of documents to answer clinical questions. Previous works tried to semi-automate document screening, reporting promising results, but their evaluation was conducted on small datasets, which hinders generalization. Moreover, recent works in natural language processing have introduced neural language models, but none have compared their performance in EBM. In this paper, we evaluate the impact of several document representations such as TF-IDF along with neural language models (BioBERT, BERT, Word2Vec, and GloVe) on an active learning-based setting for document screening in EBM. Our goal is to reduce the number of documents that physicians need to label to answer clinical questions. We evaluate these methods using both a small challenging dataset (CLEF eHealth 2017) and a larger one that is easier to rank (Epistemonikos). Our results indicate that word as well as textual neural embeddings always outperform the traditional TF-IDF representation. When comparing among neural and textual embeddings, in the CLEF eHealth dataset the models BERT and BioBERT yielded the best results. On the larger dataset, Epistemonikos, Word2Vec and BERT were the most competitive, showing that BERT was the most consistent model across different corpora. In terms of active learning, an uncertainty sampling strategy combined with logistic regression achieved the best performance overall, above other methods under evaluation, and in fewer iterations. Finally, we compared the results of evaluating our best models, trained using active learning, with other authors’ methods from CLEF eHealth, showing better results in terms of work saved for physicians in the document-screening task.
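The selection step of uncertainty sampling can be sketched as follows. This is a minimal illustration, assuming a binary relevance classifier has already produced a probability per document; the function name, the `probabilities` dictionary and the document ids are hypothetical.

```python
def uncertainty_sampling(probabilities, labeled, batch_size=1):
    """Pick the unlabeled documents whose predicted relevance is closest to 0.5.

    probabilities: dict mapping document id -> predicted probability of relevance
    labeled: set of document ids that already have a label
    """
    candidates = [d for d in probabilities if d not in labeled]
    candidates.sort(key=lambda d: abs(probabilities[d] - 0.5))
    return candidates[:batch_size]
```

In an active learning loop, the selected documents would be labeled by a physician, added to the training set, and the classifier retrained before the next selection.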
Community Question Answering (cQA) sites have emerged as platforms designed specifically for the exchange of questions and answers among communities of users. Although users tend to find good quality answers in cQA sites, there is evidence that they also engage in a significant volume of QA in other types of social sites, such as microblog platforms. Research indicates that users opt for these non-specific QA social networks because they contain up-to-date information on current events, and also because of their rapid information propagation and social trust. In this sense, we propose that microblog platforms can emerge as a novel, valuable source of information for QA information retrieval tasks. However, we have found that it is not straightforward to transfer existing approaches for automatically retrieving relevant answers in traditional cQA platforms for use in microblogs. This occurs because there are unique characteristics that differentiate microblog data from that of traditional cQA, such as noise and very short text length. In this work, we study (1) whether microblog data can be used to automatically provide relevant answers for the QA task, and, in addition, (2) which features contribute the most to finding relevant answers for a particular query. In particular, we introduce a conversation (thread)-level document model, as well as a machine learning ranking framework for microblog QA. We validate our proposal by using factoid-QA as a proxy task, showing that Twitter conversations can indeed be used to automatically provide relevant results for QA. We are able to identify the importance of the different features that contribute the most to QA ranking. In addition, we provide evidence that our method allows us to retrieve complex answers in the domain of non-factoid questions.
Probabilistic automata are an extension of nondeterministic finite automata in which transitions are annotated with probabilities. Despite its simplicity, this model is very expressive and many algorithmic questions are undecidable. In this work we focus on the emptiness problem (and its variant the value problem), which asks whether a given probabilistic automaton accepts some word with probability greater than a given threshold. We consider finitely ambiguous probabilistic automata.
Our main contributions are to construct efficient algorithms for analysing finitely ambiguous probabilistic automata through a reduction to a multi-objective optimisation problem called the stochastic path problem. We obtain a polynomial time algorithm for approximating the value of probabilistic automata of fixed ambiguity and a quasi-polynomial time algorithm for the emptiness problem for 2-ambiguous probabilistic automata.
We complement these positive results by an inapproximability result stating that the value of finitely ambiguous probabilistic automata cannot be approximated unless P=NP.
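The quantity at stake, the probability with which an automaton accepts a word, can be sketched as follows. This is a minimal illustration with a hypothetical dictionary-based representation; the bounded search at the end only inspects words up to a given length, so it is not a decision procedure for the emptiness problem.

```python
from itertools import product


def acceptance_probability(initial, transitions, accepting, word):
    """Probability that the automaton accepts `word`.

    initial: dict state -> initial probability
    transitions: dict (state, symbol) -> dict next_state -> probability
    accepting: set of accepting states
    """
    dist = dict(initial)
    for symbol in word:
        nxt = {}
        for state, p in dist.items():
            for target, q in transitions.get((state, symbol), {}).items():
                nxt[target] = nxt.get(target, 0.0) + p * q
        dist = nxt
    return sum(p for state, p in dist.items() if state in accepting)


def some_word_exceeds(initial, transitions, accepting, alphabet, threshold, max_len):
    """Return a word of length <= max_len accepted with probability > threshold, if any."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            word = "".join(symbols)
            if acceptance_probability(initial, transitions, accepting, word) > threshold:
                return word
    return None
```

For an automaton that moves from `s` to the accepting state `t` with probability 0.5 on each `a`, the acceptance probability of `a`, `aa`, `aaa` is 0.5, 0.75 and 0.875, so the value approaches 1 without being attained by any single word.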
Complex event processing (CEP) has gained a lot of attention for evaluating complex patterns over high-throughput data streams. Recently, new algorithms for the evaluation of CEP patterns have emerged with strong guarantees of efficiency, i.e. constant update-time per tuple and constant-delay enumeration. Unfortunately, these techniques are restricted to patterns with local filters, limiting the possibility of using joins for correlating the data of events that are far apart.
In this paper, we embark on the search for efficient evaluation algorithms of CEP patterns with joins. We start by formalizing the so-called partition-by operator, a standard operator in data stream management systems to correlate contiguous events on streams. Although this operator is a restricted version of a join query, we show that partition-by (without iteration) is equally expressive as hierarchical queries, the biggest class of full conjunctive queries that can be evaluated with constant update-time and constant-delay enumeration over streams. To evaluate queries with partition-by we introduce an automata model, called chain complex event automata (chain-CEA), an extension of complex event automata that can compare data values by using equalities and disequalities. We show that this model admits determinization and is expressive enough to capture queries with partition-by. More importantly, we provide an algorithm with constant update time and constant delay enumeration for evaluating any query definable by chain-CEA, showing that all CEP queries with partition-by can be evaluated with these strong guarantees of efficiency.
We present the theoretical foundations of a new approach in centrality measures for graph data. The main principle of our approach is very simple: the more relevant subgraphs around a vertex, the more central it is in the network. We formalize the notion of «relevant subgraphs» by choosing a family of subgraphs that, given a graph G and a vertex v in G, assigns a subset of connected subgraphs of G that contain v. Any such family defines a measure of centrality by counting the number of subgraphs assigned to the vertex, i.e., a vertex will be more important for the network if it belongs to more subgraphs in the family. We show many examples of this approach and, in particular, we propose the all-subgraphs centrality, a centrality measure that takes every subgraph into account. We study fundamental properties over families of subgraphs that guarantee desirable properties over the corresponding centrality measure. Interestingly, all-subgraphs centrality satisfies all these properties, showing its robustness as a notion for centrality. Finally, we study the computational complexity of counting certain families of subgraphs and show a polynomial time algorithm to compute the all-subgraphs centrality for graphs with bounded tree width.
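The counting principle can be illustrated by brute force on very small graphs. The sketch below assumes that a subgraph is a set of edges together with their endpoints (plus the single-vertex subgraph), which is one possible reading of the definition and may differ from the paper's formalization; it takes exponential time and is unrelated to the polynomial algorithm for bounded tree width. The function names are hypothetical.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Check that `edges` connect all of `vertices` (undirected graph)."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == vertices


def count_connected_subgraphs_containing(edges, v):
    """Count connected subgraphs that contain vertex v, by enumerating edge subsets."""
    count = 1  # the subgraph consisting of v alone
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            vertices = {x for e in subset for x in e}
            if v in vertices and is_connected(vertices, subset):
                count += 1
    return count
```

On the path a-b-c, the middle vertex b belongs to 4 such subgraphs and the endpoint a to 3, so b is ranked as more central under this count.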
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and regular complex event processing queries and adds ranking to them. The main technical result of the paper is an algorithm for efficiently enumerating all the solutions of formulae in increasing order of cost, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in using functional data structures, in particular, by extending functional Brodal queues to suit the ranked enumeration of MSO on words.
Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In the last decade, the task of extracting data from nested documents over streams has become especially relevant. We focus on the streaming evaluation of queries with outputs of varied sizes over nested documents. We model queries of this kind as Visibly Pushdown Transducers (VPT), a computational model that extends visibly pushdown automata with outputs and has the same expressive power as MSO over nested documents. Since processing a document through a VPT can generate a massive number of results, we are interested in reading the input in a streaming fashion and enumerating the outputs one after another as efficiently as possible, namely, with constant-delay. This paper presents an algorithm that enumerates these elements with constant-delay after processing the document stream in a single pass. Furthermore, we show that this algorithm is worst-case optimal in terms of update-time per symbol and memory usage.
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner’s result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
Book review: «Kevin LaGrandeur, James J. Hughes (eds) (2017) Surviving the Machine Age. Intelligent Technology and the Transformation of Human Work. Cham: Palgrave Macmillan. 166 pages. ISBN: 978-3-319-84584-5».
Differential privacy is a framework that provides formal tools to develop algorithms to access databases and answer numerical and statistical queries with quantifiable accuracy and privacy guarantees. The notions of differential privacy are defined independently of the data model and the query language. Most results have been on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. The data model has typically been the relational model and the query language SQL. However, good realizations of differential privacy for queries that require joins have been limited. This has imposed severe restrictions on applying differential privacy in RDF knowledge graphs and SPARQL. By the simple nature of RDF data, most interesting queries accessing RDF graphs will require intensive use of joins. Recently though, new techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new definitions can be transferred to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that answers count queries over a large class of SPARQL queries while guaranteeing differential privacy, provided that the RDF graph is accompanied by some natural semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of differential privacy for SPARQL and RDF.
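As a point of reference for the abstract above, the following is a minimal sketch of the textbook Laplace mechanism for a single count query. It is not the algorithm of the paper: the function names are hypothetical, and the `sensitivity` argument is assumed to be supplied by the caller (for queries with joins, bounding it is the hard part that the paper addresses).

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # If X, Y ~ Exp(rate = 1/scale) are independent, X - Y ~ Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, sensitivity, epsilon, rng):
    """Return a noisy count using the standard Laplace mechanism.

    Assumption: `sensitivity` bounds how much the count can change when
    one individual's data is added or removed.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    print(private_count(1000, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, that is, stronger privacy and lower accuracy.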
This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map is proposed, from Euclidean parallel space to embedded fractal space , that maps in time and uses no more than threads with being the Hausdorff dimension of the fractal, making it parallel space efficient. When compared to a bounding-box (BB) approach, offers a sub-exponential improvement in parallel space and a monotonically increasing speedup . The Sierpinski gasket fractal is used as a particular case study and the experimental performance results show that reaches up to of speedup over the bounding-box approach. A tensor-core based implementation of is also proposed for modern GPUs, providing up to of extra performance. The results obtained in this work show that doing efficient GPU thread mapping on fractal domains can significantly improve the performance of several applications that work with this type of geometry.
We propose techniques that support the efficient computation of multidimensional similarity joins in an RDF/SPARQL setting, where similarity in an RDF graph is measured with respect to a set of attributes selected in the SPARQL query. While similarity joins have been studied in other contexts, RDF graphs present unique challenges. We discuss how a similarity join operator can be included in the SPARQL language, and investigate ways in which it can be implemented and optimised. We devise experiments to compare three similarity join algorithms over two datasets. Our results reveal that our techniques outperform DBSimJoin: a PostgreSQL extension that supports similarity joins.
Similarity join is a key operation in metric databases. It retrieves all pairs of elements that are similar. Solving such a problem usually requires comparing every pair of objects of the datasets, even when indexing and ad hoc algorithms are used. We propose a simple and efficient algorithm for the computation of the approximated nearest neighbor self-similarity join. This algorithm computes distances and is empirically shown to reach a precision of 46% on real-world datasets. We provide a comparison to other common techniques such as Quickjoin and Locality-Sensitive Hashing and argue that our proposal has a better execution time and average precision.
With the growing number of digital collections of visual cultural heritage (CH) data being available across different repositories, it becomes increasingly important to provide archaeologists with means to find relations and cross-correspondences between different digital records. In principle, existing shape- and image-based similarity search methods can aid such domain analysis tasks. However, in practice, visual object data are given in different modalities, and often only in an incomplete or fragmented state, posing a particular challenge for conventional similarity search approaches. In this paper we introduce a methodology and system for cross-modal visual search in CH object data that addresses these challenges. Specifically, we propose a new query modality based on 3D views enhanced by user sketches (3D+sketch). This allows for adding new context to the search, which is useful, e.g., for searching based on incomplete query objects, or for testing hypotheses on the existence of certain shapes in a collection. We present an appropriately designed workflow for constructing query views from incomplete 3D objects enhanced by a user sketch, based on shape completion and texture inpainting. Visual cues additionally help users compare retrieved objects with the query. The proposed approach extends a previously presented retrieval system by introducing improved retrieval methods, an extended evaluation including retrieval in a larger and richer data collection, and enhanced interactive search weight specification. We demonstrate the feasibility and potential of our approach to support the analysis of domain experts in Archaeology and the field of CH in general.
Nowadays, multimedia information such as images and videos is present in many aspects of our lives. Three-dimensional information is also becoming important in different applications, for instance, entertainment, medicine, security, and art, just to name a few. It is therefore necessary to study how to properly process 3D information, taking advantage of the properties that it provides. This chapter gives an overview of 3D shape matching and its applications in shape retrieval and recognition. In order to present the subject, we opted for describing in detail four approaches with a good balance between maturity and novelty, namely, the PANORAMA descriptor, spin images, functional maps, and Heat Kernel Signatures for retrieval. We also aim at stressing the importance of this field in areas such as computer vision and computer graphics, as well as the importance of addressing the main challenges in this research field.
Online social networks are a rich resource of unedited user-generated multimedia content. Buried within
their day-to-day chatter, we can find breaking news, opinions and valuable insight into human behaviour,
including the articulation of emerging social movements. Nevertheless, in recent years social platforms
have become fertile ground for diverse information disorders and hate speech expressions. This situation
poses an important challenge to the extraction of useful and trustworthy information from social media.
In this talk I provide an overview of existing work in the area of social media information credibility,
starting with our research in 2011 on rumor propagation during the massive earthquake in Chile in
2010 [1]. I discuss, as well, the complex problem of automatic hate speech detection in online social
networks. In particular, I discuss how our review of the existing literature in the area shows important experimental
errors and dataset biases that produce an overestimation of current state-of-the-art techniques [2].
Specifically, these issues become evident at the moment of attempting to apply these models to more
diverse scenarios or to transfer this knowledge to languages other than English.
As a particular way of dealing with the need to extract reliable information from online social
media, I talk about two applications, Twically [3] and Galean [4]. These applications harvest collective
signals created from social media text to provide a broad view of natural disasters and real-world news,
respectively.
Complex Event Recognition (CER for short) has recently gained attention as a mechanism for detecting patterns in streams of continuously arriving event data. Numerous CER systems and languages have been proposed in the literature, commonly based on combining operations from regular expressions (sequencing, iteration, and disjunction) and relational algebra (e.g., joins and filters). While these languages are naturally first-order, meaning that variables can only bind single elements, they also provide capabilities for filtering sets of events that occur inside iterative patterns; for example, requiring sequences of numbers to be increasing. Unfortunately, these types of filters usually present ad-hoc syntax and under-defined semantics, precisely because variables cannot bind sets of events. As a result, CER languages that provide filtering of sequences commonly lack rigorous semantics and their expressive power is not understood.
In this paper we embark on two tasks: first, to define a denotational semantics for CER that naturally allows binding and filtering sets of events; and second, to compare the expressive power of this semantics with that of CER languages that only allow for binding single events. Concretely, we introduce Set-Oriented Complex Event Logic (SO-CEL for short), a variation of the CER language introduced in [Grez et al., 2019] in which all variables bind to sets of matched events. We then compare SO-CEL with CEL, the CER language of [Grez et al., 2019] where variables bind single events. We show that they are equivalent in expressive power when restricted to unary predicates but, surprisingly, incomparable in general. Nevertheless, we show that if we restrict to sets of binary predicates, then SO-CEL is strictly more expressive than CEL. To get a better understanding of the expressive power, computational capabilities, and limitations of SO-CEL, we also investigate the relationship between SO-CEL and Complex Event Automata (CEA), a natural computational model for CER languages. We define a property on CEA called the *-property and show that, under unary predicates, SO-CEL captures precisely the subclass of CEA that satisfy this property. Finally, we identify the operations that SO-CEL is lacking to characterize CEA and introduce a natural extension of the language that captures the complete class of CEA under unary predicates.
We study a variant of the classical membership problem in automata theory, which consists of deciding whether a given input word is accepted by a given automaton. We do so under a different perspective, that is, we consider a dynamic version of the problem, called monitoring problem, where the automaton is fixed and the input is revealed as in a stream, one symbol at a time following the natural order on positions. The goal here is to design a dynamic data structure that can be queried about whether the word consisting of symbols revealed so far is accepted by the automaton, and that can be efficiently updated when the next symbol is revealed. We provide complexity bounds for this monitoring problem, by considering timed automata that process symbols interleaved with timestamps. The main contribution is that monitoring of a one-clock timed automaton, with all its components but the clock constants fixed, can be done in amortised constant time per input symbol.
We present pumping lemmas for five classes of functions definable by fragments of weighted automata over the min-plus semiring, the max-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus and max-plus semirings.
Linear algebra algorithms often require some sort of iteration or recursion as is illustrated by standard algorithms for Gaussian elimination, matrix inversion, and transitive closure. A key characteristic shared by these algorithms is that they allow looping for a number of steps that is bounded by the matrix dimension. In this paper we extend the matrix query language MATLANG with this type of recursion, and show that this suffices to express classical linear algebra algorithms. We study the expressive power of this language and show that it naturally corresponds to arithmetic circuit families, which are often said to capture linear algebra. Furthermore, we analyze several sub-fragments of our language, and show that their expressive power is closely tied to logical formalisms on semiring-annotated relations.
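To illustrate the kind of recursion described in the abstract above, here is a small sketch of transitive closure computed with a loop whose number of steps is bounded by the matrix dimension. It is plain Python over the Boolean semiring, not MATLANG syntax, and the function name is hypothetical.

```python
def transitive_closure(adj):
    """Boolean transitive closure of a square 0/1 adjacency matrix.

    The loop runs exactly n times, where n is the matrix dimension,
    which mirrors the dimension-bounded iteration the abstract describes.
    """
    n = len(adj)
    reach = [[bool(adj[i][j]) for j in range(n)] for i in range(n)]
    for _ in range(n):
        # reach <- reach OR (reach * adj), with AND/OR as product/sum.
        reach = [
            [
                reach[i][j] or any(reach[i][k] and adj[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return reach

if __name__ == "__main__":
    chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # edges 0 -> 1 -> 2
    for row in transitive_closure(chain):
        print(row)
```

After t iterations the matrix records all paths of length at most t + 1, so n iterations suffice for any graph on n nodes.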
Time series are a common and well-known way of describing temporal data. However, most of the state-of-the-art techniques for analysing time series have focused on generating a representation for a single level of resolution. To analyse a time series at several levels of resolution, one would need to compute different representations, one for each resolution level. We introduce a multi-resolution representation for time series based on local trends and mean values. We require the level of resolution as a parameter, but it can be automatically computed if we consider the maximum resolution of the time series. Our technique represents a time series using trend-value pairs on each segment belonging to a resolution level. To provide a useful representation for data mining tasks, we also propose dissimilarity measures and a symbolic representation based on the SAX technique for efficient similarity search using a multi-resolution indexing scheme. We evaluate our method for classification and discord discovery tasks over a diversity of data domains, achieving a better performance in terms of efficiency and effectiveness compared with some of the best-known classic techniques. Indeed, for some of the experiments, the time series mining algorithms using our multi-resolution representation were an order of magnitude faster, in terms of distance computations, than the state of the art.
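The idea of trend-value pairs per segment and per resolution level can be sketched as follows. This is only an illustration under stated assumptions, not the representation defined in the paper: the "trend" is taken to be the least-squares slope of a segment, the "value" its mean, level r is assumed to split the series into 2^r equal segments, and all names are hypothetical.

```python
def trend_value(segment):
    """Return (least-squares slope, mean) for one segment."""
    n = len(segment)
    mean_y = sum(segment) / n
    if n == 1:
        return (0.0, mean_y)
    mean_x = (n - 1) / 2
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(segment))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return (num / den, mean_y)

def multi_resolution(series, levels):
    """Trend-value pairs for levels 0..levels-1 (assumed 2**level segments)."""
    representation = []
    for level in range(levels):
        parts = 2 ** level
        if len(series) % parts != 0:
            raise ValueError("series length must be divisible by 2**level")
        size = len(series) // parts
        representation.append(
            [trend_value(series[i * size:(i + 1) * size]) for i in range(parts)]
        )
    return representation

if __name__ == "__main__":
    for level, pairs in enumerate(multi_resolution([0, 1, 2, 3, 4, 5, 6, 7], 3)):
        print(level, pairs)
```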
In this paper we present an overview of our participation in the TRECVID 2020 Video to Text Description Challenge. Specifically, we participated in the Description Generation subtask by extending our recent paper. We address the limitation of previous video captioning methods that have a strong dependency on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. We consider syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, that depends only on the video, with a syntactic representation that depends only on Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. Considering different datasets for training the model, such as VATEX and TGIF, our results rank third among teams on the TRECVID 2020 Challenge for the METEOR and CIDEr-D metrics. We also show that paying more attention to syntax improves the quality of generated descriptions.
Adolescence is a key period in the development of individuals’ news habits, but little is known about the processes involved in news media socialization. This study proposes an integrated model in which the influence of family communication on motivations and behaviors of adolescents in relation to news consumption occurs through the development of personality traits related to information processing (namely, need for cognition and need to evaluate). Structural equation modeling of data from a representative survey of 2,273 adolescents, aged 13 to 17, provides support for the theorized model, such that concept-oriented communication within families is associated with news exposure indirectly, via personality traits and motivations. Thus, the study provides an initial assessment of one way children are socialized to become news enthusiasts and news avoiders. It also provides empirical evidence that information-processing traits are influenced by family communication patterns, confirming what hitherto was theoretical speculation.
We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.
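As background for the abstract above, the following sketch shows plain Front-coding of a sorted string dictionary: each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. This is the basic technique only, not the hierarchical variant or the data structures of the paper, and the function names are hypothetical.

```python
def front_encode(sorted_strings):
    """Encode a sorted list of strings as (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        lcp = 0
        limit = min(len(prev), len(s))
        while lcp < limit and prev[lcp] == s[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded

def front_decode(encoded):
    """Rebuild the original strings from (shared_prefix_length, suffix) pairs."""
    decoded = []
    prev = ""
    for lcp, suffix in encoded:
        s = prev[:lcp] + suffix
        decoded.append(s)
        prev = s
    return decoded

if __name__ == "__main__":
    urls = sorted(["http://a.org/x", "http://a.org/y", "http://b.org/"])
    print(front_encode(urls))
```

URL and URI collections share long prefixes between neighbours in sorted order, which is why this style of encoding compresses them well.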
The analysis of painted pottery is instrumental for understanding ancient Greek society and human behavior of past cultures in Archaeology. A key part of this analysis is the discovery of cross references to establish links and correspondences. However, due to the vast amount of documented images and 3D scans of pottery objects in today’s domain repositories, manual search is very time consuming. Computer-aided retrieval methods are of increasing importance. Most current retrieval systems for this kind of cultural heritage data only allow searching for pottery of similar vessel shape. However, in many cases important similarity cues are given by motifs painted on these vessels. We present an interactive retrieval system that makes use of this information to allow for a motif-driven search in cultural heritage repositories. We address the problem of unsupervised motif extraction for preprocessing and the shape-based similarity search for Greek painted pottery. Our experimental evaluation on relevant repository data demonstrates the effectiveness of our approach on examples of different motifs of interest.
Covering a set of points with k orthogonal bounding boxes is useful for implementing spatio-temporal index structures that are built from a given dataset. In this work, we deal with the problem of covering a set of points with k axis-parallel boxes, under the restriction that the total area enclosed by the boxes must be minimized. To achieve this, we present a novel algorithm that, using dynamic programming techniques, finds the optimal solution for covering a set of points with k bounding boxes where the total sum of the areas of the boxes is minimum. This is compared with the process of generating k bounding boxes every l units of distance, achieving an improvement of about 50% in the useless area covered.
We introduce an efficient approach that aims at supporting content-based queries on the Chilean Virtual Observatory. In particular, we are interested in retrieving relevant information from virtual-observatory tables. This introduces several challenges that make the information-retrieval process harder. We define an algorithm that uses a compressed data structure to obtain the number of occurrences of a string query within each column of a table. This kind of query has been used in the literature to improve search effectiveness in faceted and semantic search, as well as for retrieving information from web tables. We show that, using only 15%-25% of the space of a table, our approach contains the table data (and hence the table can be deleted) and is able to answer queries efficiently in a few milliseconds.
For a sorted set S from a universe [1..u] received under the streaming model (i.e., elements are received one at a time, in sorted order), such that at a given time it contains n elements {x_1, ..., x_n}, and whose characteristic bit vector is C_S = 0(σ_1)11···1 0(σ_2)11···1 ··· 0(σ_g)11···1 (i.e., the set elements are actually arranged in g < n intervals of size ≥ 1), we propose a compact data structure that answers operations select and rank in Θ(lg(g/lg g)) worst-case time, and append in O(1) amortized time, using 2g lg((u-n)/g) + g lg(n/g) + o(g lg lg g) bits of space. The structure is suitable in cases where g ≤ n/2.
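The operations named in the abstract above (append in sorted order, rank, select) can be illustrated on a set stored as runs of consecutive elements. This sketch is not the compact structure of the paper and does not meet its time or space bounds: it uses plain Python lists and binary search over the g runs, and the class and method names are hypothetical.

```python
import bisect

class IntervalSet:
    """Sorted set stored as maximal runs of consecutive integers."""

    def __init__(self):
        self.starts = []  # first element of each run
        self.before = []  # number of set elements preceding each run
        self.n = 0        # total number of elements

    def append(self, x):
        """Add x, which must be larger than every element seen so far."""
        if self.starts:
            last = self.starts[-1] + (self.n - self.before[-1]) - 1
            if x <= last:
                raise ValueError("elements must arrive in increasing order")
            if x == last + 1:  # x extends the current run
                self.n += 1
                return
        self.starts.append(x)  # x opens a new run
        self.before.append(self.n)
        self.n += 1

    def rank(self, x):
        """Number of elements that are <= x."""
        r = bisect.bisect_right(self.starts, x) - 1
        if r < 0:
            return 0
        end = self.before[r + 1] if r + 1 < len(self.before) else self.n
        return self.before[r] + min(x - self.starts[r] + 1, end - self.before[r])

    def select(self, i):
        """The i-th smallest element, with i counted from 1."""
        if not 1 <= i <= self.n:
            raise IndexError("select index out of range")
        r = bisect.bisect_left(self.before, i) - 1
        return self.starts[r] + (i - 1 - self.before[r])

if __name__ == "__main__":
    s = IntervalSet()
    for x in [3, 4, 5, 9, 10, 20]:
        s.append(x)
    print(s.rank(7), s.select(4))
```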
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with an unambiguous identifier in a Knowledge Base. While much work has been done on the topic, we first present the results of a survey that reveal a lack of consensus in the community regarding what forms of mentions in a text and what forms of links the EL task should consider. We argue that no one definition of the Entity Linking task fits all, and rather propose a fine-grained categorization of different types of entity mentions and links. We then re-annotate three EL benchmark datasets – ACE2004, KORE50, and VoxEL – with respect to these categories. We propose a fuzzy recall metric to address the lack of consensus and conclude with fine-grained evaluation results comparing a selection of online EL systems.
CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models, consisting of using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function that induces, on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude fewer parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.
https://doi.org/10.1016/j.cviu.2019.102841
The article presents SHACL2SPARQL, a tool that validates an RDF graph stored as a SPARQL endpoint against possibly recursive SHACL constraints. It is based on the algorithm proposed in [3]. This implementation improves upon the original algorithm with a wider range of natively supported constraint operators, SPARQL query optimization techniques, and a mechanism to explain invalid targets.
We address the problem of estimating when the results of an input SPARQL query over dynamic RDF datasets will change. We evaluate a framework that extracts features from the query and/or from past versions of the target dataset and inputs them into binary classifiers to predict whether or not the results for a query will change at a fixed point in the near future. For this evaluation, we create a gold standard based on 23 versions of Wikidata and a curated collection of 221 SPARQL queries. Our results show that the quality of predictions possible using (only) features based on the query structure and lightweight statistics of the predicate dynamics – though capable of beating a random baseline – is not competitive with results obtained using (more costly to derive) knowledge of the complete historical changes in the query results.
Link: http://ceur-ws.org/Vol-2496/paper1.pdf
Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists of distributing the data across several computers, each of which partially solves the problem at hand. Finally, these partial answers are integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that works with disjoint partitions of data. Our algorithm produces a global clustering solution that achieves competitive performance with respect to centralized approaches. The algorithm works effectively with high-dimensional data, making it advisable for document clustering tasks. Experimental results over five data sets show that our proposal is competitive in terms of quality performance measures when compared to state-of-the-art methods.
Stance classification is the task of automatically identifying a user’s position on a specific topic. The classification of stance may help to understand how people react to a piece of target information, a task that is of interest in areas such as advertising campaigns, brand analytics, and fake news detection, among others. The rise of social media has brought the classification of stance in online social networks into the focus of this task. A number of methods have been designed for this purpose, showing that this problem is hard and challenging. In this work, we explore how to use self-attention models for stance classification. Instead of using attention mechanisms to learn directly from the text, we use self-attention to combine different baselines’ outputs. For a given post, we use the transformer architecture to encode each baseline output, exploiting relationships between baselines and posts. Then, the transformer learns how to combine the outputs of these methods, reaching a consistently better classification than the ones provided by the baselines. We conclude that self-attention models are helpful to learn from baselines’ outputs in a stance classification task.
Let f: [1..n] → [1..n] be a function, and let ℓ: [1..n] → [1..σ] indicate a label assigned to each element of the domain. We design several compact data structures that answer various kinds of summary queries on the labels of paths in f. For example, we can find either the minimum label in f^k(i) for a given i and any k ≥ 0 in a given range [k1..k2], or the minimum label in f^(−k)(i) for a given i and k > 0, using n lg n + n lg σ + o(n lg n) bits and time O(α(n)), where α is the inverse Ackermann function. Within similar space we can count, in time O(lg n / lg lg n), the number of labels within a range, and report each element with such labels in O(lg n / lg lg n) additional time. Several other tradeoffs and possible queries are considered, such as selection, top-r queries and τ-majorities. Finally, we consider queries that allow us to navigate the graph of the function, such as the nearest common successor of two elements, or the nearest successor or predecessor of an element within a range of labels.
SHACL (Shapes Constraint Language) is a specification for describing and validating RDF graphs that has recently become a W3C recommendation. While the language is gaining traction in the industry, algorithms for SHACL constraint validation are still at an early stage. A first challenge comes from the fact that RDF graphs are often exposed as SPARQL endpoints, and therefore only accessible via queries. Another difficulty is the absence of guidelines about the way recursive constraints should be handled. In this paper, we provide algorithms for validating a graph against a SHACL schema, which can be executed over a SPARQL endpoint. We first investigate the possibility of validating a graph through a single query for non-recursive constraints. Then for the recursive case, since the problem has been shown to be NP-hard, we propose a strategy that consists in evaluating a small number of SPARQL queries over the endpoint, and using the answers to build a set of propositional formulas that are passed to a SAT solver. Finally, we show that the process can be optimized when dealing with recursive but tractable fragments of SHACL, without the need for an external solver. We also present a proof-of-concept evaluation of this last approach.
Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they would become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version – BTC-2019 – containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level-domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to gain insights into the content of BTC-2019. We discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.
LINK: https://doi.org/10.1007/978-3-030-30796-7_11
RESOURCE DOI: https://doi.org/10.5281/zenodo.2634588
Despite the growing popularity of knowledge graphs for managing diverse data at large scale, users who wish to pose expressive queries against such graphs are often expected to know (i) how to formulate queries in a language such as SPARQL, and (ii) how entities of interest are described in the graph. In this paper we propose a language that relaxes these expectations; the language’s operators are based on an interactive graph-based exploration that allows non-expert users to simultaneously navigate and query knowledge graphs; we compare the expressivity of this language with SPARQL. We then discuss an implementation of this language that we call RDF Explorer and discuss various desirable properties it has, such as avoiding interactions that lead to empty results. Through a user study over the Wikidata knowledge-graph, we show that users successfully complete more tasks with RDF Explorer than with the existing Wikidata Query Helper, while a usability questionnaire demonstrates that users generally prefer our tool and self-report lower levels of frustration and mental effort.
Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, and in this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries that adapts an existing worst-case optimal join algorithm called Leapfrog Triejoin, and we implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and over a novel benchmark that we propose based on Wikidata, designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.
DOI https://doi.org/10.1007/978-3-030-30793-6_15
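As an illustration of the Leapfrog Triejoin idea mentioned in the abstract above, the following is a minimal sketch of its core building block: a "leapfrog" intersection of several sorted lists, where each list repeatedly seeks forward to the largest key seen so far. It is a simplification under stated assumptions and not the Apache Jena implementation described in the paper: plain Python lists stand in for index iterators, and the name `leapfrog_intersect` is hypothetical.

```python
from bisect import bisect_left


def leapfrog_intersect(lists):
    """Intersect sorted lists of distinct keys by leapfrogging.

    Assumption: each input is a sorted Python list; a real engine would
    use trie or B-tree iterators with an efficient seek operation.
    """
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    k = len(lists)
    pos = [0] * k  # current position inside each list
    # Visit the lists in order of their first key, smallest first.
    order = sorted(range(k), key=lambda i: lists[i][0])
    max_key = lists[order[-1]][0]  # largest key currently pointed at
    p = 0
    out = []
    while True:
        i = order[p]
        key = lists[i][pos[i]]
        if key == max_key:
            # The smallest current key equals the largest: all lists agree.
            out.append(key)
            pos[i] += 1
        else:
            # Seek forward to the first key >= max_key.
            pos[i] = bisect_left(lists[i], max_key, pos[i])
        if pos[i] == len(lists[i]):
            return out  # one list is exhausted, no more matches
        max_key = lists[i][pos[i]]
        p = (p + 1) % k


a = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
b = [0, 2, 6, 7, 8, 9]
c = [2, 4, 5, 8, 10]
common = leapfrog_intersect([a, b, c])
```

The full join algorithm applies this intersection once per query variable, descending into trie-ordered indexes; that part is omitted here.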
For those of us who regularly conduct public opinion research outside of the United States and Europe, it is customary to have to explain whether our findings are “real,” that is, generalizable relationships that advance theory, or some kind of contextual artifact. Infamous Reviewer 2 will ask for an explanation of how context might be affecting the relationships that we are describing, and while it might be irritating to do so, in this case, Reviewer 2 is right. The issue of course is not having to explain how contexts matter, but instead why scholarship examining the US, or certain western countries, is not consistently subject to the same task. In this piece, we advocate for contextualizing public opinion research in all cases, because of course relationships among variables are always context/historic dependent. Rather than being a theoretical shortcoming, we argue that this becomes a theoretical strength: being able to identify the conditions under which our proposed relationships hold and those in which they do not. So, for example, rather than just saying that news consumption is positively related with political participation, as scholars including ourselves have been doing for years, we need to make explicit the news construction conditions under which this is the case, the participatory repertoire being considered, and normative implications of our claims. To engage contextualization, cross-national, cross cultural, cross group and historical comparisons are particularly useful. To build our case for the increasing need for contextualization in political communication research, we will first examine some early comparative research, then we will show some current problematic comparisons, and finally will end with some concluding remarks of the challenges for the field that lie ahead and the benefits of a contextual approach.
https://doi.org/10.1080/10584609.2019.1670897
We show that the topological model, a semantically rich standard to represent GIS data, can be encoded succinctly while efficiently answering a number of topology-related queries. We build on recent succinct planar graph representations so as to encode a model with $m$ edges within $4m + o(m)$ bits and answer various queries relating nodes, edges, and faces in $o(\log\log m)$ time, or any time in $\omega(\log m)$ for a few complex ones.
This paper proposes a practical implementation of an alphabet-partitioning compressed data structure, which represents a string within compressed space and supports the fundamental operations rank and select efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach (which is one of the most efficient approaches in practice). In particular, the time for operation select can be reduced by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), and the distributed-computation processing of rank and select operations. As far as we know, this is the first study about the support of rank/select operations on a distributed-computing environment.
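For readers unfamiliar with the two operations named in the abstract above, the sketch below shows what rank and select compute on a string. It is an uncompressed baseline written only to illustrate the semantics, not the alphabet-partitioning structure of the paper; the class name `NaiveRankSelect` and the 0-based position convention are assumptions made for this example.

```python
from bisect import bisect_left
from collections import defaultdict


class NaiveRankSelect:
    """Uncompressed rank/select over a string (illustration only)."""

    def __init__(self, text):
        # For each symbol, keep the sorted list of positions where it occurs.
        self.positions = defaultdict(list)
        for i, symbol in enumerate(text):
            self.positions[symbol].append(i)

    def rank(self, symbol, i):
        """Number of occurrences of symbol in text[0:i]."""
        return bisect_left(self.positions.get(symbol, []), i)

    def select(self, symbol, j):
        """0-based position of the j-th occurrence (j >= 1), or None."""
        occurrences = self.positions.get(symbol, [])
        return occurrences[j - 1] if 1 <= j <= len(occurrences) else None


index = NaiveRankSelect("abracadabra")
a_before_5 = index.rank("a", 5)    # occurrences of "a" in "abrac"
third_a = index.select("a", 3)     # position of the third "a"
```

This baseline stores one integer per character, so it uses far more space than a compressed representation; it only fixes the meaning of the operations.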
Representing a static set of integers $S$, $|S| = n$, from a finite universe $U = [1..u]$ is a fundamental task in computer science. Our concern is to represent $S$ in small space while supporting the operations of rank and select on $S$; if $S$ is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.
Although there is an information-theoretic lower bound of $B(n,u) = \lg\binom{u}{n}$ bits on the space needed to represent $S$, this applies to worst-case (random) sets $S$, and sets found in practical applications are compressible. We focus on the case where elements of $S$ contain non-trivial runs of consecutive elements, one that occurs in many practical situations.
Let $C^n$ denote the class of $\binom{u}{n}$ distinct sets of $n$ elements over the universe $[1..u]$. Let also $C^n_g \subset C^n$ contain the sets whose $n$ elements are arranged in $g \le n$ runs of $\ell_i \ge 1$ consecutive elements from $U$ for $i = 1, \ldots, g$, and let $C^n_{g,r} \subset C^n_g$ contain all sets that consist of $g$ runs, such that $r \le g$ of them have at least 2 elements.
- We introduce new compressibility measures for sets, including:
  - $L_1 = \lg|C^n_g| = \lg\binom{u-n+1}{g} + \lg\binom{n-1}{g-1}$ and
  - $L_2 = \lg|C^n_{g,r}| = \lg\binom{u-n+1}{g} + \lg\binom{n-g-1}{r-1} + \lg\binom{g}{r}$.
  We show that $L_2 \le L_1 \le B(n,u)$.
- We give data structures that use space close to bounds $L_1$ and $L_2$ and support rank and select in $O(1)$ time.
- We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.
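To make the measures above concrete, here is a small sketch that computes the bound B(n,u) and the measures L1 and L2 for an explicit set, directly from the binomial formulas in the abstract. The function names and the example set are hypothetical. The sketch assumes a non-empty set and, as a convention chosen here, treats the middle term of L2 as 0 when no run has two or more elements (r = 0).

```python
from math import comb, log2


def run_lengths(sorted_elems):
    """Lengths of the maximal runs of consecutive integers."""
    lengths = []
    prev = None
    for x in sorted_elems:
        if prev is not None and x == prev + 1:
            lengths[-1] += 1
        else:
            lengths.append(1)
        prev = x
    return lengths


def measures(S, u):
    """Return (B, L1, L2) in bits for a non-empty set S within [1..u]."""
    elems = sorted(S)
    n = len(elems)
    lengths = run_lengths(elems)
    g = len(lengths)                          # number of runs
    r = sum(1 for l in lengths if l >= 2)     # runs with at least 2 elements
    B = log2(comb(u, n))
    gaps = log2(comb(u - n + 1, g))
    L1 = gaps + log2(comb(n - 1, g - 1))
    # Assumption: with r = 0 every run has length 1, so this term is 0.
    long_runs = log2(comb(n - g - 1, r - 1)) if r > 0 else 0.0
    L2 = gaps + long_runs + log2(comb(g, r))
    return B, L1, L2


# Example: runs {3,4,5}, {9}, {12,13} over U = [1..16], so n=6, g=3, r=2.
B, L1, L2 = measures({3, 4, 5, 9, 12, 13}, 16)
```

For this example the three classes contain 8008, 1650 and 990 sets respectively, which gives roughly 12.97, 10.69 and 9.95 bits and illustrates the ordering L2 ≤ L1 ≤ B(n,u).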
The 𝑘²-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to d dimensions, where it is called a 𝑘ᵈ-tree. The representation boils down to a long bitvector. We show that interpreting the 𝑘ᵈ-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
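The Morton codes mentioned above can be illustrated with a short sketch for the case k = 2: the bits of the d coordinates are interleaved, most significant bit first, so that each group of d bits names the child taken at one level of the tree. The function names are hypothetical, and the coordinate order within each group is an assumption of this example rather than something fixed by the abstract.

```python
def morton_encode(coords, bits):
    """Interleave the bits of d coordinates, most significant bit first."""
    code = 0
    for b in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> b) & 1)
    return code


def morton_decode(code, d, bits):
    """Inverse of morton_encode for d coordinates of the given bit width."""
    coords = [0] * d
    for level in range(bits):
        group = (code >> (d * (bits - 1 - level))) & ((1 << d) - 1)
        for j in range(d):
            coords[j] = (coords[j] << 1) | ((group >> (d - 1 - j)) & 1)
    return tuple(coords)


def child_path(code, d, bits):
    """Child index chosen at each tree level, from the root downwards."""
    mask = (1 << d) - 1
    return [(code >> (d * (bits - 1 - level))) & mask for level in range(bits)]


code = morton_encode((3, 5), bits=3)   # 011 and 101 interleave to 011011
path = child_path(code, d=2, bits=3)
```

Storing the points as a trie over such codes means that two points sharing a long code prefix share the same root-to-node path, which is what the trie interpretation relies on.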
Radical populists in the Andes have combined a populist program and a multicultural agenda. However, while populism centralizes power in the hands of the leader and emphasizes the unity of the people, multiculturalism grants cultural rights that strengthen societal autonomy, generating an inherent tension between these two modes of incorporation. How are populist governments able to combine unity and fragmentation as well as centralization and autonomy? This article develops the concept of populist multiculturalism, focusing on the Movimiento al Socialismo (MAS) in Bolivia, which has supported autonomy rights while simultaneously curtailing their implementation. Specifically, it examines the implementation of indigenous autonomous governments and prior consultation and the relationship between indigenous organizations and the ruling party. The article also extends this concept to Ecuador and Venezuela.
The semantic network of a dictionary is a mathematical structure that represents relationships among words of a language. In this work, we study the evolution of the semantic network of the Spanish dictionary during the last century, from 1925 to 2014. We analysed the permanence and changes of its structural properties, such as size of components, average shortest path length, and degree distribution. We found that global structural properties of the Spanish dictionary network are remarkably stable. In fact, if we remove all the labels from the network, networks from different editions of the Spanish dictionary are practically indistinguishable. On the other hand, local properties, such as the neighbourhood of a single word or the shared neighbourhood between a pair of words, change over the years, offering insights about the evolution of the lexicon. This paper presents preliminary evidence that dictionary networks are an interesting language tool and good proxies to study semantic clouds of words and their evolution in a given language.
LINK: https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_47.pdf
Political parties with activists are in decline due to various external shocks. Societal changes, like the emergence of new technologies of communication, have diminished the role and number of activists, while party elites increasingly can make do without grassroots activists. However, recent scholarship concerning different democracies has shown how activism still matters for representation. This book contributes to this literature by analyzing the unique case of the Uruguayan Frente Amplio (FA), the only mass-organic, institutionalized leftist party in Latin America. Using thick description, systematic process tracing, and survey research, this case study highlights the value of an organization-centered approach for understanding parties’ role in democracy. Within the FA, organizational rules grant activists a significant voice, which imbues activists’ participation with a strong sense of efficacy. This book is an excellent resource for scholars and students of Latin America and comparative politics who are interested in political parties and the challenges confronting new democracies.
To determine whether a document or a sentence expresses a positive or negative sentiment, three main approaches are commonly used: the lexicon-based approach, the corpus-based approach, and a hybrid approach. Most sentiment analysis studies target English, while research is more limited for other languages, including Arabic and its dialects. Lexicon-based approaches need annotated sentiment lexicons (containing the valence and intensity of their terms and expressions). Corpus-based sentiment analysis requires annotated sentences. One of the significant problems related to the treatment of Arabic and its dialects is the lack of these resources. We present in this survey the most recent resources and advances in Arabic sentiment analysis. The survey covers recent work, most of it published between 2015 and 2019, classified by category (survey work or contribution work). For contribution work, we focus on the construction of sentiment lexicons and corpora. We also describe emergent trends related to Arabic sentiment analysis, principally associated with the use of deep learning techniques.
Given an arbitrary set A of two-dimensional points over a totally-ordered domain, a two-sided planar range query consists of finding all points of A within an arbitrary quadrant. In this paper we present a novel data structure that uses linear space in |A| while allowing for two-dimensional orthogonal range queries with logarithmic pre-processing and constant-delay enumeration.
E-Sports has been positioned as an important activity within MOBA (Multiplayer Online Battle Arena) games in recent years. There is existing research on recommender systems in this topic, but most of it focuses on the character recommendation problem. However, the recommendation of items is also challenging because of its contextual nature, depending on the other characters. We have developed a framework that suggests items for a character based on the match context. The system aims to help players who have recently started the game as well as frequent players to take strategic advantage during a match and to improve their purchasing decisions. By analyzing a dataset of ranked matches through data mining techniques, we can capture the purchase dynamics of experienced players and use them to generate recommendations. The results show that our proposed solution achieves a mAP of up to 80%, suggesting that the method leverages context information successfully. These results, together with open issues we mention in the paper, call for further research in the area.
Clustering-based recommender systems restrict the search for similar users to small user clusters, providing fast recommendations in large-scale datasets. The groups can then naturally be distributed into different data partitions, scaling up the number of users the recommender system can handle. Unfortunately, as the number of users and items included in a cluster solution increases, the precision of a clustering-based recommender system decreases. We present a novel approach that introduces a cluster-based distance function used for neighborhood computation. In our approach, clusters generated from the training data provide the basis for neighborhood selection. Then, to expand the search for relevant users, we use a novel measure that can exploit the global cluster structure to infer distances to users outside the cluster. Empirical studies on five widely known benchmark datasets show that our proposal is very competitive in terms of precision, recall, and NDCG. However, the strongest point of our method is its scalability, reaching speedups of 20× in a sequential computing evaluation framework and up to 100× in a parallel architecture. These results show that an efficient implementation of our cluster-based CF method can handle very large datasets while also providing good results in terms of precision, avoiding the high computational costs involved in the application of more sophisticated techniques.
Deep learning is a branch of machine learning that generates multi-layered representations of data, commonly using artificial neural networks, and has improved the state-of-the-art in various machine learning tasks (e.g., image classification, object detection, speech recognition, and document classification). However, most popular deep learning frameworks such as TensorFlow and PyTorch require users to write code to apply deep learning. We present WekaDeeplearning4j, a Weka package that makes deep learning accessible through a graphical user interface (GUI). The package uses Deeplearning4j as its backend, provides GPU support, and enables GUI-based training of deep neural networks such as convolutional and recurrent neural networks. It also provides pre-processing functionality for image and text data.
Inorganic arsenic (As) is a toxic xenobiotic and carcinogen associated with severe health conditions. The urban population from the Atacama Desert in northern Chile was exposed to extremely high As levels (up to 600 µg/l) in drinking water between 1958 and 1971, leading to increased incidence of urinary bladder cancer (BC), skin cancer, kidney cancer, and coronary thrombosis decades later. In addition, the Andean Native-American ancestors of the Atacama population were exposed to elevated As levels in water (∼120 µg/l) for at least 5,000 years, suggesting adaptation to this selective pressure. Here, we performed two genome-wide selection tests (PBSn1 and an ancestry-enrichment test) in an admixed population from Atacama, to identify adaptation signatures to As exposure acquired before and after admixture with Europeans, respectively. The second-highest-ranked variant selected by PBSn1 was associated with LCE4A-C1orf68, a gene that may be involved in the immune barrier of the epithelium during BC. We performed association tests between the top PBSn1 hits and BC occurrence in our population. The strongest association (P = 0.012) was achieved by the LCE4A-C1orf68 variant. The ancestry-enrichment test detected highly significant signals (P = 1.3 × 10⁻⁹) mapping to MAK16, a gene with important roles in ribosome biogenesis during the G1 phase of the cell cycle. Our results contribute to a better understanding of the genetic factors involved in adaptation to the pathophysiological consequences of As exposure.
The raster model is commonly used for the representation of images in many domains and is especially useful in Geographic Information Systems (GIS) to store information about continuous variables of the space (elevation, temperature, etc.). Current representations of raster data are usually designed for external memory or, when stored in main memory, lack efficient query capabilities. In this paper, we propose compact representations to efficiently store and query raster datasets in the main memory. We present different representations for binary raster data, general raster data, and time-evolving raster data. We experimentally compare our proposals with traditional storage mechanisms such as linear quadtrees or compressed GeoTIFF files. Results show that our structures are up to 10 times smaller than classical linear quadtrees, and even comparable in space to non-queriable representations of raster data, while efficiently answering a number of typical queries.
Interactive technologies are changing the ways we learn facts, develop attitudes and participate in politics, with the ensuing risk of increasing pre-existing inequalities. Addressing this challenge is the duty of researchers, technology companies, governments and news organizations.
This article deals with fieldwork in challenging research contexts that make preparation for field research particularly difficult. Challenging contexts include generally insecure places, politicized contexts, and unknown settings. Drawing on our experience in the field, we discuss four challenges that are common across these contexts: access, positionality, researcher well-being, and research design and data collection. Bringing together insights from fieldwork with urban elites and in the countryside, this paper describes problems that occurred in both settings and identifies a set of interpersonal skills that helped the authors to tackle the challenges of the field and seize the opportunities it offered. This article posits that recognizing the importance of certain interpersonal skills, namely openness, empathy, humility, and flexibility, precedes the identification of practical tools. Rather than specific tools, interpersonal skills concern a general attitude that underlies researchers’ capacity to make informed choices about specific courses of action, preparing fieldworkers to confront problems once they arise.
This paper describes a submission to the Word-in-Context competition for the IJCAI 2019 SemDeep-5 workshop. The task is to determine whether a given focus word is used in the same or different senses in two contexts. We took an ELMo-inspired approach similar to the baseline model in the task description paper, where contextualized representations are obtained for the focus words and a classification is made according to the degree of similarity between these representations. Our model had a few simple differences, notably joint training of the forward and backward LSTMs, a different choice of states for the contextualized representations and a new similarity measure for them. These changes yielded a 3.5% improvement on the ELMo baseline.
Introduction: Deep Neural Networks (DNNs) are a type of machine learning model (Goodfellow et al., 2016) which have reported state-of-the-art results in several tasks in the past years. Despite the impressive results reported by these models in several fields such as computer vision (Krizhevsky et al., 2014), natural language processing (Mikolov et al., 2013) or recommender systems (Covington et al., 2016), one of their biggest drawbacks is their lack of interpretability and transparency. Some of the best performing DNN models have millions of parameters, so making sense of what these models learn is an active research challenge. These algorithms can help to solve and automate difficult and expensive tasks, but their adoption in critical domains, which usually requires liability, depends on making their decisions interpretable by humans. Some large funding programs such as DARPA XAI (Gunning and Aha, 2019) are addressing this problem, providing evidence of its importance. In addition, recent legislation such as Europe’s GDPR gives people the right to explainability of automated decisions regarding their private data.
One of the most significant techniques introduced to DNNs in recent years is the so-called attention mechanism (Larochelle and Hinton, 2010). The idea is inspired by our visual system, since humans focus selectively on parts rather than on a whole image, combining information from several fixations to form the full scene (Mnih et al., 2014). This mechanism allows the network to focus on a subset of inputs or parameters when trained on a task. Attention has improved the performance of these models, and it has also given them a chance to be more explainable. Inspecting what the model is paying attention to helps to make the model accountable in tasks such as image classification, document classification or automatic image captioning. Despite this potential, researchers in the area of machine learning usually use the traditional visualization idioms available in software packages, rather than studying all the options for visual encodings to represent models, results or parameters more effectively. We see a chance of using design principles from information visualization in order to improve the way that neural attention models are visually presented.
This article focuses on the design space to analyze, inspect and understand what neural attention models are learning. In particular, we aim at contributing to the field of Explainable Artificial Intelligence (XAI), by describing the potential design space as well as informed decisions to take into account when presenting the results of neural networks using the attention mechanism. We also propose some initial ideas with a use case: classification of biomedical documents.
https://observablehq.com/@clpuc/analyzing-the-design-space-for-visualizing-neural-attenti
Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches are attempting to reduce the workload of physicians who need to screen and label hundreds or thousands of documents in order to answer specific clinical questions. Previous works have attempted to semi-automate document screening, reporting promising results, but their evaluation is conducted using small datasets, which hinders generalization. Moreover, some recent works have used recently introduced neural language models, but no previous work has compared, for this task, the performance of different language models based on neural word embeddings, which have reported good results in recent years for several NLP tasks. In this work, we evaluate the performance of two popular neural word embeddings (Word2vec and GloVe) in an active learning-based setting for document screening in EBM, with the goal of reducing the number of documents that physicians need to label in order to answer clinical questions. We evaluate these methods on a small public dataset (HealthCLEF 2017) as well as a larger one (Epistemonikos). Our experiments indicate that Word2vec has less variance and better general performance than GloVe when using active learning strategies based on uncertainty sampling.
http://ceur-ws.org/Vol-2414/paper10.pdf
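The uncertainty sampling strategy named in the document-screening abstract above can be sketched in a few lines. The abstract does not say which uncertainty criterion was used, so this example assumes the common binary "least confident" rule: ask the physician to label the documents whose predicted relevance probability is closest to 0.5. The function name and the probabilities are made up for illustration.

```python
def uncertainty_sample(probabilities, batch_size):
    """Indices of the documents the classifier is least sure about.

    Assumption: probabilities[i] is the predicted probability that
    unlabeled document i is relevant (binary classification).
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for five unlabeled documents.
predicted = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = uncertainty_sample(predicted, batch_size=2)
```

In an active learning loop, the selected documents would be labeled, added to the training set, and the classifier retrained before the next selection round.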
Scholars have advanced many theoretical explanations for expecting a negative or positive relationship between individuals’ cross-cutting exposure—either through interpersonal or mediated forms of communication—and their political participation. However, whether cross-cutting exposure is a positive or negative predictor of participation is still an unsettled question. To help fill this gap, we conducted a meta-analysis of 48 empirical studies comprising more than 70,000 participants examining the association between cross-cutting exposure and political participation. The meta-analysis produced two main findings. First, it shows that, over all studies, there is no significant relationship, r = .002, Zr = .002 (95% CI = −.04 to .05). Second, the null relationship cannot be explained by variations in the characteristics of cross-cutting environments (e.g., topic, place, or source of exposure), participation outcomes (e.g., online vs. offline activities), or methods employed (e.g., experiment vs. survey). Taken together, these results should alleviate concerns about negative effects of cross-cutting exposure on political engagement. Implications for future research are discussed.
Online activity involves the consumption and production of event-related content. There are about 500 million Twitter messages published every day, and according to surveys, 59% of its users use the platform as a way to get the news. Its high rate of production of multimodal content (text, images, and videos) necessitates having flexible models to understand the dynamics of the information disseminated on social media. This thesis proposes the creation of context models from user-generated messages on Twitter to discover knowledge as a way to perform high-level quantitative analysis of news events. These models are useful from three perspectives: the spatio-temporal context in which the events develop, the activity of users that react when a high-impact event happens, and the multimodal content that can be exploited to generate a comprehensive summary of the event. Our current work involves the creation of a geopolitical model that relates events and countries, allowing us to discover international relations; the study of what features make an event likely to provoke high activity from users; and a characterization that allows us to predict with high precision which events are going to produce high activity. This includes our ongoing work on generating automatic multimodal summaries of events based on the assumption that the users describe the non-textual content in their tweets when they express their facts and opinions around events.
Introduction: Memory-bounded search algorithms are typically used when the search space is too large for regular best-first search algorithms like A* to store in memory. There exists a large class of memory-bounded best-first search algorithms including Depth-First Branch-and-Bound (DFBnB), Iterative Deepening A* (IDA*) (Korf 1985), Recursive Best-First Search (RBFS) (Korf 1993), and Simplified Memory-Bounded A* (SMA*) (Russell 1992). Each of these algorithms relies on a different strategy to ensure that it uses only a bounded amount of memory: IDA* bounds the amount of memory used by repeatedly running depth-first searches, increasing the explored depth at each iteration. RBFS uses lower and upper bounds that are tightened over time as it explores the search space while keeping only b · d nodes in memory, where b is the branching factor and d is the depth of the tree. And, finally, SMA* keeps only a bounded number of nodes in memory by pruning the least promising nodes from the OPEN list when it runs out of memory. In this abstract, we summarize an alternative approach to memory-bounded best-first search. It is motivated by real-time heuristic search algorithms (Korf 1990), many of which iterate the following steps until the goal is reached: up to k nodes are expanded, where k is a user-defined bound; the h-values of expanded nodes are updated to make them more informed; the agent moves along a path in the search tree just expanded. We propose a general framework that iteratively (1) runs a memory-bounded best-first search algorithm that terminates when k nodes are generated. If no solution is found, (2) it updates the h-values of the generated nodes, and (3) purges the h-values of some nodes from memory. As such, the total number of h-values ever stored by our approach is upper-bounded by a constant.
Under certain (reasonable) conditions, our framework is complete and preserves the (sub)optimality guarantees of the given best-first search algorithm in tree-shaped search spaces. The main conceptual difference between our framework and the SMA* algorithm is that it can be combined with any best-first algorithm with very minor modifications. We present experimental results where we plug into our framework memory-bounded variants of Weighted A* (Pohl 1970). On traveling salesman problems we show that our framework is often able to find better solutions than DFBnB and Weighted DFBnB (wDFBnB), and in less time, especially in problems with large search spaces.
https://aaai.org/ocs/index.php/SOCS/SOCS19/paper/viewFile/18376/17491
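Of the memory-bounded algorithms listed in the abstract above, IDA* is the easiest to show in a few lines: it repeats a depth-first search with an increasing bound on f = g + h and stores only the current path. The sketch below is a textbook-style version given for orientation only. It is not the framework proposed in the abstract, and the graph, the zero heuristic and all names are hypothetical.

```python
import math

FOUND = object()  # sentinel returned when the goal has been reached


def ida_star(start, is_goal, successors, h):
    """Iterative Deepening A*: returns a path to a goal, or None."""
    path = [start]

    def search(g, bound):
        node = path[-1]
        f = g + h(node)
        if f > bound:
            return f                   # smallest f that exceeded the bound
        if is_goal(node):
            return FOUND
        minimum = math.inf
        for child, cost in successors(node):
            if child in path:          # avoid cycles along the current path
                continue
            path.append(child)
            result = search(g + cost, bound)
            if result is FOUND:
                return FOUND
            minimum = min(minimum, result)
            path.pop()
        return minimum

    bound = h(start)
    while True:
        result = search(0, bound)
        if result is FOUND:
            return list(path)
        if result == math.inf:
            return None                # no goal is reachable
        bound = result                 # next iteration uses a larger bound


# Hypothetical weighted graph; h = 0 is trivially admissible.
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
route = ida_star("A", lambda n: n == "D", lambda n: graph[n], lambda n: 0)
```

Because only the current path is kept, memory use grows with the search depth rather than with the number of generated nodes, at the price of re-expanding nodes in every iteration.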
Introduction: Multi-Agent Pathfinding (MAPF) over grids is the problem of finding n non-conflicting paths that lead n agents from a given initial cell to a given goal cell. Sum-of-costs-optimal MAPF, or simply cost-optimal MAPF, in addition, minimizes the total number of actions performed by each agent before stopping at the goal. Being a combinatorial problem in nature, a number of compilations from MAPF to Satisfiability (SAT) (Surynek et al. 2016) and Answer Set Programming (ASP) exist (Erdem et al. 2013; Gebser et al. 2018). Here we propose and evaluate a new compilation of MAPF over grids to ASP. Unlike existing compilations we are aware of, both to SAT and to ASP, our encoding is the first that produces a number of clauses that is linear in the number of agents. In addition, the clauses that allow representing the optimization objective are also efficiently written, and do not depend on the size of the grid. Like makespan-optimal approaches, our algorithm searches for cost-optimal solutions with increasing makespan. When a solution is found, a provably correct upper bound on the maximum makespan at which a true cost-optimal solution exists is computed, and the solver is rerun once more.
https://aaai.org/ocs/index.php/SOCS/SOCS19/paper/viewFile/18374/17489
We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of at most triple-exponential size, and in fact we show that this bound is optimal. We also study better behaved classes of UC2RPQs, namely acyclic UC2RPQs of bounded thickness, and strongly connected UCRPQs, whose boundedness problem is, respectively, PSpace-complete and Pi_2^P-complete. Most upper bounds exploit results on limitedness for distance automata, in particular extending the model with alternation and two-wayness, which may be of independent interest.
Monadic decomposability – the ability to determine whether a formula in a given logical theory can be decomposed into a boolean combination of monadic formulas – is a powerful tool for devising a decision procedure for a given logical theory. In this paper, we revisit a classical decision problem in automata theory: given a regular (a.k.a. synchronized rational) relation, determine whether it is recognizable, i.e., it has a monadic decomposition (that is, a representation as a boolean combination of cartesian products of regular languages). Regular relations are expressive formalisms which, using an appropriate string encoding, can capture relations definable in Presburger Arithmetic. In fact, their expressive power coincides with relations definable in a universal automatic structure; equivalently, those definable by finite set interpretations in WS1S (Weak Second Order Theory of One Successor). Determining whether a regular relation is recognizable was known to be decidable (and in exponential time for binary relations), but its precise complexity has hitherto remained open. Our main contribution is to fully settle the complexity of this decision problem by developing new techniques employing infinite Ramsey theory. The complexity for DFA (resp. NFA) representations of regular relations is shown to be NLOGSPACE-complete (resp. PSPACE-complete).
This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work, Claude et al. (2016). In that parent paper, we proposed many and varied techniques for compressing indexes which exploit that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.
Māori loanwords are widely used in New Zealand English for various social functions by New Zealanders within and outside of the Māori community. Motivated by the lack of linguistic resources for studying how Māori loanwords are used in social media, we present a new corpus of New Zealand English tweets. We collected tweets containing selected Māori words that are likely to be known by New Zealanders who do not speak Māori. Since over 30% of these words turned out to be irrelevant, we manually annotated a sample of our tweets into relevant and irrelevant categories. This data was used to train machine learning models to automatically filter out irrelevant tweets.
The sheer amount of newsworthy information published by users in social media platforms makes it necessary to have efficient and effective methods to filter and organize content. In this scenario, off-the-shelf methods fail to process large amounts of data, which is usually approached by adding more computational resources. Simple data aggregations can help to cope with space and time constraints, while at the same time improve the effectiveness of certain applications, such as topic detection or summarization. We propose a lightweight representation of newsworthy social media data. The proposed representation leverages microblog features, such as redundancy and re-sharing capabilities, by using surrogate texts from shared URLs and word embeddings. Our representation allows us to achieve comparable clustering results to those obtained by using the complete data, while reducing running time and required memory. This is useful when dealing with noisy and raw user-generated social media data.
Hate speech is an important problem that is seriously affecting the dynamics and usefulness of online social communities. Large-scale social platforms are currently investing important resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance but only within specific datasets. In this work, we analyze this apparent contradiction between existing literature and actual applications. We study closely the experimental methodology used in prior work and their generalizability to other datasets. Our findings evidence methodological issues, as well as an important dataset bias. As a consequence, performance claims of the current state-of-the-art have become significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-conduct experiments to give a more accurate picture of the current state-of-the-art methods.
Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationships among proteins in candidate pathways, using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information for deciding on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.
In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question of when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or fixed-parameter tractable. We study this question for a range of ontology-mediated query languages based on several important and widely-used DLs, using unions of conjunctive queries as the actual queries. For the DL ELHI⊥, we provide a characterization of the classes of OMQs that are fixed-parameter tractable. For its fragment ELH_⊥^dr, which restricts the use of inverse roles, we provide a characterization of the classes of OMQs that are tractable in combined complexity. Both results are in terms of equivalence to OMQs of bounded tree width and rest on a reasonable assumption from parameterized complexity theory. They are similar in spirit to Grohe’s seminal characterization of the tractable classes of conjunctive queries over relational databases. We further study the complexity of the meta problem of deciding whether a given OMQ is equivalent to an OMQ of bounded tree width, providing several completeness results that range from NP to 2ExpTime, depending on the DL used. We also consider the DL-Lite family of DLs, including members that, unlike ELHI⊥, admit functional roles.
A typical reverse engineering problem for a query language L is, given a database D and sets P and N of tuples over D labeled as positive and negative examples, respectively, is there a query q in L that explains P and N, i.e., the evaluation of q on D contains all positive examples in P and none of the negative examples in N? Applications of reverse engineering problems include query-by-example, classifier engineering, and the study of the expressive power of query languages. In this talk I will present a family of tests that solve the reverse engineering problem described above for several query languages of interest, e.g., FO, CQ, UCQs, RPQs, CRPQs, etc. We will see that in many cases such tests directly provide optimal bounds for the problem, as well as for the size of the smallest query that explains the given labeled examples. I will also present restrictions that alleviate the complexity of the problem when it is too high. Finally, I will develop the relationship between reverse engineering and a separability problem recently introduced to assist the task of feature engineering with data management tools.
The mechanisms by which users of platforms such as Facebook and Twitter spread misinformation are not well understood. In this study, we argue that the effects of informational uses of social media on political participation are inextricable from its effects on misinformation sharing. That is, political engagement is both a major consequence of using social media for news as well as a key antecedent of sharing misinformation. We test our expectations via a two-wave panel survey of online media users in Chile, a country experiencing information disorders comparable to those of the global North. Analyses of the proposed and alternative causal models with two types of structural equation specifications (fixed effects and autoregressive) support our theoretical model. We close with a discussion on how changes in the way people engage with news and politics – brought about by social media – have produced a new dilemma: how to sustain a citizenry that is enthusiastically politically active, yet not spreading misinformation?
The field of knowledge compilation establishes the tractability of many tasks by studying how to compile them to Boolean circuit classes obeying some requirements such as structuredness, decomposability, and determinism. However, in other settings such as intensional query evaluation on databases, we obtain Boolean circuits that satisfy some width bounds, e.g., they have bounded treewidth or pathwidth. In this work, we give a systematic picture of many circuit classes considered in knowledge compilation and show how they can be systematically connected to width measures, through upper and lower bounds. Our upper bounds show that bounded-treewidth circuits can be constructively converted to d-SDNNFs, in time linear in the circuit size and singly exponential in the treewidth; and that bounded-pathwidth circuits can similarly be converted to uOBDDs. We show matching lower bounds on the compilation of monotone DNF or CNF formulas to structured targets, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable): any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth, and the same holds for uOBDDs (resp., n-OBDDs) when considering pathwidth. Unlike most previous work, our bounds apply to any formula of this class, not just a well-chosen family. Hence, we show that pathwidth and treewidth respectively characterize the efficiency of compiling monotone DNFs to uOBDDs and d-SDNNFs with compilation being singly exponential in the corresponding width parameter. We also show that our lower bounds on CNFs extend to unstructured compilation targets, with an exponential lower bound in the treewidth (resp., pathwidth) when compiling monotone CNFs of constant arity and degree to DNNFs (resp., nFBDDs).
Drawing on social-psychological and political research, we offer a theoretical model that explains how people become exposed to fake news, come to believe it, and then share it with their contacts. Using two waves of a nationally representative sample of Chileans with internet access, we pinpoint the relevant causal factors. Analysis of the panel data indicates that three groups of variables largely explain these phenomena: (1) Personal and psychological factors such as belief in conspiracy theories, trust in others, education and gender; (2) Frequency and specific uses of social media; and (3) Political views and online activism. Importantly, personal and political-psychological factors are more relevant in explaining this behavior than specific uses of social media.
The popularity of microblogging websites makes them important for information dissemination. The diffusion of large volumes of fake or unverified information could emerge and spread, producing damage. Due to the ever-increasing volume of data and the nature of complex diffusion, automatic rumor detection is a very challenging task. Supervised classification and other approaches have been widely used to identify rumors in social media posts. However, despite achieving competitive results, only a few studies have delved into the nature of the problem itself in order to identify key empirical factors that allow defining both the baseline models and their performance. In this work, we learn discriminative features from tweet content and propagation trees by following their sequential propagation structure. To do this we study the performance of a number of architectures based on recursive neural networks conditioning for rumor detection. In addition, to ingest tweets into each network, we study the effect of two different word embedding schemes: GloVe and Google News skip-grams. Results on the Twitter16 dataset show that model performance depends on many empirical factors and that some specific experimental configurations consistently lead to better results.
The Mercalli scale of quake damage is based on perceived effects and it has a strong dependence on observers. Recently, we proposed a method for ground shaking intensity estimation based on lexical features extracted from tweets, showing good performance in terms of mean absolute error (MAE). One of the flaws of that method is the detection of the region of interest, i.e., the area of a country where the quake was felt. Our previous results showed enough recall in terms of municipality recovery but a poor performance in terms of accuracy. One of the reasons that help to explain this effect is the presence of data noise, as many people comment on or confirm a quake in areas where the event was unperceived. This happens because people become aware of an event by watching news or by word-of-mouth propagation. To alleviate this problem in our earthquake detection system we study how propagation features behave in a region of interest estimation task. The intuition behind our study is that the patterns that characterize a word-of-mouth propagation differ from the patterns that characterize a perceived event. If this intuition is true, we expect to separate both kinds of propagation modes. We do this by computing a number of features to represent propagation trees. Then, we trained a learning algorithm using our features in the specific task of region of interest estimation. Our results show that propagation features behave well in this task, outperforming lexical features in terms of accuracy.
Social media is the primary source of information for many people around the world, not only to know about their families and friends but also to read about news and trends in different areas of interest. Fake News or rumors can generate big problems of misinformation, being able to change the mindset of a large group of people concerning a specific topic. Many companies and researchers have put their efforts into detecting these rumors with machine learning algorithms creating reports of the influence of these “news” in social media (https://www.knightfoundation.org/reports/disinformation-fake-news-and-influence-campaigns-on-twitter). Only a few studies have been made in detecting rumors in real-time, considering the first hours of propagation. In this work, we study the spread of a claim, analyzing different characteristics and how propagation patterns behave in time. Experiments show that rumors have different behaviours that can be used to classify them within the first hours of propagation.
Trajectory patterns characterize similar behaviors among trajectories, which play an important role in applications such as urban planning, traffic congestion control, and studies of animal migration and natural phenomena. In this paper we model each trajectory as a sequence of line segments that represent the steady movement of an object along time. We use a segment-clustering process to group trajectories’ segments and partial segments based on their temporal and spatial closeness. We then define a trajectory pattern as the aggregation of segment clusters, an aggregation that is based not only on spatial and temporal sequentiality, but also on the compatibility of trajectories in each segment cluster. The experimental assessment shows the effectiveness of the method.
G-CORE is a query language with two key characteristics: it is closed under graphs and incorporates paths as first-class citizens. Currently G-CORE does not have recursion. In this paper we propose this extension and show how to code classical polynomial graph algorithms with it.
RDF and Property Graph databases are two approaches for data management that are based on modeling, storing and querying graph-like data. In this paper, we present a short study about the interoperability between these approaches. We review the current solutions to the problem, identify their features, and discuss the inherent issues.
People use the news media to learn about the world beyond their family, neighborhood, and workplace. As news consumers, we depend on what television, social media, websites, radio stations, and newspapers decide to inform us about. This is because all news media, whether through journalists or digital algorithms, select, process, and filter information to their users. Over time, the aspects that are prominent in the news media usually become prominent in public opinion. The ability of journalists to influence which issues, aspects of these issues, and persons related to these issues, are perceived as the most salient has come to be called the agenda-setting effect of journalism.
First described by Maxwell McCombs and Donald Shaw in a seminal study conducted during the 1968 elections in the United States, agenda-setting theory has expanded to include several other aspects beyond the transfer of salience of issues from the media agenda to the public agenda. These aspects include: the influence of journalism on the attributes of issues and people that make news; the networks between the different elements in the media and public agendas; the determinants of the news media agenda; the psychological mechanisms that regulate agenda-setting effects; and the consequences of agenda setting on both citizens’ and policymakers’ attitudes and behaviors. As one of the most comprehensive and international theories of journalism studies available, agenda setting continues to evolve in the expanding digital media landscape.
https://doi.org/10.1093/acrefore/9780190228613.013.777
This study seeks to deepen our understanding of the factors that explain individuals’ willingness to self-censor (WtSC)—the proclivity to withhold an opinion from an audience perceived to disagree with that opinion. It does so by testing the “impressionable years” hypothesis, which states that the historical context experienced between the age of 18 and 25 years has a lasting effect on individual dispositions such as WtSC. The study was conducted in Chile, an ideal case to explore possible cohort effects because of the profound political changes experienced there in the past 50 years. Analysis of an original cross-sectional survey shows that—as expected—people who came of age in periods of political repression exhibit significantly higher levels of WtSC later in life compared with those who grew up during less repressive times.
We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.
Tensors are one of the most widely used data structures in modern Machine Learning applications. Although they provide a flexible way of storing and accessing data, they often expose too many low-level details that may result in error prone code that is difficult to maintain and extend. Abstracting low-level functionalities into high-level operators in the form of a query language is a task in which the Data Management community has extensive experience. It is thus important to understand how such an experience can be applied in the design of useful languages for tensor manipulation.
In this short paper we study a matrix and a tensor query language that have been recently proposed in the database literature. We show, by using examples, how these proposals are in line with the practical interest in rethinking tensor abstractions. On the technical side, we compare the two languages in terms of operators that naturally arise in Machine Learning pipelines, such as convolution, matrix-inverse, and Einstein summation. We hope our results provide a theoretical kick-off for the discussion on the design of core declarative query languages for tensors.
In this article we review the main concepts around database repairs and consistent query answering, with emphasis on tracing back the origin, motivation, and early developments. We also describe some research directions that have spun off from those main concepts and the original line of research. We emphasize, in particular, fruitful and recent connections between repairs and causality in databases.
We present a method to support high quality visualization recommendations for analytic tasks. Visualization converts large datasets into images that allow viewers to efficiently explore, discover, and validate within their data. Visualization recommenders have been proposed that store past sequences: an ordered collection of design choices leading to successful task completion; then match them against an ongoing visualization construction. Based on this matching, a system recommends visualizations that better support the analysts’ tasks. A problem of scalability occurs when many sequences are stored. One solution would be to index the sequence database. However, during matching we require sequences that are similar to the partially constructed visualization, not only those that are identical. We implement a locality sensitive hashing algorithm that converts visualizations into set representations, then uses Jaccard similarity to store similar sequence nodes in common hash buckets. This allows us to match partial sequences against a database containing tens of thousands of full sequences in less than 100ms. Experiments show that our algorithm locates 95% or more of the sequences found in an exhaustive search, producing high-quality visualization recommendations.
LINK: https://doi.org/10.20380/GI2019.05
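The hashing step described in the abstract above can be illustrated with a minimal MinHash-based sketch of locality sensitive hashing for Jaccard similarity. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `MinHashLSH`, the banding parameters, and the encoding of a visualization as a set of design-choice strings. The authors' actual algorithm may differ.

```python
import hashlib
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """MinHash signature: the minimum hash value per seeded hash function."""
    items = list(items)
    if not items:
        raise ValueError("cannot sign an empty set")
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


class MinHashLSH:
    """Hypothetical index: similar sets tend to share at least one bucket."""

    def __init__(self, bands=16, rows=2):
        self.bands = bands
        self.rows = rows
        self.buckets = defaultdict(set)
        self.sets = {}

    def _bucket_keys(self, items):
        sig = minhash_signature(items, self.bands * self.rows)
        for b in range(self.bands):
            yield (b, sig[b * self.rows:(b + 1) * self.rows])

    def add(self, key, items):
        self.sets[key] = frozenset(items)
        for bucket in self._bucket_keys(items):
            self.buckets[bucket].add(key)

    def query(self, items, threshold=0.5):
        """Collect bucket-mates, then filter them by exact Jaccard similarity."""
        candidates = set()
        for bucket in self._bucket_keys(items):
            candidates |= self.buckets.get(bucket, set())
        return sorted(c for c in candidates
                      if jaccard(items, self.sets[c]) >= threshold)


# Made-up design-choice sets standing in for stored visualization nodes.
index = MinHashLSH()
index.add("bar_sales", {"mark=bar", "x=year", "y=sales"})
index.add("map_population", {"mark=geo", "color=population"})
matches = index.query({"mark=bar", "x=year", "y=sales"})
```

Only the candidates that share a bucket are compared exactly, which is what keeps matching cheap when many sequences are stored; an identical set always collides with itself, while near-duplicates collide with a probability that grows with their Jaccard similarity.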
We study two simple yet general complexity classes, which provide a unifying framework for efficient query evaluation in areas like graph databases and information extraction, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect. Both complexity classes are defined in terms of nondeterministic logarithmic-space transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm (with preprocessing) for uniform generation. Remarkably, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n (given in unary) accepted by a non-deterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether this problem admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
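To make the #NFA problem from the abstract above concrete, the following sketch counts accepted strings of length n exactly by tracking the set of states reachable on each prefix. This is explicitly not the FPRAS of the paper: it is a naive exact baseline whose running time can be exponential in the number of states, shown only to illustrate what is being counted. The function name `count_accepted` and the dictionary encoding of the transition relation are assumptions made for this example.

```python
from collections import Counter


def count_accepted(n, alphabet, delta, initial, accepting):
    """Count the distinct strings of length n accepted by an NFA.

    delta maps (state, symbol) to an iterable of successor states.
    Each string reaches exactly one subset of states, so grouping
    strings by that subset counts every string once, even when the
    NFA has several accepting runs for it.
    """
    current = Counter({frozenset(initial): 1})
    for _ in range(n):
        following = Counter()
        for subset, count in current.items():
            for symbol in alphabet:
                target = frozenset(
                    q2 for q in subset for q2 in delta.get((q, symbol), ())
                )
                if target:  # strings with no surviving run are dropped
                    following[target] += count
        current = following
    accepting = set(accepting)
    return sum(count for subset, count in current.items() if subset & accepting)


# Ambiguous NFA over {a, b} accepting the strings that contain at least one 'a'.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {1},
    (1, "b"): {1},
}
total = count_accepted(3, "ab", delta, initial={0}, accepting={1})
```

For this automaton the count for length 3 is 7, that is, all 8 strings except `bbb`. Counting accepting runs instead would overcount, since a string such as `aab` has more than one accepting run; this gap between runs and strings is why ambiguity makes the problem hard.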
We propose and investigate a concrete numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on a database repair semantics associated to cardinality-repairs. More specifically, it is shown that the computation of this measure can be intractable in data complexity, but answer-set programs are exhibited that can be used to compute it. Furthermore, it is established that there are polynomial-time deterministic and randomized approximations. The behavior of this measure under small updates is analyzed, obtaining fixed-parameter tractability results. We explore abstract extensions of this measure that appeal to generic classes of database repairs. Inconsistency measures and repairs at the attribute level are investigated as a particular, but relevant and natural case.
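One possible reading of a measure based on cardinality-repairs is the fraction of tuples that must be deleted to restore consistency. The sketch below computes that fraction by brute force, under assumptions that the abstract does not state: constraints are given as explicit conflict sets (groups of tuples that cannot coexist), repairs only delete tuples, and the function name `inconsistency_degree` is hypothetical. The exhaustive search is exponential, in line with the intractability mentioned above, and is meant only as an illustration.

```python
from itertools import combinations


def inconsistency_degree(tuples, conflicts):
    """Fraction of tuples removed by a minimum-cardinality deletion repair.

    conflicts is a list of non-empty sets of tuples that jointly violate
    a constraint; a repair must remove at least one tuple from each set.
    """
    tuples = list(tuples)
    if not tuples:
        return 0.0
    conflicts = [frozenset(c) for c in conflicts]
    # Try deletion sets of increasing size; the first hit is minimum.
    for k in range(len(tuples) + 1):
        for removed in combinations(tuples, k):
            removed = set(removed)
            if all(conflict & removed for conflict in conflicts):
                return k / len(tuples)
    raise ValueError("a conflict set is empty or mentions unknown tuples")


# Toy instance: t1 clashes with both t2 and t3 (e.g. under a key constraint).
degree = inconsistency_degree(
    ["t1", "t2", "t3", "t4"],
    [{"t1", "t2"}, {"t1", "t3"}],
)
```

Here deleting `t1` alone resolves both conflicts, so the value is 1/4; a consistent database yields 0.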
User satisfaction is an important aspect to consider in any public transport system, and as such, regular and sound measurements of its levels are fundamental. However, typical evaluation schemes involve costly and time-consuming surveys. As a consequence, their frequency is not enough to properly and timely characterize the satisfaction of the users. In this paper, we propose a methodology, based on Twitter data, to capture the satisfaction of a large mass of users of public transport, allowing us to improve the characterization and location of their satisfaction level. We analyzed a massive volume of tweets referring to the public transport system in Santiago, Chile (Transantiago) using text mining techniques, such as sentiment analysis and topic modeling, in order to capture and group bus users’ expressions. Results show that, although the level of detail and variety of answers obtained from surveys are higher than the ones obtained by our method, the number of bus stops and bus services covered by the proposed scheme is larger. Moreover, the proposed methodology can be effectively used to diagnose problems in a timely manner, as it is able to identify and locate trends, and issues related to bus operating firms, whereas surveys tend to produce average answers. Based on the consistency and logic of the results, we argue that the proposed methodology can be used as a valuable complement to surveys, as both present different, but compatible characteristics.
https://doi.org/10.1109/ACCESS.2019.2915107
AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabeled tweets. The package was used by several teams in the shared tasks: EmoInt 2017 and Affect in Tweets SemEval 2018 Task 1.
Due to advances in digitization technology, documentation efforts and digital library systems, increasingly large collections of visual Cultural Heritage (CH) object data are becoming available, offering rich opportunities for domain analysis, e.g., for comparing, tracing and studying objects created over time. In principle, existing shape- and image-based similarity search methods can aid such domain analysis tasks. However, in practice, visual object data are given in different modalities, including 2D, 3D, sketches or conventional drawings like profile sections or unwrappings. In addition, collections may be distributed across different publications and repositories, posing a challenge for implementing encompassing search and analysis systems. We introduce a methodology and system for cross-modal visual search in CH object data. Specifically, we propose a new query modality based on 3D views enhanced by user sketches (3D+sketch). This allows for adding new context to the search, which is useful e.g., for searching based on incomplete query objects, or for testing hypotheses on existence of certain shapes in a collection. We present an appropriately designed workflow for constructing query views from incomplete 3D objects enhanced by a user sketch based on shape completion and texture inpainting. Visual cues additionally help users compare retrieved objects with the query. We apply our method on a set of relevant 3D and view-based CH object data, demonstrating the feasibility of our approach and its potential to support analysis by domain experts in Archaeology and the field of CH in general.
https://doi.org/10.2312/3dor.20191057
The Bitcoin protocol and its underlying cryptocurrency have started to shape the way we view digital currency, and opened up a large list of new and interesting challenges. Amongst them, we focus on the question of how the price of digital currencies is affected, which is a natural question especially when considering the price rollercoaster we witnessed for bitcoin in 2017-2018. We work under the hypothesis that price is affected by the web footprint of influential people, whom we refer to as crypto-influencers.
In this paper we provide neural models for predicting bitcoin price. We compare what happens when the model is fed only with recent price history versus what happens when fed, in addition, with a measure of the positivity or negativity of the sayings of these influencers, measured through a sentiment analysis of their Twitter posts. We show preliminary evidence that Twitter data should indeed help to predict the price of bitcoin, even though the measures we use in this paper have a lot of room for refinement. In particular, we also discuss the challenges of measuring the correct sentiment of these posts, and discuss the work that should help improve our discoveries even further.
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with a corresponding unambiguous entry in a Knowledge Base. The evaluation of EL systems relies on the comparison of their results against gold standards. A common format used to represent gold standard datasets is the NLP Interchange Format (NIF), which uses RDF as a data model. However, creating gold standard datasets for EL is a time-consuming and error-prone process. In this paper we propose a tool called NIFify to help manually generate, curate, visualize and validate EL annotations; the resulting tool is useful, for example, in the creation of gold standard datasets. NIFify also serves as a benchmark tool that enables the assessment of EL results. Using the validation features of NIFify, we further explore the quality of popular EL gold standards.
Complex human behaviors related to crime require multiple sources of information to understand them. Social Media is a place where people share opinions and news. This allows events in the physical world like crimes to be reflected on Social Media. In this paper we study crimes from the perspective of Social Media, specifically car theft and Twitter. We use data of car theft reports from Twitter and car insurance companies in Chile to perform a temporal analysis. We found that there is an increasing correlation in recent years between the number of car theft reports in Twitter and data collected from insurance companies. We performed yearly, monthly, daily and hourly analyses. Though Twitter is an unstructured and very noisy source, it allows us to estimate the volume of thefts that are reported by the insurers. We experimented with a Moving Average to predict the tendency in the number of car thefts reported to insurers using Twitter data and found that one month is the best time window for prediction.
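As a rough illustration of the Moving Average mentioned above, the sketch below computes a trailing simple moving average over a daily count series. The function name `moving_average`, the 30-day stand-in for "one month", and the sample counts are all assumptions made for this example; the abstract does not specify how the window was implemented or which data format was used.

```python
def moving_average(series, window):
    """Trailing simple moving average.

    Returns one value per position from index window - 1 onward; each
    value is the mean of the `window` most recent observations.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            running -= series[i - window]  # drop the value leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Made-up daily counts of tweets reporting car thefts (60 days).
daily_reports = [4, 7, 5, 6, 9, 3, 8] * 8 + [5, 6, 7, 4]
monthly_trend = moving_average(daily_reports, window=30)
```

The last element of `monthly_trend` can then be read as a naive estimate of the current tendency; comparing several window lengths against the insurers' series would be one way to select the window, as the abstract reports doing.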
Recent work suggests that certain places can be more attractive for car theft based on how many people regularly visit them, as well as other factors. In this sense, we must also consider the city or district itself where vehicles are stolen. All cities have different cultural and socioeconomic characteristics that influence car theft patterns. In particular, the distribution of public services and of places that attract large crowds could play a key role in the occurrence of car theft. Santiago, a city that displays drastic socioeconomic differences among its districts, presents increasingly high car theft rates. This represents a serious issue for the city, as for any other major city, which (at least for Santiago) has not been analyzed in depth using quantitative approaches. In this work, we present a preliminary study of how places that create social interest, such as restaurants, bars, schools, and shopping malls, increase car theft frequency in Santiago. We also study whether some types of places are more attractive than others for this type of crime. To evaluate this, we propose to analyze car theft points (CTP) from insurance companies and their relationship with places of social interest (PSI) extracted from Google Maps, using a proximity-based approach. Our findings show a high correlation between CTP and PSI for all of the social interest categories that we studied in the different districts of Santiago. In particular, our work contributes to the understanding of the social factors that are associated with car thefts.
The world of video games has changed considerably in recent years. Its diversification has dramatically increased the number of users engaged in online communities of this entertainment area, and consequently, the number and types of games available. This context of information overload underpins the development of recommender systems that could leverage the information that the video game platforms collect, hence following the trend of new games coming out every year. In this work we test the potential of state-of-the-art recommender models based respectively on Factorization Machines (FM), deep neural networks (DeepNN) and one derived from the mixture of both (DeepFM), chosen for their potential of receiving multiple inputs as well as different types of input variables. We evaluate our results by measuring the ranking accuracy of the recommendation and the diversity/novelty of a recommendation list. All the algorithms achieve better results than a baseline based on implicit feedback (Alternating Least Squares model). The best performing algorithm is DeepNN, which indicates that high-order interactions are more important than low-order ones for this recommendation task. We also analyze the effect of the sentiment extracted directly from game reviews, and find that it is not as relevant for recommendation as one might expect. We are the first to study the aforementioned recommender systems in the context of online video game platforms, reporting novel results which could be used as a baseline in future work.
Graph serialization is very important for the development of graph-oriented applications. In particular, serialization methods are fundamental in graph data management to support database exchange, benchmarking of systems, and data visualization. This paper presents YARS-PG, a data format for serializing property graphs. YARS-PG was designed to be simple, extensible and platform independent, and to support all the features provided by the current database systems based on the property graph data model.
Chile is a country with high levels of digital news consumption but decreasing levels of confidence in journalism and traditional news media outlets. In a place where natural disasters are common, Chilean citizens usually turn to digital and social media to find out more information about how events unfold. By relying on in-depth interviews with reporters who covered the 2014 earthquake in northern Chile, this study examines how Chilean journalists approached a highly politicized natural disaster. Results show that reporters covered the earthquake as a political issue due to editorial prompting, and they used social media as another way to get close to the sources they already know, but not to look for alternative sources. The implications of these findings for media scholars and practitioners relate to the normalization of social media use among journalists, and the influence of a news outlet’s political leaning on journalistic practices.
https://isoj.org/research/exposing-the-president-the-political-angle-of-a-natural-disaster-in-chile/
This article presents WekaCoin, a peer-to-peer cryptocurrency based on a new distributed consensus protocol called Proof-of-Learning. Proof-of-learning achieves distributed consensus by ranking machine learning systems for a given task. The aim of this protocol is to alleviate the computational waste involved in hashing-based puzzles and to create a public distributed and verifiable database of state-of-the-art machine learning models and experiments.
Different pattern recognition techniques such as clustering, k-nearest neighbor classification, or instance reduction algorithms require prototypes to represent pattern classes. In many applications, strings are used to encode instances, for example, in contour representations or in biological data such as DNA, RNA, and protein sequences. Median strings have been used as representatives of a set of strings in different domains. Finding the median string is an NP-Complete problem for several formulations. Alternatively, heuristic approaches that iteratively refine an initial coarse solution by applying edit operations have been proposed. We propose here a novel algorithm that outperforms the state-of-the-art heuristic approximations to the median string in terms of convergence speed by estimating the effect of a perturbation on the minimization of the expressions that define the median string. We present comparative experiments to validate these results.
Recommender Systems help us deal with information overload by suggesting relevant items based on our personal preferences. Although there is a large body of research in areas such as movies or music, artwork recommendation has received comparatively little attention, despite the continuous growth of the artwork market. Most previous research has relied on ratings and metadata, and a few recent works have exploited visual features extracted with deep neural networks (DNN) to recommend digital art. In this work, we contribute to the area of content-based artwork recommendation of physical paintings by studying the impact of the aforementioned features (artwork metadata, neural visual features), as well as manually-engineered visual features, such as naturalness, brightness and contrast. We implement and evaluate our method using transactional data from UGallery.com, an online artwork store. Our results show that artwork recommendations based on a hybrid combination of artist preference, curated attributes, deep neural visual features and manually-engineered visual features produce the best performance. Moreover, we discuss the trade-off between automatically obtained DNN features and manually-engineered visual features for the purpose of explainability, as well as the impact of user profile size on predictions. Our research informs the development of next-generation content-based artwork recommenders which rely on different types of data, from text to multimedia.
We present D-v2v, a new dynamic (one-pass) variable-to-variable compressor. Variable-to-variable compression aims at using a modeler that gathers variable-length input symbols and a variable-length statistical coder that assigns shorter codewords to the more frequent symbols. In D-v2v, we process the input text word-wise to gather variable-length symbols that can be either terminals (new words) or non-terminals, subsequences of words seen before in the input text. Those input symbols are set in a vocabulary that is kept sorted by frequency. Therefore, those symbols can be easily encoded with dense codes. Our D-v2v permits real-time transmission of data, i.e. compression/transmission can begin as soon as data become available. Our experiments show that D-v2v is able to overcome the compression ratios of the v2vDC, the state-of-the-art semi-static variable-to-variable compressor, and to almost reach p7zip values. It also draws a competitive performance at both compression and decompression.
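To make the link between a frequency-sorted vocabulary and dense codes concrete, here is a minimal sketch. It assumes an End-Tagged Dense Code style byte code (the abstract does not say which dense code variant D-v2v uses), and `build_codebook` is a hypothetical two-pass helper, unlike the one-pass dynamic scheme the abstract describes.

```python
from collections import Counter


def etdc_encode(rank):
    """Byte codeword for a 0-based frequency rank.

    The last byte has its high bit set (128..255); earlier bytes are 0..127.
    Ranks 0..127 take one byte, the next 128**2 ranks take two, and so on.
    """
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(code):
    """Inverse of etdc_encode for a single codeword."""
    rank = 0
    for byte in code[:-1]:
        rank = (rank + byte + 1) * 128
    return rank + code[-1] - 128


def build_codebook(symbols):
    """Hypothetical helper: sort symbols by frequency and assign codewords."""
    by_frequency = [s for s, _ in Counter(symbols).most_common()]
    return {s: etdc_encode(r) for r, s in enumerate(by_frequency)}


codebook = build_codebook("to be or not to be".split())
print({word: code.hex() for word, code in codebook.items()})
```

Because the codeword depends only on the rank, keeping the vocabulary sorted by frequency is enough to give the most frequent symbols the shortest codes; no code tree has to be stored or rebuilt.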
The raster model is widely used in Geographic Information Systems to represent data that vary continuously in space, such as temperature, precipitation, and elevation, among other spatial attributes. In applications like weather forecast systems, not just a single raster, but a sequence of rasters covering the same region at different timestamps, known as a raster time series, needs to be stored and queried. Compact data structures have proven successful in providing space-efficient representations of rasters with query capabilities. Hence, a naive approach to save space is to use such a representation for each raster in a time series. However, in this paper, we show that it is possible to take advantage of the temporal locality that exists in a raster time series to reduce the space necessary to store it while keeping competitive query times for several types of queries.
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called warded Datalog±, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog± is extended to give a set semantics to duplicates in Datalog± itself. We investigate the properties of the resulting Datalog± programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. Finally, we gather the main results of this work to present an efficient and declarative framework for CEP.
Recommender systems are frequently evaluated using performance indexes based on variants and extensions of precision-like measures. As these measures are biased toward popular items, a list of recommendations simply must include a few popular items to perform well. To address the popularity bias challenge, new approaches for novelty and diversity evaluation have been proposed. On the one hand, novelty-based approaches model the quality of being new as opposed to that which is already known. Novelty approaches are commonly based on item views or user ratings. On the other hand, diversity approaches model the quality of an item that is composed of different content elements. Diversity measures are commonly rooted in content-based features that characterize the diversity of the content of an item in terms of the presence/absence of a number of predefined nuggets of information. As item contents are also biased to popular contents (e.g., drama in movies or pop in music), diversity-based measures are also popularity biased. To alleviate the effect of popularity bias on diversity measures, we used an evaluation approach based on the degree of novelty of the elements that make up each item. We named this approach content novelty, as it mixes content and diversity approaches in a single and coherent evaluation framework. Experimental results show that our proposal is feasible and useful. Our findings demonstrate that the proposed measures yield consistent and interpretable results, producing insights that reduce the impact of popularity bias in the evaluation of recommender systems.
https://doi.org/10.1007/s10844-019-00548-x
The ubiquitous Internet of Things (IoT) devices nowadays are generating various and numerous data from everywhere at any time. Since it is not always necessary to centralize and analyze IoT data cumulatively (e.g., the Monte Carlo analytics and Convergence analytics demonstrated in this article), the traditional implementations of big data analytics (BDA) will suffer from unnecessary and expensive data transmissions as a result of the tight coupling between computing resource management and data processing logic. Inspired by software-defined infrastructure (SDI), we propose the “micro service-oriented platform” to break the environmental monolith and further decouple data processing logic from its underlying resource management in order to facilitate BDA implementations in the IoT environment (which we name “IoBDA”). Given predesigned standard microservices with respect to specific data processing logic, the proposed platform is expected to largely reduce the complexity in and relieve inexperienced practices of IoBDA implementations. The potential contributions to the relevant communities include (1) new theories of a micro service-oriented platform on top of SDI and (2) a functional micro service-oriented platform for IoBDA with a group of predesigned microservices.
One of the important roles of the mass media is the setting of agendas in daily life. What is emphasized in the media, whether traditional or digital, has been found to have a profound impact on not only what people think, but the salience of the issues at any given point in time. This chapter reviews the theory behind agenda-setting and the variables that form, shape, and prime the public’s opinions, attitudes, and behaviors. The chapter also examines the new media and how it impacts on agenda-setting theory and research.
There are very few works about explaining content-based recommendations of images in the artistic domain. Current works do not provide a perspective of the many variables involved in the user perception of several aspects of the system such as domain knowledge, relevance, explainability, and trust. In this paper, we aim to fill this gap by studying three interfaces, with different levels of explainability, for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability and relevance. Furthermore, our results show that the observed effects are also dependent on the underlying recommendation algorithm used. We tested two algorithms: Deep Neural Networks (DNN), which has high accuracy, and Attractiveness Visual Features (AVF) with high transparency but lower accuracy. Our results indicate that algorithms should not be studied in isolation, but rather in conjunction with interfaces, since both play a significant role in the perception of explainability and trust for image recommendation. Finally, using the framework by Knijnenburg et al., we provide a comprehensive model which synthesizes the effects between different variables involved in the user experience with explainable visual recommender systems of artistic images.
Conditional Cash Transfers (CCTs) have spread through the developing world in the past two decades. It is often assumed that CCTs enjoy political support in the population precisely because they impose conditions on beneficiaries. This article employs survey experiments in Brazil and Turkey to determine whether, and in what contexts, making government transfers conditional on the behavior of beneficiaries increases political support for the programs. Results show that conditional transfers are only marginally more popular than similar unconditional transfers in nationally representative samples, but that this difference is substantially larger among the better-off and among those primed to think of themselves as different from beneficiaries. These findings imply that conditionalities per se are not as strong a determinant of support for transfers as the literature suggests, but that they can still be helpful in building support for transfers among subsets of the population that are least likely to support them.
Tag clouds have been utilized as a “social” way to find and visualize information, providing both one-click access and a snapshot of the “aboutness” of a tagged collection. While many research projects have explored and compared various tag artifacts using information theory and simulations, fewer studies have been conducted to compare the effectiveness of different tag-based browsing interfaces from the user’s point of view. This research aims to investigate how users utilize tags in image search context and to what extent different organizations of tag browsing interfaces are useful for image search. We conducted two experiments to explore user behavior and performance with three interfaces: two tag-enabled interfaces (the regular and faceted tag-clouds) and a baseline (search-only) interface. Our results demonstrate the value of tags in the image search context, the role of tags in the exploratory search, and the strengths of two kinds of tag organization explored in this paper.
The Modified Mercalli intensity scale (Mercalli scale for short) is a qualitative measure used to express the perceived intensity of an earthquake in terms of damages. Accurate intensity reports are vital to estimate the type of emergency response required for a particular earthquake. In addition, Mercalli scale reports are needed to estimate the possible consequences of strong earthquakes in the future, based on the effects of previous events. Emergency offices and seismological agencies worldwide are in charge of producing Mercalli scale reports for each affected location after an earthquake. However, this task relies heavily on human observers in the affected locations, who are not always available or accurate. Consequently, Mercalli scale reports may take up to hours or even days to be published after an earthquake. We address this problem by proposing a method for early prediction of spatial Mercalli scale reports based on people’s reactions to earthquakes in social networks. By tracking users’ comments about real-time earthquakes, we create a collection of Mercalli scale point estimates at municipality (i.e., state subdivisions) level granularity. We introduce the concept of reinforced Mercalli support, which combines Mercalli scale point estimates with locally supported data (named ‘local support’). We use this concept to provide Mercalli scale estimates for real-world events by providing smooth point estimates using a spatial smoother that incorporates the distribution of municipalities in each affected region. Our method is the first method based on social media that can provide spatial reports of damages in the Mercalli intensity scale. Experimental results show that our method is accurate and provides early spatial Mercalli reports 30 minutes after an earthquake. Furthermore, we show that our method performs well for earthquake spatial detection and maximum intensity prediction tasks. 
Our findings indicate that social media is a valuable source of spatial information for quickly estimating earthquake damages.
We introduce a compressed data structure for the storage of free trajectories of moving objects that efficiently supports various spatio-temporal queries. Our structure, dubbed GraCT, stores the absolute positions of all the objects at regular time intervals (snapshots) using a k²-tree, which is a space- and time-efficient region quadtree. Positions between snapshots are represented as logs of relative movements and compressed using a grammar-based compressor. The non-terminals of this grammar are enhanced with MBR information to enable fast queries.
The GraCT structure of a dataset occupies less than the raw data compressed with a powerful traditional compressor. Further, instead of requiring full decompression to access the data like a traditional compressor, GraCT supports direct access to object trajectories or to their position at specific time instants, as well as spatial range and nearest-neighbor queries on time instants and/or time intervals.
Compared to traditional methods for storing and indexing spatio-temporal data, GraCT requires two orders of magnitude less space and is competitive in query times. In particular, thanks to its compressed representation, the GraCT structure may reside in main memory in situations where any classical uncompressed index must resort to disk, thereby being one or two orders of magnitude faster.
A key aspect of visual question answering (VQA) models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human-annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with respect to manually annotated groundings, while achieving state-of-the-art VQA accuracy.
Query processing is an important challenge for spatial databases due to the use of complex data types that represent spatial attributes. In particular, due to the cost of spatial joins, several optimization algorithms based on indexing structures exist. The work in this paper proposes a strategy for semantic query optimization of spatial join queries. The strategy detects queries with empty results and rewrites queries to eliminate unnecessary spatial joins or to replace spatial joins by thematic joins. This is done automatically by analyzing the semantics imposed by the database schema through topological dependencies and topological referential integrity constraints. In this way, the strategy complements current state-of-the-art algorithms for processing spatial join queries. The experimental evaluation with real data sets shows that the optimization strategy can achieve a decrease in the time cost of a join query using indexing structures in a spatial database management system (SDBMS).
People’s perceptions of the built environment influence the way they use and navigate it. Understanding these perceptions may be useful to inform the design, management and planning process of public spaces. Recently, several studies have used data collected at a massive scale and machine learning methods to quantify these perceptions, showing promising results in terms of predictive performance. Nevertheless, most of these models can be of little help in understanding users’ perceptions due to the difficulty associated with identifying the importance of each attribute of landscapes. In this work, we propose a novel approach to quantify perceptions of landscapes through discrete choice models, using semantic segmentations of images of public spaces, generated through machine learning algorithms, as explanatory variables. The proposed models are estimated using the Place Pulse dataset, with over 1.2 million perceptual indicators, and are able to provide useful insights into how users perceive the built environment as a function of its features. The models obtained are used to infer perceptual variables in the city of Santiago, Chile, and show they have a significant correlation with socioeconomic indicators.
Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.
European politics at the turn of the 19th century saw a dramatic reduction in the number and diversity of polities as the territorial nation-state emerged as the dominant form of political organization. The transformation had a profound impact on the periphery. The study examines how embracing the principle of territoriality transformed relations between settler societies and indigenous peoples in South America. As this shift coincided with independence from Spain, Creole elites rapidly dismantled the remnants of imperial heteronomy, ending centuries of inter-cultural diplomacy. The study illustrates this shift in the case of the “Southern frontier,” where Spain had maintained a practice of treaty making with the Mapuche people since the mid-17th century. This long-standing practice broke down shortly after Chile gained independence in 1818. What followed was a policy of coercive assimilation through military conquest and forced displacement — a policy that settler societies implemented elsewhere in the 19th century. In contrast to explanations that emphasize the spread of capitalist agriculture and racist ideologies, this study argues that territoriality spelled the end of inter-cultural diplomacy along the “Southern frontier.”
This work presents and studies the efficiency problem of mapping GPU threads onto simplex domains. A non-linear map λ(ω) is formulated based on a block-space enumeration principle that reduces the number of thread-blocks by a factor of approximately 2× and 6× for 2-simplex and 3-simplex domains, respectively, when compared to the standard approach. Performance results show that λ(ω) is competitive and even the fastest map when run on recent GPU architectures such as the Tesla V100, where it reaches up to 1.5× of speedup in 2-simplex tests. In 3-simplex tests, it reaches up to 2.3× of speedup for small workloads and up to 1.25× for larger ones. The results obtained make λ(ω) a useful GPU optimization technique with applications to parallel problems that define all-pairs, all-triplets or nearest-neighbor interactions in a 2-simplex or 3-simplex domain.
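The approximate 2× and 6× reductions quoted above follow from simple counting, sketched below. It assumes that the "standard approach" launches a full bounding box of n×n (or n×n×n) blocks. The `linear_to_triangle` function is the classic square-root index map, shown only as an illustration of enumerating a triangle directly; it is not the paper's λ(ω), and all names here are hypothetical.

```python
from math import isqrt


def bounding_box_blocks(n, dim):
    """Blocks launched when the whole n^dim box is covered."""
    return n ** dim


def simplex_blocks(n, dim):
    """Blocks that actually lie inside a discrete 2- or 3-simplex of side n."""
    if dim == 2:
        return n * (n + 1) // 2
    if dim == 3:
        return n * (n + 1) * (n + 2) // 6
    raise ValueError("only dim 2 and 3 are sketched here")


def linear_to_triangle(k):
    """Map a linear index k to (row, col) with col <= row (lower triangle)."""
    row = (isqrt(8 * k + 1) - 1) // 2
    col = k - row * (row + 1) // 2
    return row, col


n = 1024
for dim in (2, 3):
    ratio = bounding_box_blocks(n, dim) / simplex_blocks(n, dim)
    print(dim, round(ratio, 3))
print([linear_to_triangle(k) for k in range(6)])
```

The ratios approach 2 and 6 as n grows because a 2-simplex fills half of its bounding square and a 3-simplex one sixth of its bounding cube. Using integer `isqrt` avoids the floating-point rounding issues a naive `sqrt` would introduce for large indices.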
Graphs are being increasingly adopted as a flexible data model in scenarios (e.g., Google’s Knowledge Graph, Facebook’s Graph API, Wikidata, etc.) where multiple editors are involved in content creation, where the schema is ever changing, where data are incomplete, and where the connectivity of resources plays a key role, that is, scenarios where relational models traditionally struggle. But with this flexibility comes a conceptual cost: it can be difficult to summarise and understand, at a high level, the content that a given graph contains. Hence profiling graphs becomes of increasing importance to extract order, a posteriori, from the chaotic processes by which such graphs are generated. This talk will motivate the use of graphs as a data model, abstract recent trends in graph data management, and then turn to the issue of profiling and summarising graphs: what are the goals of such profiling, the principles by which graphs can be summarised, and the main techniques by which this can/could be achieved? The talk will emphasise the importance of profiling graphs while highlighting a variety of open research questions yet to be tackled.
Jan Van den Bussche, Marcelo Arenas: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018. ACM 2018.
This volume contains the proceedings of PODS 2018, which include a paper for the keynote address by Michael Benedikt (University of Oxford), abstracts based on two invited tutorials by Rajeev Raman (University of Leicester) and Arvind Narayanan (Princeton University), and 29 contributions that were selected by the Program Committee for presentation at the symposium.
In addition, this volume also contains papers from our two «Gems of PODS» speakers, Hung Ngo (Relational AI) and Phokion G. Kolaitis (UC Santa Cruz and IBM Research – Almaden). The Gems of PODS is an event, started in 2016, where the goal is to promote understanding of past seminal PODS results to the general audience. The Gems of PODS papers were selected by the Gems of PODS committee consisting of Marcelo Arenas (chair) (Pontificia Universidad Católica de Chile), Tova Milo (Tel Aviv University) and Dan Olteanu (Oxford University).
This year, PODS continued with two submission cycles that were introduced two years ago. The first cycle allowed for the possibility for papers to be revised and resubmitted. For the first cycle, 30 papers were submitted, 7 of which were directly selected for inclusion in the proceedings, and 5 were invited for a resubmission after a revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 4 out of 5 revised papers were selected for the proceedings. For the second cycle, 53 papers were submitted, 18 of which were selected, resulting in 29 papers selected overall from a total number of 83 submissions.
An important task for the Program Committee has been the selection of the PODS 2018 Best Paper Award. The committee selected the paper: «Entity Matching with Active Monotone Classification» by Yufei Tao. On behalf of the committee, we would like to extend our sincere congratulations to the author!
Since 2008, PODS assigns the ACM PODS Alberto O. Mendelzon Test-of-Time Award to a paper or a small number of papers published in the PODS proceedings ten years prior that had the most impact over the intervening decade. This year’s committee, consisting of Maurizio Lenzerini, Wim Martens and Nicole Schweikardt, selected the following paper: «The Chase Revisited» by Alin Deutsch, Alan Nash and Jeff Remmel.
Recent availability of data about writing processes at keystroke-granularity has enabled research on the evolution of document writing. A natural task is to develop systems that can actually show this data, that is, user interfaces that transform the data of the process of writing –today a black box– into intelligible forms. On this line, we propose a data structure that captures a document’s fine-grained history and an organic visualization that serves as an interface to it. We evaluate a proof-of-concept implementation of the system through a pilot study using documents written by students at a public university. Our results are promising and reveal facets such as general strategies adopted, local edition density and hierarchical structure of the final text.
In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
Link: https://doi.org/10.1145/3178876.3186016
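As a rough illustration of the ideas in the abstract above, the following sketch extracts characteristic sets from a toy list of triples, links them by direct set containment, and computes a simple diff between two versions. The function names, the toy data and the exact definition of the diff are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def characteristic_sets(triples):
    """Map each entity to its set of properties, then count entities per set."""
    props = {}
    for subject, prop, _obj in triples:
        props.setdefault(subject, set()).add(prop)
    return Counter(frozenset(p) for p in props.values())

def containment_edges(char_sets):
    """Direct (subset, superset) pairs: the Hasse diagram of set containment."""
    sets = list(char_sets)
    edges = []
    for a in sets:
        for b in sets:
            # keep a -> b only if no other set lies strictly between them
            if a < b and not any(a < c < b for c in sets):
                edges.append((a, b))
    return edges

def diff(new, old):
    """Change in cardinality per characteristic set (hypothetical diff operator)."""
    keys = set(new) | set(old)
    return {k: new.get(k, 0) - old.get(k, 0)
            for k in keys if new.get(k, 0) != old.get(k, 0)}

# Toy data, invented for this example.
triples_v1 = [
    ("alice", "name", "Alice"),
    ("alice", "birthDate", "1990"),
    ("bob", "name", "Bob"),
    ("santiago", "name", "Santiago"),
    ("santiago", "population", "5000000"),
]
triples_v2 = triples_v1 + [("bob", "birthDate", "1985")]

cs1 = characteristic_sets(triples_v1)
cs2 = characteristic_sets(triples_v2)
edges = containment_edges(cs1)
changes = diff(cs2, cs1)
```

Here `changes` records that one entity moved from the set {name} to the set {name, birthDate} between the two versions, which is the kind of summary a lattice diff is meant to provide.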
Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web still resides in other formats, and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language that brings data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the number of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also run a set of experiments that empirically confirm the optimality of our approach.
Graph data management concerns the research and development of powerful technologies for storing, processing and analyzing large volumes of graph data. This chapter presents an overview of the foundations and systems for graph data management. Specifically, we present a historical overview of the area, study graph database models, characterize essential graph-oriented queries, review graph query languages, and explore the features of current graph data management systems (i.e. graph databases and graph-processing frameworks).
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users’ trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments.
CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu–Tucker-shaped Wavelet-Tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50–70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1–1000 µs) of CTR.
Highlights:
- We provide a representation for trips over networks and answer counting-based queries.
- We adapt a Compressed Suffix Array to deal with the spatial component of trips.
- We use a Wavelet Matrix or a Hu–Tucker-shaped Wavelet Tree for the temporal component.
- Experiments show space needs down to about 50% of those of a plain representation.
- Experiments show counting-based query times typically within 1–1000 µs.
The aim of this research is to measure, using smart card data, how much public transport users change their behavior over time. To quantify the change in behavior, we split the smart card records of a user into a set of separate time windows. Then, we measure the variation between each pair of time windows. Three algorithms that calculate the variation in users’ mobility are assessed. Using data from a Canadian transport system, we show that measuring the stability of user behavior at an individual level provides new insights for public transport operators, e.g., it can be used to measure users’ adaptability to changes in the transport system.
The professionalization of politics and the disappearance of party organizations based on activists seems an inescapable trend. This article shows, by studying the Broad Front of Uruguay as a deviant case, the relevance of organizational rules for explaining the reproduction of party activism. Using data from both an online survey of people differing in their levels of engagement with the Broad Front and in-depth interviews with party activists, we show that those with relatively low levels of engagement—“adherents”—and activists differ in their willingness to cooperate with the party and in the time they devote to party activities. Also, we find that reducing the perceived efficacy of political engagement strongly decreases activists’ self-reported willingness to engage with the party, while this reduction has no effect upon adherents. These findings suggest that the design of organizational rules that grant a political role to grassroots organizers can promote party activism.
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.
Highlights:
- We consider the problem of representing temporal graphs in a compact way.
- We can represent temporal graphs with contacts on the same edge that temporally overlap.
- We design TGCSA and show how it solves typical temporal queries.
- We create a novel representation of Ψ that improves the performance of TGCSA.
- We obtain a reasonable space/time tradeoff even on complex temporal queries.
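To make the notion of a contact concrete, here is a minimal uncompressed sketch of a temporal graph as a list of contacts, with two typical queries answered by a linear scan. It does not implement TGCSA or any compact structure; the `Contact` type, the half-open interval convention and the function names are assumptions made for illustration.

```python
from typing import NamedTuple

class Contact(NamedTuple):
    source: str
    target: str
    start: int  # first time instant at which the edge is active (inclusive)
    end: int    # time instant at which the edge stops being active (exclusive)

def active_edges(contacts, t):
    """All contacts (as edge pairs) that are active at time t."""
    return [(c.source, c.target) for c in contacts if c.start <= t < c.end]

def neighbours_at(contacts, node, t):
    """Distinct direct neighbours of `node` at time t, in sorted order."""
    return sorted({c.target for c in contacts
                   if c.source == node and c.start <= t < c.end})

# Toy data: the first two contacts overlap in time on the same edge.
contacts = [
    Contact("a", "b", 0, 10),
    Contact("a", "b", 5, 15),
    Contact("a", "c", 12, 20),
    Contact("b", "c", 3, 4),
]

at_7 = neighbours_at(contacts, "a", 7)
at_13 = neighbours_at(contacts, "a", 13)
```

A scan like this costs time linear in the number of contacts per query; the point of a compact self-index is to answer the same queries without scanning and in much less space.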
We present a demo of the GraFa faceted search and browsing interface over the Wikidata knowledge graph. We describe the key aspects of the interface, including the types of interactions that the system allows, the ranking schemes employed, and other features to aid usability. We also discuss future plans for improving the system. Online Demo: http://grafa.dcc.uchile.cl/
Link: http://ceur-ws.org/Vol-2180/paper-44.pdf
Docker is the most popular implementation of Operating System virtualization; its online registry service (Docker Hub) currently stores more than 4.5 million software images. Using that registry it is possible to download and deploy Docker images as software containers. However, these images only show information about the main software, hiding the dependencies needed to run it. To allow users to track what they deploy onto their machines, we developed DockerPedia, a resource that publishes information about the packages within Docker images as Linked Data.
Currently our resource includes 28% of the most downloaded images from Docker Hub, providing information about their software dependencies and vulnerabilities. This makes it easy to reproduce the environment in which each image was deployed, as well as to check the security of an image without the need to download it.
Link: http://ceur-ws.org/Vol-2180/paper-47.pdf
We demonstrate a system to canonicalise (aka. normalise) SPARQL queries for use-cases such as caching, query log analysis, query minimisation, signing queries, etc. Our canonicalisation method deterministically rewrites a given input query to an equivalent canonical form such that the canonical forms of two queries are syntactically (string) equal if and only if the queries give the same results on any database, modulo variable names. The method is sound and complete for a monotone fragment of SPARQL with selection (equalities), projection, join and union under both set and bag semantics. When other SPARQL features are considered (e.g., optional, filter, graph, etc.), the underlying equivalence problem becomes undecidable, so we currently support only a best-effort canonicalisation for other SPARQL 1.0 features. We demonstrate a prototype of our canonicalisation framework, provide example rewritings, and discuss limitations, use-cases and future work. Demo link: http://qcan.dcc.uchile.cl
One of the challenges of recent RDF-based applications is managing data quality [1], and several systems already provide RDF validation procedures (e.g., https://www.stardog.com/docs/, https://www.topquadrant.com/technology/shacl/). This created the need for a standardized declarative constraint language for RDF, and for mechanisms to detect violations of such constraints. An important step in this direction is SHACL, or Shapes Constraint Language (https://www.w3.org/TR/shacl/), which became a W3C recommendation in 2017. The SHACL specification, however, explicitly leaves undefined the validation of recursive constraints. In a previous article [2], we showed that extending the specification’s semantics to accommodate recursion leads to intractability (in the size of the graph) for the so-called “core constraint components” of SHACL. This result already holds for stratified constraints, which may come as a surprise, considering that stratification guarantees tractability in well-studied recursive languages such as Datalog. Our previous work identified a tractable fragment of SHACL’s core components. In this paper, we propose an alternative approach to gain tractability, retaining all SHACL operators, but strengthening the stratification condition traditionally used in logic programming. More exactly, we introduce a syntactic condition on shape constraints called “strict stratification”, which guarantees that graph validation is in PTIME in combined (i.e. graph and constraints) complexity. We also describe a procedure to perform such validation. The current paper is not self-contained, due to space limitations, but all definitions can be found in our previous article [2] or its online extended version [3].
Entity Linking (EL) associates the entities mentioned in a given input text with their corresponding knowledge-base (KB) entries. A recent EL trend is towards multilingual approaches. However, one may ask: are multilingual EL approaches necessary with recent advancements in machine translation? Could we not simply focus on supporting one language in the EL system and translate the input text to that language? We present experiments along these lines comparing multilingual EL systems with their results over machine translated text.
Link: http://ceur-ws.org/Vol-2180/paper-53.pdf
Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined task of providing semantic segmentation for different actor classes simultaneously with their action class remains a challenging but necessary task for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning in the video to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying our approach improves performance for zero-shot recognition, indicating generalizability of our jointly learned feature space.
Link: https://doi.org/10.1007/978-3-030-01225-0_43
Explaining automatic recommendations is an active area of research since it has shown an important effect on users’ acceptance of the items recommended. However, there is a lack of research in explaining content-based recommendations of images based on visual features. In this paper, we aim to fill this gap by testing three different interfaces (one baseline and two novel explanation interfaces) for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability, relevance, and diversity. Furthermore, our experiments show that the results are also dependent on the underlying recommendation algorithm used. We tested the interfaces with two algorithms: Deep Neural Networks (DNN), with high accuracy but with difficult-to-explain features, and the more explainable method based on Attractiveness Visual Features (AVF). The better the accuracy performance (in our case the DNN method), the stronger the positive effect of the explainable interface. Notably, the explainable features of the AVF method increased the perception of explainability but did not increase the perception of trust, unlike DNN, which improved both dimensions. These results indicate that algorithms in conjunction with interfaces play a significant role in the perception of explainability and trust for image recommendation. We plan to further investigate the relationship between interface explainability and algorithmic performance in recommender systems.
Representing the trajectories of mobile objects is a hot topic owing to the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains), where users’ trips can be seen as a sequence of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey.
To address data management for transportation systems, we designed a conceptual model that gave us better insight into this data domain and allowed us to define relevant terms and detect sources of redundancy in the data. Then, we designed two compact representations focused on users’ trips (TTCTR) and on vehicle trips (AcumM), respectively. Each approach has its own strengths and is able to answer some queries efficiently.
We include experimental results over synthetic trips generated from accurate schedules obtained from a real network description (from the bus transportation system of Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.
Link: https://doi.org/10.1007/978-3-030-00479-8_8
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with corresponding entities in a given knowledge base. While traditional EL approaches have largely focused on English texts, current trends are towards language-agnostic or otherwise multilingual approaches that can perform EL over texts in many languages. One of the obstacles to ongoing research on multilingual EL is a scarcity of annotated datasets with the same text in different languages. In this work we thus propose VoxEL: a manually-annotated gold standard for multilingual EL featuring the same text expressed in five European languages. We first motivate and describe the VoxEL dataset, using it to compare the behaviour of state of the art EL (multilingual) systems for five different languages, contrasting these results with those obtained using machine translation to English. Overall, our results identify how five state-of-the-art multilingual EL systems compare for various languages, how the results of different languages compare, and further suggest that machine translation of input text to English is now a competitive alternative to dedicated multilingual EL configurations.
Timely detection and accurate description of extreme events, such as natural disasters and other crisis situations, are crucial for emergency management and mitigation. Extreme-event detection is challenging, since one has to rely upon reports from human observers appointed to specific geographical areas, or on an expensive and sophisticated infrastructure. In the case of earthquakes, geographically dense sensor networks are expensive to deploy and maintain. Therefore, only some regions, or even countries, are able to acquire useful information about the effects of earthquakes in their own territory. An inexpensive and viable alternative is to detect extreme real-world events through people’s reactions in online social networks. In particular, Twitter has gained popularity within the scientific community for providing access to real-time “citizen sensor” activity. Nevertheless, the massive number of messages in the Twitter stream, along with the noise it contains, gives rise to a number of difficulties when it comes to Twitter-based event detection. We contribute to addressing these challenges by proposing an online method for detecting unusual bursts in discrete-time signals extracted from Twitter. This method only requires a one-off semisupervised initialization and can be scaled to track multiple signals in a robust manner. We also show empirically how our proposed approach, which was envisioned for generic event detection, can be adapted for worldwide earthquake detection, where we compare the proposed model to the state of the art for earthquake tracking using social media. Experimental results validate our approach as a competitive alternative in terms of precision and recall to leading solutions, with the advantage of implementation simplicity and worldwide scalability.
Link: https://doi.org/10.1109/TMM.2018.2855107
Faceted browsing has become a popular paradigm for user interfaces on the Web and has also been investigated in the context of RDF graphs. However, current faceted browsers for RDF graphs encounter performance issues when faced with two challenges: scale, where large datasets generate many results, and heterogeneity, where large numbers of properties and classes generate many facets. To address these challenges, we propose GraFa: a faceted browsing system for heterogeneous large-scale RDF graphs based on a materialisation strategy that performs an offline analysis of the input graph in order to identify a subset of the exponential number of possible facet combinations that are candidates for indexing. In experiments over Wikidata, we demonstrate that materialisation allows for displaying (exact) faceted views over millions of diverse results in under a second while keeping index sizes relatively small. We also present initial usability studies over GraFa.
Link: https://doi.org/10.1007/978-3-030-00671-6_18
Time series are a common and well-known way of describing temporal data. However, most state-of-the-art techniques for analysing time series have focused on generating a representation for a single level of resolution. To analyse a time series at several levels of resolution, one would need to compute different representations, one for each resolution level. We introduce a multi-resolution representation for time series based on local trends and mean values. We require the level of resolution as a parameter, but it can be automatically computed if we consider the maximum resolution of the time series. Our technique represents a time series using trend-value pairs on each segment belonging to a resolution level. To provide a useful representation for data mining tasks, we also propose dissimilarity measures and a symbolic representation based on the SAX technique for efficient similarity search using a multi-resolution indexing scheme. We evaluate our method for classification and discord discovery tasks over a diversity of data domains, achieving better performance in terms of efficiency and effectiveness compared with some of the best-known classic techniques. Indeed, for some of the experiments, the time series mining algorithms using our multi-resolution representation were an order of magnitude faster, in terms of distance computations, than the state of the art.
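One possible reading of "trend-value pairs on each segment belonging to a resolution level" is sketched below. It assumes that level k splits the series into 2^k contiguous segments of equal length, that the trend is the least-squares slope over sample indices, and that the value is the segment mean. These choices, and the function names, are assumptions for illustration and may differ from the paper's definitions.

```python
def trend_value(segment):
    """Return (slope, mean) for a segment; slope is the least-squares trend."""
    n = len(segment)
    mean_y = sum(segment) / n
    if n == 1:
        return 0.0, mean_y
    mean_x = (n - 1) / 2
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(segment))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den, mean_y

def multi_resolution(series, levels):
    """Level k holds 2**k trend-value pairs (assumed segmentation scheme).

    Assumes len(series) is divisible by 2**(levels - 1).
    """
    representation = []
    for k in range(levels):
        parts = 2 ** k
        size = len(series) // parts
        representation.append(
            [trend_value(series[i * size:(i + 1) * size]) for i in range(parts)]
        )
    return representation

series = [1, 2, 3, 4, 4, 3, 2, 1]
rep = multi_resolution(series, 3)
```

In this toy series the coarsest level sees a flat trend with mean 2.5, while the next level already separates the rising half from the falling half, which illustrates why several resolutions can be more informative than one.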
Visual embeddings from Convolutional Neural Networks (CNN) trained on the ImageNet dataset for the ILSVRC challenge have shown consistently good performance for transfer learning and are widely used in several tasks, including image recommendation. However, some important questions have not yet been answered in order to use these embeddings for a larger scope of recommendation domains: a) Are CNNs that perform better on ImageNet also better for transfer learning in content-based image recommendation? b) Does fine-tuning help to improve performance? c) What is the best way to perform the fine-tuning?
In this paper we compare several CNN models pre-trained with ImageNet to evaluate their transfer learning performance to an artwork image recommendation task. Our results indicate that models with better performance in the ImageNet challenge do not always imply better transfer learning for recommendation tasks (e.g. NASNet vs. ResNet). Our results also show that fine-tuning can be helpful even with a small dataset, but not every fine-tuning works. Our results can inform other researchers and practitioners on how to train their CNNs for better transfer learning towards image recommendation systems.
Link: https://arxiv.org/abs/1807.09870
The progression of action theories is an important problem in knowledge representation. Progression is second-order definable and known to be first-order definable and effectively computable for restricted classes of theories. Motivated by the fact that universal theories with constants (UTCs) are expressive and natural theories whose satisfiability is decidable, in this paper we provide a thorough study of the progression of situation calculus UTCs. First, we prove that progression of a (possibly infinite) UTC is always first-order definable and results in a UTC. Though first-order definable, we show that the progression of a UTC may be infeasible, that is, it may result in an infinite UTC that is not equivalent to any finite set of first-order sentences. We then show that deciding whether there is a feasible progression of a UTC is undecidable. Moreover, we show that deciding whether a sentence (in an expressive fragment of first-order logic) is in the progression of a UTC is CONEXPTIME-complete, and that there exists a family of UTCs for which the size of every feasible progression grows exponentially. Finally, we discuss resolution-based approaches to compute the progression of a UTC. This comprehensive analysis contributes to a better understanding of progression in action theories, both in terms of feasibility and difficulty.
Link: https://aaai.org/ocs/index.php/KR/KR18/paper/view/18074/17173
Blank nodes in RDF graphs can be used to represent values known to exist but whose identity remains unknown. A prominent example of such usage can be found in the Wikidata dataset where, e.g., the author of Beowulf is given as a blank node. However, while SPARQL considers blank nodes in a query as existentials, it treats blank nodes in RDF data more like constants. Running SPARQL queries over datasets with unknown values may thus lead to counter-intuitive results, which may make the standard SPARQL semantics unsuitable for datasets with existential blank nodes. We thus explore the feasibility of an alternative SPARQL semantics based on certain answers. In order to estimate the performance costs that would be associated with such a change in semantics for current implementations, we adapt and evaluate approximation techniques proposed in a relational database setting for a core fragment of SPARQL. To further understand the impact that such a change in semantics may have on query solutions, we analyse how this new semantics would affect the results of user queries over Wikidata.
Caching in the context of expressive query languages such as SPARQL is complicated by the difficulty of detecting equivalent queries: deciding if two conjunctive queries are equivalent is NP-complete, and adding further query features makes the problem undecidable. Despite this complexity, in this paper we propose an algorithm that performs syntactic canonicalisation of SPARQL queries such that the answers for the canonicalised query will not change versus the original. We can guarantee that the canonicalisation of two queries within a core fragment of SPARQL (monotone queries with select, project, join and union) is equal if and only if the two queries are equivalent; we also support other SPARQL features but with a weaker soundness guarantee: that the (partially) canonicalised query is equivalent to the input query. Despite the fact that canonicalisation must be at least as hard as the equivalence problem, we show the algorithm to be practical for real-world queries taken from SPARQL endpoint logs, and further show that it detects more equivalent queries than purely syntactic methods. We also present the results of experiments over synthetic queries designed to stress-test the canonicalisation method, highlighting difficult cases.
With the popularity of RDF as an independent data model came the need for specifying constraints on RDF graphs, and for mechanisms to detect violations of such constraints. One of the most promising schema languages for RDF is SHACL, a recent W3C recommendation. Unfortunately, the specification of SHACL leaves open the problem of validation against recursive constraints. This omission is important because SHACL by design favors constraints that reference other ones, which in practice may easily yield reference cycles.
In this paper, we propose a concise formal semantics for the so-called “core constraint components” of SHACL. This semantics handles arbitrary recursion, while being compliant with the current standard. Graph validation is based on the existence of an assignment of SHACL “shapes” to nodes in the graph under validation, stating which shapes are verified or violated, while verifying the targets of the validation process. We show in particular that the design of SHACL forces us to consider cases in which these assignments are partial, or, in other words, where the truth value of a constraint at some nodes of a graph may be left unknown.
Dealing with recursion also comes at a price, as validating an RDF graph against SHACL constraints is NP-hard in the size of the graph, and this lower bound still holds for constraints with stratified negation. Therefore we also propose a tractable approximation to the validation problem.
Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top-level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the interval-intersection problem, but it has been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure on the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.
Link: https://doi.org/10.1007/978-3-030-00479-8_28
Cost register automata (CRA) and their subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over strings. We study some structural properties, expressiveness, and closure properties of copyless CRA. We show that copyless CRA are strictly less expressive than weighted automata and are not closed under the reverse operation. To find a better class, we impose restrictions on copyless CRA, which yields a new robust computational model that is closed under reverse and other extensions.
We present the first solution to $\tau$-majorities on tree paths. Given a tree of $n$ nodes, each with a label from $[1..\sigma]$, and a fixed threshold $0 < \tau < 1$, such a query gives two nodes $u$ and $v$ and asks for all the labels that appear more than $\tau \cdot |P_{uv}|$ times in the path $P_{uv}$ from $u$ to $v$, where $|P_{uv}|$ denotes the number of nodes in $P_{uv}$. Note that the answer to any query is of size up to $1/\tau$. On a $w$-bit RAM, we obtain a linear-space data structure with $O((1/\tau) \log^* n \log\log_w \sigma)$ query time. For any $\kappa > 1$, we can also build a structure that uses $O(n \log^{[\kappa]} n)$ space, where $\log^{[\kappa]} n$ denotes the function that applies logarithm $\kappa$ times to $n$, and answers queries in time $O((1/\tau) \log\log_w \sigma)$. The construction time of both structures is $O(n \log n)$. We also describe two succinct-space solutions with the same query time as the linear-space structure. One uses $2nH + 4n + o(n)(H+1)$ bits, where $H \le \lg \sigma$ is the entropy of the label distribution, and can be built in $O(n \log n)$ time. The other uses $nH + O(n) + o(nH)$ bits and is built in $O(n \log n)$ time w.h.p.
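To pin down what a τ-majority query asks for, the sketch below answers it by brute force: it walks the path between the two nodes using parent pointers and counts labels. This is only a naive baseline that takes time proportional to the path length, not the data structures described in the abstract; the input encoding (dictionaries `parent`, `depth`, `label`) and the function names are assumptions for illustration.

```python
from collections import Counter

def path_nodes(parent, depth, u, v):
    """Nodes on the path from u to v in a rooted tree given parent pointers."""
    left, right = [], []
    while u != v:
        # always lift the deeper endpoint; they meet at the lowest common ancestor
        if depth[u] >= depth[v]:
            left.append(u)
            u = parent[u]
        else:
            right.append(v)
            v = parent[v]
    return left + [u] + right[::-1]

def tau_majorities(parent, depth, label, u, v, tau):
    """Labels occurring more than tau * |path| times on the path from u to v."""
    path = path_nodes(parent, depth, u, v)
    counts = Counter(label[x] for x in path)
    return sorted(l for l, c in counts.items() if c > tau * len(path))

# Toy tree: 0 is the root; 1 and 2 are its children; 3, 4 hang off 1; 5 off 2.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
depth = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
label = {0: "a", 1: "b", 2: "a", 3: "b", 4: "c", 5: "a"}

half = tau_majorities(parent, depth, label, 3, 5, 0.5)
third = tau_majorities(parent, depth, label, 3, 5, 0.3)
```

The path from node 3 to node 5 has five nodes with labels b, b, a, a, a, so only "a" exceeds half of the path while both "a" and "b" exceed 30% of it. The answer never has more than 1/τ labels, as noted in the abstract.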
An important aspect of exploratory search over graph data is to understand what paths connect a given pair of nodes. Since the resulting paths can be manifold, various works propose ranking paths likely to be of interest to a user; these methods often rely on enumerating all such paths (up to a fixed length or number) before ranking is applied. In this paper, we instead propose applying a shortest path search on weighted versions of the graph in order to directly compute the most relevant path(s) between two nodes without fixed-length bounds, further obviating the need to enumerate irrelevant paths. We investigate weightings based on node degree, PageRank and edge frequency, contrasting the paths produced by these schemes over the Wikidata graph and discussing performance issues. Finally we conduct a user study over Wikidata where evaluators assess the quality of the paths produced; though inter-rater consensus on which paths are of most interest is low, we achieve statistically significant results to suggest that users find the weighted shortest paths more interesting than the baseline shortest paths without weights.
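The idea of running a shortest-path search on a weighted version of the graph can be sketched with Dijkstra's algorithm. In this sketch the cost of stepping into a node is that node's degree, so that paths through highly connected hubs are penalised. That particular weighting, the adjacency-list encoding and the toy graph are assumptions for illustration; the paper's degree, PageRank and edge-frequency weightings may be defined differently.

```python
import heapq

def degree_weighted_shortest_path(adj, source, target):
    """Dijkstra where entering a node costs its degree (an assumed weighting)."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in adj[node]:
            nd = d + len(adj[nxt])
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy undirected graph: "hub" offers a 2-hop route from s to t,
# while x and y offer a 3-hop route through low-degree nodes.
adj = {
    "s": ["hub", "x"],
    "hub": ["s", "t", "h1", "h2", "h3", "h4"],
    "t": ["hub", "y"],
    "x": ["s", "y"],
    "y": ["x", "t"],
    "h1": ["hub"], "h2": ["hub"], "h3": ["hub"], "h4": ["hub"],
}

path = degree_weighted_shortest_path(adj, "s", "t")
```

On this graph the unweighted shortest path goes through the hub, whereas the degree-weighted search prefers the longer route through x and y, and no path enumeration or length bound is needed.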
Experiment reproducibility is the ability to run an experiment, possibly with changes introduced to it, and get results that are consistent with the original ones. To allow reproducibility, the scientific community encourages researchers to publish descriptions of these experiments. However, these recommendations do not include an automated way of creating such descriptions: normally scientists have to annotate their experiments in a semi-automated way. In this paper we propose a system to automatically describe computational environments used in in-silico experiments. We propose to use Operating System (OS) virtualization (containerization) for distributing software experiments through software images, together with an annotation system for describing these software images. Each image is a minimal version of an OS (container) that allows the deployment of multiple isolated software packages within it.
Conjunctive queries (CQs) fail to provide an answer when the pattern described by the query does not exactly match the data. CQs might thus be too restrictive as a querying mechanism when data is semistructured or incomplete. The semantic web therefore provides a formalism—known as (projected) well-designed pattern trees (pWDPTs)—that tackles this problem: pWDPTs allow us to formulate queries that match parts of the query over the data if available, but do not ignore answers of the remaining query otherwise. Here we abstract away the specifics of semantic web applications and study pWDPTs over arbitrary relational schemas. Since the language of pWDPTs subsumes CQs, their evaluation problem is intractable. We identify structural properties of pWDPTs that lead to (fixed-parameter) tractability of various variants of the evaluation problem. We also show that checking if a pWDPT is equivalent to one in our tractable class is in 2EXPTIME. As a corollary, we obtain fixed-parameter tractability of evaluation for pWDPTs with such good behavior. Our techniques also allow us to develop a theory of approximations for pWDPTs.
Link: https://doi.org/10.1145/3233983
We present three pumping lemmas for three classes of functions definable by fragments of weighted automata over the min-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus semiring.
Link: http://drops.dagstuhl.de/opus/volltexte/2018/8498/pdf/LIPIcs-STACS-2018-50.pdf
Domains for spatial and temporal data are often multigranular in nature, possessing a natural order structure defined by spatial inclusion and time-interval inclusion, respectively. This order structure induces lattice-like (partial) operations, such as join, which in turn lead to join rules, in which a single domain element (granule) is asserted to be equal to, or contained in, the join of a set of such granules. In general, the efficient representation of such join rules is a difficult problem. However, there is a very effective representation in the case that the rule is bigranular; i.e., all of the joined elements belong to the same granularity, and, in addition, complete information about the (non)disjointness of all granules involved is known. The details of that representation form the focus of the paper.
Using Chile as a case study for understanding tactical distribution under extensive controls on expenditure, this paper examines whether political motives affect the allocation of funds from the central government to localities. Collecting local-level data of two infrastructure funding programs and using the voting gap percentage between the coalition candidate and opposition competitors in a Sharp Regression Discontinuity methodology, we find causal evidence in favor of three hypotheses: (i) a coalition criterion influences the funding allocation to the local level; (ii) an electoral cycle exists in local funding; and (iii) the degree of coalition targeting varies based on a locality’s history of coalition alignment. In sum, the central government regards politically aligned mayors as valuable electoral assets, especially in municipalities historically aligned with the coalition.
Ontology-based data access (OBDA) is a successful application of knowledge representation and reasoning technologies in information management systems. One premier goal is to facilitate access to data that is heterogeneous and incomplete. This is achieved via an ontology that enriches the user query, typically a union of conjunctive queries, with domain knowledge. It turned out that the ontology and the user query can be seen as two components of one composite query, called ontology-mediated query (OMQ).
The problem of answering OMQs is thus central to OBDA. There is a consensus that the required level of scalability in OMQ answering can be achieved by using standard database management systems. To this end, a standard approach used nowadays is query rewriting: the ontology O and the database query q are combined into a new query qO, the so-called rewriting, which gives the same answer as the OMQ consisting of O and q over all input databases. It is of course essential that the rewriting qO is expressed in a language that can be handled by standard database systems. The typical language that is considered is the class of first-order (FO) queries.
In this work, we focus on two central OMQ languages based on guarded and frontier-guarded tuple-generating dependencies (TGDs), and we study the problem of whether an OMQ is FO-rewritable, i.e., whether it can be equivalently expressed as a first-order query. Recall that a guarded (resp., frontier-guarded) TGD is a sentence of the form $\forall \bar{x}, \bar{y}\,(\varphi(\bar{x}, \bar{y}) \rightarrow \exists \bar{z}\, \psi(\bar{x}, \bar{z}))$, where $\varphi$ and $\psi$ are conjunctions of relational atoms, and $\varphi$ has an atom that contains all the variables $(\bar{x} \cup \bar{y})$ (resp., $\bar{x}$) [1, 8]. Our goal is to develop specially tailored techniques that allow us to understand the above non-trivial problem, and also to pinpoint its computational complexity. To this end, as we discuss below, we follow two different approaches. Our results can be summarized as follows:
- We first focus on the simpler OMQ language based on guarded TGDs and atomic queries, and, in Section 2, we provide a characterization of FO-rewritability that forms the basis for applying tree automata techniques.
- We then exploit, in Section 3, standard two-way alternating parity tree automata. In particular, we reduce our problem to the problem of checking the finiteness of the language of an automaton. The reduction relies on a refined version of the characterization of FO-rewritability established in Section 2. This provides a transparent solution to our problem based on standard tools, but it does not lead to an optimal result.
- Towards an optimal result, we use, in Section 4, a more sophisticated automata model, known as cost automata. In particular, we reduce our problem to the problem of checking the boundedness of a cost automaton. This allows us to show that FO-rewritability for OMQs based on guarded TGDs and atomic queries is in 2EXPTIME, and in EXPTIME for predicates of bounded arity. The complexity analysis relies on an intricate result on the boundedness problem for a certain class of cost automata [5, 9].
- Finally, in Section 5, by using the results of Section 4, we provide a complete picture for the complexity of our problem, i.e., deciding whether an OMQ based on (frontier-)guarded TGDs and arbitrary (unions of) conjunctive queries is FO-rewritable.
Link: http://ceur-ws.org/Vol-2100/paper8.pdf
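The guardedness conditions recalled in the abstract above can be checked syntactically. The sketch below is only an illustration of those two definitions, not of the automata-based techniques of the paper. The encoding of atoms as `(predicate, variables)` pairs and the function names are assumptions made for the example, and all atom arguments are assumed to be variables (no constants).

```python
def variables(atoms):
    """Collect the variables occurring in a list of (predicate, args) atoms."""
    return {v for _, args in atoms for v in args}


def is_guarded(body):
    """Guarded: some body atom contains every variable of the body."""
    body_vars = variables(body)
    return any(body_vars <= set(args) for _, args in body)


def is_frontier_guarded(body, head):
    """Frontier-guarded: some body atom contains every frontier variable,
    i.e. every variable shared between body and head."""
    frontier = variables(body) & variables(head)
    return any(frontier <= set(args) for _, args in body)


# R(x, y) AND S(y) -> exists z. T(x, z): R(x, y) guards the whole body.
tgd1 = ([("R", ("x", "y")), ("S", ("y",))], [("T", ("x", "z"))])
# R(x, y) AND S(y, w) -> exists z. T(x, z): no guard, but the frontier {x} is covered by R.
tgd2 = ([("R", ("x", "y")), ("S", ("y", "w"))], [("T", ("x", "z"))])
# R(x, y) AND S(y, w) -> T(x, w): the frontier {x, w} is covered by no single body atom.
tgd3 = ([("R", ("x", "y")), ("S", ("y", "w"))], [("T", ("x", "w"))])

for body, head in (tgd1, tgd2, tgd3):
    print(is_guarded(body), is_frontier_guarded(body, head))
```

Every guarded TGD is also frontier-guarded, since the frontier is a subset of the body variables.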
We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2EXPTIME-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest.
Link: https://doi.org/10.24963/ijcai.2018/236
Automatic synthesis of software from specification is one of the classic problems in computer science. In the last decade, significant advances have been made in the synthesis of programs from specifications expressed in Linear Temporal Logic (LTL). LTL synthesis technology is central to a myriad of applications from the automated generation of controllers for Internet of Things devices, to the synthesis of control software for robotic applications. Unfortunately, the number of existing tools for LTL synthesis is limited, and using them requires specialized expertise. In this paper we present SynKit, a tool that offers LTL synthesis as a service. SynKit integrates a RESTful API and a web service with an editor, a solver, and a strategy visualizer.
This paper provides an overview of a model for capturing properties of client-server-based query computation setups. This model can be used to formally analyze different combinations of client and server capabilities, and compare them in terms of various fine-grain complexity measures. While the motivations and the focus of the presented work are related to querying the Semantic Web, the main concepts of the model are general enough to be applied in other contexts as well.
In this paper, we address the problem of LTL realizability and synthesis. State-of-the-art techniques rely on so-called bounded synthesis methods, which reduce the problem to a safety game. Realizability is determined by solving synthesis in a dual game. We provide a unified view of duality, and introduce novel bounded realizability methods via reductions to reachability games. Further, we introduce algorithms, based on AI automated planning, to solve these safety and reachability games. This is the first complete approach to LTL realizability and synthesis via automated planning. Experiments illustrate that reductions to reachability games are an alternative to reductions to safety games, and show that planning can be a competitive approach to LTL realizability and synthesis.
Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. If a CQ is hard to evaluate, it is thus useful to evaluate an approximation of it in such fragments. While underapproximations (i.e., those that return correct answers only) are well-understood, the dual notion of overapproximations that return complete (but not necessarily sound) answers, and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations is open. We develop a connection with existential pebble game tools that allows the systematic study of such problems. In particular, we show that the evaluation and identification of overapproximations can be solved in polynomial time. We also make progress on the problem of existence of overapproximations, showing it to be decidable in 2EXPTIME over the class of acyclic CQs. Furthermore, we look at when overapproximations do not exist, suggesting that this can be alleviated by using a more liberal notion of overapproximation. We also show how to extend our tools to study symmetric difference approximations. We observe that such approximations properly extend under- and over-approximations, settle the complexity of their associated identification problem, and provide several results on existence and evaluation.
We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.
Link: https://doi.org/10.1145/3183713.3190654
LTL synthesis is the task of generating a strategy that satisfies a Linear Temporal Logic (LTL) specification interpreted over infinite traces. In this paper we examine the problem of LTLf synthesis, a variant of LTL synthesis where the specification of the behaviour of the strategy we generate is interpreted over finite traces — similar to the assumption we make in many planning problems, and important for the synthesis of business processes and other system interactions of finite duration. Existing approaches to LTLf synthesis transform LTLf into deterministic finite-state automata (DFA) and reduce the synthesis problem to a DFA game. Unfortunately, the DFA transformation is worst-case double-exponential in the size of the formula, presenting a computational bottleneck. In contrast, our approach exploits non-deterministic automata, and we reduce the synthesis problem to a non-deterministic planning problem. We leverage our approach not only for strategy generation but also to generate certificates of unrealizability — the first such method for LTLf. We employ a battery of techniques that exploit the structure of the LTLf specification to improve the efficiency of our transformation to automata. We combine these techniques with lazy determinization of automata and on-the-fly state abstraction. We illustrate the effectiveness of our approach on a set of established LTL synthesis benchmarks adapted to finite LTL.
Link: https://aaai.org/ocs/index.php/ICAPS/ICAPS18/paper/view/17790
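To make the finite-trace interpretation mentioned in the abstract above concrete, here is a minimal sketch of an evaluator for LTLf formulas over a finite trace. It illustrates only the standard finite-trace semantics of the temporal operators, not the synthesis approach of the paper. The tuple encoding of formulas, the operator names and the function `holds` are assumptions made for the example.

```python
def holds(formula, trace, i=0):
    """Check whether an LTLf formula holds at position i of a finite trace.

    Formulas are nested tuples (a hypothetical encoding), e.g.
    ("always", ("or", ("not", ("ap", "request")), ("eventually", ("ap", "grant")))).
    The trace is a non-empty list of sets of atomic propositions.
    """
    op = formula[0]
    if op == "ap":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "next":
        # Strong next: a successor position must exist in the finite trace.
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "eventually":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "always":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "until":
        # formula[1] must hold until formula[2] holds, before the trace ends.
        for j in range(i, len(trace)):
            if holds(formula[2], trace, j):
                return True
            if not holds(formula[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")


trace = [{"request"}, set(), {"grant"}]
response = ("always", ("or", ("not", ("ap", "request")),
                       ("eventually", ("ap", "grant"))))
next_grant = ("next", ("ap", "grant"))

print(holds(response, trace))       # every request is eventually granted
print(holds(next_grant, trace, 1))  # position 2 exists and contains "grant"
print(holds(next_grant, trace, 2))  # no successor at the last position
```

The last call shows a difference from infinite-trace LTL: at the final position of a finite trace, a strong "next" can never be satisfied.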
Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. We study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand.
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in a quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner’s output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine grained analysis of the classes of document spanners that support efficient enumeration of their results.
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information. We extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment.
Since its invention, the Web has evolved into the largest multimedia repository that has ever existed. This evolution is a direct result of the explosion of user-generated content, explained by the wide adoption of social network platforms. The vast amount of multimedia content requires effective management and retrieval techniques. Nevertheless, Web multimedia retrieval is a complex task because users commonly express their information needs in semantic terms, but expect multimedia content in return. This dissociation between semantics and content of multimedia is known as the semantic gap. To solve this, researchers are looking beyond content-based or text-based approaches, integrating novel data sources. New data sources can consist of any type of data extracted from the context of multimedia documents, defined as the data that is not part of the raw content of a multimedia file. The Web is an extraordinary source of context data, which can be found in explicit or implicit relation to multimedia objects, such as surrounding text, tags, hyperlinks, and even in relevance-feedback. Recent advances in Web multimedia retrieval have shown that context data has great potential to bridge the semantic gap. In this article, we present the first comprehensive survey of context-based approaches for multimedia information retrieval on the Web. We introduce a data-driven taxonomy, which we then use in our literature review of the most emblematic and important approaches that use context-based data. In addition, we identify important challenges and opportunities, which had not been previously addressed in this area.
Link: https://doi.org/10.1007/s11042-017-4997-y
The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but also to understand what is happening in the world given the multiple points of view describing events. These challenges in journalism relate to Information Retrieval (IR) and Natural Language Processing (NLP) fields such as: verification of a source’s reliability; the integration of news with other sources of information; real-time processing of both news content and social streams; de-duplication of stories; and entity detection and disambiguation. Although IR and NLP have been applied to news for decades, the changing nature of the space requires fresh approaches and a closer collaboration with our colleagues from the journalism environment. Following the success of the previous version of the workshop (NewsIR’16), the goal of this workshop, held in conjunction with ECIR 2018, is to continue to stimulate such discussion between the communities and to share interesting approaches to solve real user problems. A total number of 19 submissions were received and reviewed, of which 12 were accepted for presentation. In addition to that, we had over 30 registered participants in the workshop who were pleased to attend the two keynote talks given by well-known experts in the field – Edgar Meij (from industry) and Peter Tolmie (from academia) and oral and poster presentations from the accepted papers. The workshop also included a breakout session to discuss ideas for a future data challenge in news IR and closed with a focused panel discussion to reflect on the day. In summary, several ideas were presented in the workshop on solving complex information needs in the news domain. 
In addition, the workshop concluded with suggestions of important challenges and shared tasks to work on as a community for News IR.
The Modified Mercalli Intensity Scale is a measure of the severity of an earthquake for a nonscientist. Since the Mercalli scale is based on perceived effects, it has a strong dependence on observers. Typically, these reports take time to be prepared and, as a consequence, Mercalli intensities are published hours after the occurrence of an earthquake. The National Seismological Center of Chile needs to provide a preliminary overview of the observed effects of an earthquake. This has motivated us to create a system for early tracking of people’s reactions in social networks to infer Mercalli intensities. By tracking people’s comments about the effects of an earthquake, a collection of Mercalli point estimates is retrieved at county level of granularity. We introduce the concept of Reinforced Mercalli support, which combines Mercalli point estimates with social support, allowing socially unsupported estimates to be discarded. Experimental results show that our proposal is accurate, providing early Mercalli reports 30 min after an earthquake and detecting the maximum Mercalli intensity of an event with high accuracy in terms of mean absolute error (MAE).
Link: https://doi.org/10.1007/978-3-319-91485-5_19
In general, existing methods for automatically detecting emergency situations using Twitter rely on features based on domain-specific keywords found in messages. Keyword-based methods of this type usually require training on domain-specific labeled data, using multiple languages, and for different types of events (e.g., earthquakes, floods, wildfires, etc.). In addition to being costly, these approaches may fail to detect previously unexpected situations, such as uncommon catastrophes or terrorist attacks. However, collective mentions of certain keywords are not the only type of self-organizing phenomena that may arise in social media when a real-world extreme situation occurs. Just as nearby physical sensors become activated when stimulated, localized citizen sensors (i.e., users) will also react in a similar manner. To leverage this information, we propose to use self-organized activity related to geolocations to identify emergency situations. We propose to detect such events by tracking the frequencies and probability distributions of the interarrival time of the messages related to specific locations. Using an off-the-shelf classifier that is independent of domain-specific features, we study and describe emergency situations based solely on location-based features in messages. Our findings indicate that anomalies in location-related social media user activity indeed provide information for automatically detecting emergency situations independent of their domain.
Link: https://doi.org/10.1145/3201064.3201077
Some decades have passed since the concept of “named entity” was used for the first time. Since then, new lines of research have emerged in this environment, such as linking the (named) entity mentions in a text collection with their corresponding knowledge-base entries.
However, this introduces problems with respect to a consensus on the definition of the concept of “entity” in the literature. This paper aims to highlight the importance of formalizing the concept of “entity” and the benefits it would bring to the Entity Linking community, in particular relating to the construction of gold standards for evaluation purposes.
Validating RDF graphs against constraints has gained interest in recent years, due to the popularity of RDF and the growth of knowledge bases. SHACL, a constraint language for RDF, has recently become a W3C recommendation, with a specification detailing syntax, semantics and common use cases. Unfortunately, this (otherwise complete) specification does not cover validation against recursive constraints. This omission is important, because SHACL by design favors constraint references. We investigate the possibility of a formal semantics for SHACL which covers the recursive case, while being compliant with the current standard.
Most of the current graph database systems have been designed to support property graphs. Surprisingly, there is no standard specification of the database model behind such systems. This paper presents a formal definition of the property graph database model. Specifically, we define the property graph data structure, basic notions of integrity constraints (e.g. graph schema), and a graph query language.
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a similarity join of a dataset with itself that pairs each element with its nearest neighbor within the same dataset. Solving the problem using a simple brute-force algorithm requires $O(n^2)$ distance calculations, since it requires comparing every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximate solution for the self-similarity join and computes only $O(n^{3/2})$ distances. We show how the algorithm can be easily modified in order to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and such that 79% of the results are within the 10-NN, with no significant extra distance computations. We present how the algorithm can be executed in parallel and prove that using $\Theta(\sqrt{n})$ processors, the total execution takes linear time. We end by discussing ways in which the algorithm can be improved in the future.
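For reference, the brute-force baseline mentioned in the abstract above can be sketched as follows. This is only the exact quadratic baseline, not the divide-and-conquer heuristic, whose details the abstract does not give. The function name, the evaluation counter and the toy data are assumptions made for the example.

```python
import math


def brute_force_1nn_self_join(points, dist):
    """Exact 1-NN self-similarity join by comparing every pair once.

    Returns (nearest, evaluations), where nearest[i] is the index of the
    nearest neighbor of points[i] and evaluations counts calls to dist.
    Using symmetry of the metric, n * (n - 1) / 2 distances are computed,
    which is still quadratic in n.
    """
    n = len(points)
    best = [(math.inf, None)] * n
    evaluations = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            evaluations += 1
            if d < best[i][0]:
                best[i] = (d, j)
            if d < best[j][0]:
                best[j] = (d, i)
    return [j for _, j in best], evaluations


points = [0.0, 1.0, 1.5, 10.0]
nearest, evaluations = brute_force_1nn_self_join(points, lambda a, b: abs(a - b))
print(nearest, evaluations)
```

Counting distance evaluations rather than wall-clock time matches the cost measure used in the abstract, where the distance function is assumed to dominate the running time.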
AMW 2018, Alberto Mendelzon Workshop on Foundations of Data Management.
Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management Cali, Colombia, May 21-25, 2018.
Edited by
Dan Olteanu, University of Oxford, UK
Bárbara Poblete, University of Chile, Chile.
Graph-based data models [1] have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph [2] where users can add new properties and types that can be used to define further data. The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of knowledge graphs. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs.
The traditional approach in the Semantic Web has been what Pham and Boncz [3] call the schema first approach, which defines the schema that the data should follow. The most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach [3], which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting high-level summaries that characterise the graph, resulting in a data-driven schema. In this paper, we summarise recently published results [4] on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains, and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.
We consider the problem of determining whether organizations facing a new data-transformation task can avoid building a new transformation procedure from scratch by reusing their stored procedures. Because it can be difficult to obtain exact descriptions of what stored procedures do, our framework abstracts data-transforming tools as black-box procedures, in which a procedure description indicates the parts of the database that might be modified by the procedure and constraints on the states of the database that must hold before and after the application of this procedure.
In this paper we present our framework and study the problem of determining, given a database and a set of procedures, whether there is a sequence of procedures from this set such that their application to the database results in the satisfaction of a boolean query. This data readiness problem is undecidable in general, but we show decidability for a broad and realistic class of procedures.
GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits really efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
Despite its importance to the Web, multimedia content is often neglected when building and designing knowledge-bases: though descriptive metadata and links are often provided for images, video, etc., the multimedia content itself is often treated as opaque and is rarely analysed. IMGpedia is an effort to bring together the images of Wikimedia Commons (including visual information), and relevant knowledge-bases such as Wikidata and DBpedia. The result is a knowledge-base that incorporates similarity relations between the images based on visual descriptors, as well as links to the resources of Wikidata and DBpedia that relate to the image. Using the IMGpedia SPARQL endpoint, it is then possible to perform visuo-semantic queries, combining the semantic facts extracted from the external resources and the similarity relations of the images. This paper presents a new web interface to browse and explore the dataset of IMGpedia in a more friendly manner, as well as new visuo-semantic queries that can be answered using 6 million recently added links from IMGpedia to Wikidata. We also discuss future directions we foresee for the IMGpedia project.
We research the problem of building knowledge maps of graph-like information. We live in the digital era and similarly to the Earth, the Web is simply too large and its interrelations too complex for anyone to grasp much of it through direct observation. Thus, the problem of applying cartographic principles also to digital landscapes is intriguing. We introduce a mathematical formalism that captures the general notion of map of a graph and enables its development and manipulation in a semi-automated way. We describe an implementation of our formalism on the Web of Linked Data graph and discuss algorithms that efficiently generate and combine (via an algebra) regions and maps. Finally, we discuss examples of knowledge maps built with a tool implementing our framework.
Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann. LDOW@WWW 2018, Lyon, France, April 23rd, 2018. CEUR Workshop Proceedings 2073, CEUR-WS.org 2018.
Linear Temporal Logic (LTL) synthesis can be understood as the problem of building a controller that defines a winning strategy, for a two-player game against the environment, where the objective is to satisfy a given LTL formula. It is an important problem with applications in software synthesis, including controller synthesis. In this paper we establish the correspondence between LTL synthesis and fully observable non-deterministic (FOND) planning. We study LTL interpreted over both finite and infinite traces. We also provide the first explicit compilation that translates an LTL synthesis problem to a FOND problem. Experiments with state-of-the-art LTL FOND and synthesis solvers show automated planning to be a viable and effective tool for highly structured LTL synthesis problems.
Info-boxes provide a summary of the most important meta-data relating to a particular entity described by a Wikipedia article. However, many articles have no info-box or have info-boxes with only minimal information; furthermore, there is a huge disparity between the level of detail available for info-boxes in English articles and those for other languages. Wikidata has been proposed as a central repository of facts to try to address such disparities, and has been used as a source of information to generate info-boxes. However, current processes still rely on human intervention either to create generic templates for entities of a given type or to create a specific info-box for a specific article in a specific language. As such, there are still many articles of Wikipedia without info-boxes but where relevant data are provided by Wikidata. In this paper, we investigate fully automatic methods to generate info-boxes for Wikipedia from the Wikidata knowledge graph. The primary challenge is to create ranking mechanisms that provide an intuitive prioritisation of the facts associated with an entity. We discuss this challenge, propose several straightforward metrics to prioritise information in info-boxes, and present an initial user evaluation to compare the quality of info-boxes generated by various metrics.
Navigational queries over RDF data are viewed as one of the main applications of graph query languages, and yet the standard model of graph databases—essentially labeled graphs—is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning.
Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our language with previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide an implementation of recursive TriAL on top of a relational query engine, and we show its usefulness by running a wide array of navigational queries over real-world RDF data, while at the same time testing how our implementation compares to existing RDF systems.
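To illustrate what closure over triples means, the following is a minimal sketch of a join that takes two sets of triples and returns a set of triples again. It is not the authors' implementation or notation: the function `triple_join`, its `out` and `cond` parameters, and the toy data are hypothetical, and the choice of keeping three of the six available positions is an assumption about how such a join can preserve closure.

```python
def triple_join(left, right, out, cond):
    """Join two sets of triples and return a set of triples.

    out:  three positions in 0..5 to keep (0-2 index the left triple, 3-5 the right one)
    cond: list of (i, j) position pairs that must hold equal values
    """
    result = set()
    for l in left:
        for r in right:
            row = l + r  # the six candidate values of the combined pair
            if all(row[i] == row[j] for i, j in cond):
                # Project back to exactly three positions, so the output is a triple
                result.add(tuple(row[p] for p in out))
    return result


# Hypothetical toy data
TRIPLES = {("a", "partOf", "b"), ("b", "partOf", "c")}

# Compose two steps: keep subject and predicate of the left triple and the object
# of the right one, requiring left object == right subject and equal predicates.
TWO_STEPS = triple_join(TRIPLES, TRIPLES, out=(0, 1, 5), cond=[(2, 3), (1, 4)])
print(TWO_STEPS)
```

Because the output is again a set of triples, it can be fed back into `triple_join`, which is the property a recursive extension relies on.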
The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is highly relevant, as it can give fundamental insights into protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is graphically designed as a graph such that the nodes represent the protein's components and the edges represent structural relationships. The resulting graph pattern is transformed into a SQL query, and executed in a PostgreSQL database system where the PDB data is stored. The results of the search are presented using a textual representation, and the corresponding binding-sites can be visualized using a JSmol interface.
Community Question Answering (cQA) sites have emerged as platforms designed specifically for the exchange of questions and answers among users. Although users tend to find good quality answers on cQA sites, they also engage in a significant volume of QA interactions on other platforms, such as microblog networking sites. This is explained in part by the fact that microblog platforms contain up-to-date information on current events, provide rapid information propagation, and have social trust.
Despite the potential of microblog platforms, such as Twitter, for automatic QA retrieval, how to leverage them for this task is not clear. There are unique characteristics that differentiate Twitter from traditional cQA platforms (e.g., short message length, low quality and noisy information), which prevent prior findings in the area from being applied directly. In this work, we address this problem by studying: (1) the feasibility of Twitter as a QA platform and (2) the discriminating features that identify relevant answers to a particular query. In particular, we create a document model at conversation-thread level, which enables us to aggregate microblog information, and set up a learning-to-rank framework, using factoid QA as a proxy task. Our experimental results show microblog data can indeed be used to perform QA retrieval effectively. We identify domain-specific features and combinations of those features that better account for improving QA ranking, achieving an MRR of 0.7795 (improving 62% over our baseline method). In addition, we provide evidence that our method can retrieve complex answers to non-factoid questions.
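For readers unfamiliar with the MRR figure quoted above, here is a minimal sketch of how mean reciprocal rank is usually computed. The function name and the toy rankings are hypothetical and are not taken from the paper's evaluation code or data.

```python
def mean_reciprocal_rank(rankings):
    """Compute MRR over a list of ranked result lists.

    Each inner list is ordered best-first and holds booleans,
    where True marks a relevant answer for that query.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranked in rankings:
        for position, relevant in enumerate(ranked, start=1):
            if relevant:
                # Only the first relevant answer counts for this query
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical toy data: three queries with their ranked candidate answers
TOY_RANKINGS = [[False, True], [True], [False, False, False]]
TOY_MRR = mean_reciprocal_rank(TOY_RANKINGS)
print(TOY_MRR)
```

A query with no relevant answer in its list contributes zero, so the score rewards systems that place a correct answer near the top for as many queries as possible.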
Link: https://doi.org/10.1007/978-3-319-76941-7_38
Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martinez, Barbara Poblete, Andreas Valochas: Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval co-located with 40th European Conference on Information Retrieval (ECIR 2018), Grenoble, France, March 26, 2018. CEUR Workshop Proceedings 2079, CEUR-WS.org 2018.
In the Encyclopedia of Big Data Technologies:
A query language is a high-level computer language for the retrieval and modification of data held in databases or files. Query languages usually consist of a collection of operators which can be applied to any valid instances of the data structure types of a data model, in any combination desired.
In the context of graph data management, a graph query language (GQL) defines the way to retrieve or extract data which have been modeled as a graph and whose structure is defined by a graph data model. Therefore, a GQL is designed to support specific graph operations, such as graph pattern matching and shortest path finding.
In the Encyclopedia of Big Data Technologies:
Navigational query languages for graph databases allow users to recursively traverse the edges of a graph while checking for the existence of a path that satisfies certain regular conditions. The basic building block of such languages is the class of regular path queries (RPQs), which are expressions that compute the pairs of nodes that are linked by a path whose label satisfies a regular expression. RPQs are often extended with features that make them more flexible for practical applications, e.g., with the ability to traverse edges in the backward direction (RPQs with inverses) or to express arbitrary patterns over the data (conjunctive RPQs).
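The definition of an RPQ can be made concrete with a small sketch. It assumes the regular expression has already been turned into a nondeterministic finite automaton, and it searches the product of the graph and that automaton, a standard textbook approach. The function `eval_rpq`, the edge list, and the automaton below are hypothetical illustrations, not code from the encyclopedia entry.

```python
from collections import deque


def eval_rpq(edges, delta, start, finals):
    """Return all pairs (u, v) linked by a path whose label word the NFA accepts.

    edges:  iterable of (source, label, target) triples
    delta:  dict mapping (state, label) to a set of next states
    """
    # Adjacency list of the graph: node -> [(label, target), ...]
    adj = {}
    nodes = set()
    for s, a, t in edges:
        adj.setdefault(s, []).append((a, t))
        nodes.update((s, t))

    answers = set()
    for u in nodes:
        # Breadth-first search over (graph node, automaton state) pairs
        seen = {(u, start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in finals:
                answers.add((u, node))
            for label, nxt in adj.get(node, []):
                for q in delta.get((state, label), ()):
                    if (nxt, q) not in seen:
                        seen.add((nxt, q))
                        queue.append((nxt, q))
    return answers


# Hypothetical toy graph and an NFA for the expression: knows+ followed by worksAt
EDGES = [("ana", "knows", "bob"), ("bob", "knows", "eve"), ("eve", "worksAt", "acme")]
DELTA = {(0, "knows"): {1}, (1, "knows"): {1}, (1, "worksAt"): {2}}
RESULT = eval_rpq(EDGES, DELTA, start=0, finals={2})
print(sorted(RESULT))
```

Tracking visited (node, state) pairs keeps the search finite even when the graph has cycles, which is why recursion over edges does not lead to unbounded traversal.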
We consider the containment problem for regular queries with memory and regular queries with data tests: two recently proposed query languages for graph databases that, in addition to allowing the user to ask topological queries, also track how the data changes along paths connecting various points in the database. Our results show that the problem is undecidable in general. However, by allowing only positive data comparisons we find natural fragments with better static analysis properties: the containment problem is PSpace-complete in the case of regular queries with data tests and ExpSpace-complete in the case of regular queries with memory.
Highlights:
- Study of graph query languages that can deal with data values.
- Containment problem for these languages is undecidable in general.
- Containment is decidable if one focuses on languages that can only check for equalities.
- Proofs make use of automata models.
This book constitutes the refereed post-conference proceedings of the 22nd Iberoamerican Congress on Pattern Recognition, CIARP 2017, held in Valparaíso, Chile, in November 2017.
The 87 papers presented were carefully reviewed and selected from 156 submissions. The papers feature research results in the areas of pattern recognition, image processing, computer vision, multimedia and related fields.
Link: https://link.springer.com/book/10.1007%2F978-3-319-75193-1
RDF compression can be defined as the problem of encoding an RDF dataset using fewer bits than required by traditional text-based serialization formats like RDF/XML, NTriples, or Turtle, among others. These savings immediately lead to more efficient storage (i.e., archival) and lower transmission costs (i.e., fewer bits over the wire). Although this problem can be easily solved through universal compression (e.g., gzip or bzip2), optimized RDF-specific compressors take advantage of the particular features of RDF datasets (such as semantic redundancies) in order to save more bits or to provide retrieval operations on the compressed information. RDF self-indexes are focused on this latter task.
RDF self-indexes are RDF compressors that provide indexing features in a space close to that of the compressed dataset and can be accessed with no prior (or partial) decompression. These properties enhance scalability (i.e., fewer resources are required to serve semantic data) and speed up access as more information can be managed in higher levels of the memory hierarchy (typically, main memory or cache). In addition, efficient search algorithms have been proposed to resolve basic queries on top of self-indexed datasets. As a result, RDF self-indexes have been adopted as a core component of semantic search engines and lightweight Linked Data servers.
Finally, RDF stream compressors specifically focus on compressing a (continuous) stream of RDF data in order to improve exchange processes, typically in real-time. This constitutes a more recent trend that exploits different trade-offs between the space savings achieved by the compressor and the latency introduced in the compression/decompression processes.
This entry introduces basic notions of RDF compression, RDF self-indexing, and RDF stream compression and discusses how existing approaches deal with (and remove) redundant information in semantic datasets.
DOI: https://doi.org/10.1007/978-3-319-63962-8_62-1
3D objects are an important type of data with many applications in domains such as Engineering and Computer Aided Design, Science, Simulation, Visualization, Cultural Heritage, and Entertainment. Technological progress in acquisition, modeling, processing, and dissemination of 3D geometry leads to the accumulation of large repositories of 3D objects. Consequently, there is a strong need to research and develop technology to support the effective retrieval of 3D object data from 3D repositories.
Link: https://doi.org/10.1007/978-1-4899-7993-3_161-2