Publications
¿Puede una máquina ver mejor que un humano? [Can a machine see better than a human?] In: “Inteligencia artificial: aplicaciones de la inteligencia artificial” (2021).
Three popular application domains of sentiment and emotion analysis are: 1) the automatic rating of movie reviews, 2) extracting opinions and emotions on Twitter, and 3) inferring sentiment and emotion associations of words. The textual elements of these domains differ in their length, i.e., movie reviews are usually longer than tweets and words are obviously shorter than tweets, but they also share the property that they can be plausibly annotated according to the same affective categories (e.g., positive, negative, anger, joy). Moreover, state-of-the-art models for these domains are all based on the approach of training supervised machine learning models on manually annotated examples. This approach suffers from an important bottleneck: Manually annotated examples are expensive and time-consuming to obtain and not always available. In this paper, we propose a method for transferring affective knowledge between words, tweets, and movie reviews using two representation techniques: Word2Vec static embeddings and BERT contextualized embeddings. We build compatible representations for movie reviews, tweets, and words, using these techniques, and train and evaluate supervised models on all combinations of source and target domains. Our experimental results show that affective knowledge can be successfully transferred between our three domains, that contextualized embeddings tend to outperform their static counterparts, and that better transfer learning results are obtained when the source domain has longer textual units than the target domain.
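As a minimal illustration of this kind of cross-domain setup (not the paper's actual pipeline), one can encode source and target texts into a shared embedding space and train a classifier only on the source domain; the embed() helper below is a hypothetical placeholder for either averaged Word2Vec vectors or pooled BERT embeddings.

```python
# Minimal cross-domain transfer sketch. Assumes an embed(texts) -> matrix
# helper that maps words, tweets, or reviews into one shared vector space
# (e.g., averaged Word2Vec vectors or pooled BERT embeddings).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def transfer_experiment(embed, src_texts, src_labels, tgt_texts, tgt_labels):
    X_src = embed(src_texts)          # train only on the source domain...
    X_tgt = embed(tgt_texts)          # ...and evaluate on the target domain.
    clf = LogisticRegression(max_iter=1000).fit(X_src, src_labels)
    preds = clf.predict(X_tgt)
    return f1_score(tgt_labels, preds, average="macro")
```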
We introduce a time- and space-efficient technique to solve regular path queries over labeled graphs. We combine a bit-parallel simulation of the Glushkov automaton of the regular expression with the ring index introduced by Arroyuelo et al., exploiting its wavelet tree representation of the triples in order to efficiently reach the states of the product graph that are relevant for the query. Our query algorithm is able to simultaneously process several automaton states, as well as several graph nodes/labels. Our experimental results show that our representation uses 3-5 times less space than the alternatives in the literature, while generally outperforming them in query times (1.67 times faster than the next best).
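For illustration only, the bit-parallel idea can be sketched in isolation from the ring index: the active Glushkov states are kept as bits of an integer, so a single pass advances all of them at once. The follow and label_mask tables below are assumed to be precomputed from the regular expression; this is a generic Glushkov-style simulation, not the paper's query algorithm.

```python
# Bit-parallel simulation sketch of a Glushkov-style NFA (illustrative only).
# States are bits of an integer; follow[i] is the bitmask of states reachable
# from state i, and label_mask[c] marks the states labeled with symbol c.

def step(active, symbol, follow, label_mask):
    """Advance all active states by one symbol in a single pass."""
    reach = 0
    d = active
    while d:
        i = (d & -d).bit_length() - 1   # lowest set bit = one active state
        reach |= follow[i]
        d &= d - 1                      # clear that bit
    return reach & label_mask.get(symbol, 0)

def accepts(word, follow, label_mask, final_mask):
    active = 1                          # bit 0 = Glushkov initial state
    for c in word:
        active = step(active, c, follow, label_mask)
        if not active:
            return False
    return bool(active & final_mask)
```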
In this article, we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models, as well as languages used to query and validate knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We conclude with high-level future research directions for knowledge graphs.
Social networks are used every day to report daily events, although the information published on them often corresponds to fake news. Detecting such fake news has become a research topic that can be approached using deep learning. However, most of the current research on the topic is available only for the English language. When working on fake news detection in other languages, such as Spanish, one of the barriers is the low number of labeled datasets available in Spanish. Hence, we explore whether it is convenient to translate an English dataset to Spanish using Statistical Machine Translation. We use the translated dataset to evaluate the accuracy of several deep learning architectures and compare the results from the translated dataset and the original dataset in fake news classification. Our results suggest that the approach is feasible, although it requires high-quality translation techniques, such as those found in neural-based translation models.
Weakly-Sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- program classes that is defined on the basis of the conditions of stickiness and weak-acyclicity. Conjunctive query answering (QA) over WS programs has been investigated, and its tractability in data complexity has been established. However, the design and implementation of practical QA algorithms and their optimizations have remained open. In order to fill this gap, we first study Sticky and WS programs from the point of view of the behavior of the chase procedure. We extend the stickiness property of the chase to that of generalized stickiness of the chase (GSCh) modulo an oracle that selects (and provides) the predicate positions where finitely many values appear during the chase. Stickiness modulo a selection function S that provides only a subset of those positions defines sch(S), a semantic subclass of GSCh. Program classes with selection functions include Sticky and WS, and another syntactic class that we introduce and characterize, namely JWS, the class of jointly-weakly-sticky programs, which contains WS. The selection functions for these last three classes are computable, and no external, possibly non-computable oracle is needed. We propose a bottom-up QA algorithm for programs in the class sch(S), for a general selection function S. As a particular case, we obtain a polynomial-time QA algorithm for JWS and weakly-sticky programs. Unlike WS, JWS turns out to be closed under magic-sets query optimization. As a consequence, both the generic polynomial-time QA algorithm and its magic-sets optimization can be particularized and applied to WS.
Techniques for presenting objects spatially via density maps have been thoroughly studied, but there is a lack of research on how to display this information in the presence of several classes, i.e., multiclass density maps. Moreover, there is even less research on how to design an interactive visualization for comparison tasks on multiclass density maps. One application domain that requires this type of visualization for comparison tasks is crime analytics, and the lack of research in this area results in ineffective visual designs. To fill this gap, we study four types of techniques to compare multiclass density maps, using car theft data. The interactive techniques studied are swipe, translucent overlay, magic lens, and juxtaposition. The results of a user study (N=32) indicate that juxtaposition yields the worst performance for comparing distributions, whereas swipe and magic lens perform the best in terms of time needed to complete the experiment. Our research provides empirical evidence on how to design interactive idioms for multiclass density spatial data, and it opens a line of research for other domains and visual tasks.
The evaluation of research proposals and academic careers is subject to indicators of scientific productivity. Citations are critical signals of impact for researchers, and many indicators are based on these data. The literature shows that there are differences in citation patterns between areas. The scope and depth that these differences may have motivate the extension of these studies to consider types of articles and age groups of researchers. In this work, we conducted an exploratory study to elucidate what evidence there is about the existence of these differences in citation patterns. To perform this study, we collected historical data from Scopus. Analyzing these data, we evaluate whether there are measurable differences in citation patterns. This study shows that there are evident differences in citation patterns between areas, types of publications, and age groups of researchers that may be relevant when carrying out researchers’ academic evaluation.
We describe an ecosystem for teaching data science (DS) to engineers that blends theory, methods, and applications, developed at the Faculty of Physical and Mathematical Sciences (FCFM, by its Spanish acronym), Universidad de Chile, over the last three years. This initiative has been motivated by the increasing demand for DS qualifications from both academic and professional environments.
Automatic hate speech detection in online social networks is an important open problem in Natural Language Processing (NLP). Hate speech is a multidimensional issue, strongly dependent on language and cultural factors. Despite its relevance, research on this topic has been almost exclusively devoted to English. Most supervised learning resources, such as labeled datasets and NLP tools, have been created for this same language. Considering that a large portion of users worldwide speak languages other than English, there is an important need for creating efficient approaches for multilingual hate speech detection. In this work we propose to address the problem of multilingual hate speech detection from the perspective of transfer learning. Our goal is to determine if knowledge from one particular language can be used to classify other languages, and to determine effective ways to achieve this. We propose a hate-specific data representation and evaluate its effectiveness against general-purpose universal representations, most of which, unlike our proposed model, have been trained on massive amounts of data. We focus on a cross-lingual setting, in which one needs to classify hate speech in one language without having access to any labeled data for that language. We show that the use of our simple yet specific multilingual hate representations improves classification results. We explain this with a qualitative analysis showing that our specific representation is able to capture some common patterns in how hate speech presents itself in different languages.
Our proposal constitutes, to the best of our knowledge, the first attempt to construct multilingual specific-task representations. Despite its simplicity, our model outperformed the previous approaches in most of the experimental setups. Our findings can orient future solutions toward the use of domain-specific representations.
Once regarded as the poster child for democratic stability and sound policymaking in Latin America, in the last two decades Chile has experienced increasing levels of mistrust in political institutions and media elites, as well as disenfranchisement. In the wake of the mass protests of October 2019, the COVID-19 pandemic found the Chilean government at record levels of disapproval and with citizens skeptical of messages by authorities and legacy media. Based on data from an online survey and a narrative analysis of public discourse of key government interventions during the first six months of the pandemic, this chapter pays attention to individuals’ perceptions regarding the coronavirus crisis and offers a qualitative assessment of how the government’s handling was addressed in the public sphere. Findings show that Chileans have been skeptical of government measures and critical of officials’ handling of the situation, regardless of their support for the administration. With the news media struggling to hold authorities accountable, the resulting crisis has only deepened the political, economic, and social divisions within Chilean society.
This article presents a characterization of three types of strategies (non-electoral, non-partisan electoral, and partisan) pursued by the right in contemporary Latin America. By analyzing the recent course of the party systems in the region, we also argue that the so-called turn to the right constitutes a process of power alternation generated by the punishment of incumbents of the last decade and a half (mostly on the left), rather than a structural ideological realignment. This alternation in power occurs in a context in which established parties tend to disappear or become substantially weaker, and in which short-lived electoral vehicles are gaining traction. Finally, we argue that there does not seem to be space in the region today, especially because of the social crisis associated with the effects of the COVID-19 pandemic, for the strengthening of a neoliberal right. However, the current context does seem propitious for the emergence of right-wing outsiders, capable of structuring a pro-order agenda that incorporates, in different proportions, ‘iron-hand’ policies, value conservatism, and market liberalism.
“Cabildo Abierto: oportunidades y desafíos para la construcción partidaria en un sistema de partidos institucionalizado” [Cabildo Abierto: opportunities and challenges for party building in an institutionalized party system]. In Juan Andrés Moraes and Verónica Pérez Bentancur, De la estabilidad al equilibrio inestable: elecciones y comportamiento electoral. Instituto de Ciencia Política (with Felipe Monestier and Lihuen Nocetto).
Misleading information spread on social networks is often supported by activists who promote this type of information and by bots that amplify its visibility. The need for useful and timely mechanisms of credibility assessment in social media has become increasingly pressing. Efforts to tackle this problem in Spanish are growing: recent years have witnessed many efforts to develop methods to detect fake news, rumors, stances, and bots on the Spanish social web. This work presents a systematic review of the literature on the efforts to develop this area in the Spanish language. The work identifies pending tasks for this community and challenges that require coordination among the leading investigators on the subject.
To more fully understand the belief gap hypothesis, this study examines the effect of political identity, education, and partisan media consumption on the formation of attitudes and false beliefs. Using a two-wave, nationally representative online survey of the U.S., we assess people’s attitudes and beliefs toward climate change, on the one hand, and Syrian refugees, on the other. Building on previous studies, we demonstrate that the effect of one’s political identity on attitudes and false beliefs is contingent upon education, which appears to widen the belief gap in concert with political identity.
We propose methods for extracting triples from Wikipedia’s HTML tables using a reference knowledge graph. Our methods use a distant-supervision approach to find existing triples in the knowledge graph for pairs of entities on the same row of a table, postulating the corresponding relation for pairs of entities from other rows in the corresponding columns, thus extracting novel candidate triples. Binary classifiers are applied on these candidates to detect correct triples and thus increase the precision of the output triples. We extend this approach with a preliminary step where we first group and merge similar tables, thereafter applying extraction on the larger merged tables. More specifically, we propose an observed schema for individual tables, which is used to group and merge tables. We compare the precision and number of triples extracted with and without table merging, where we show that with merging, we can extract a larger number of triples at a similar precision. Ultimately, from the tables of English Wikipedia, we extract 5.9 million novel and unique triples for Wikidata at an estimated precision of 0.718.
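A minimal sketch of the distant-supervision step, under the assumption that table cells are already linked to knowledge-graph entities (the paper's table merging and candidate classifiers are omitted): relations observed between a column pair on some rows are postulated for the entity pairs of the remaining rows.

```python
# Distant-supervision sketch over a single table (illustrative only; entity
# linking, table merging, and the filtering classifiers are assumed/omitted).
from collections import defaultdict

def extract_candidates(table, kg_triples):
    """table: list of rows, each a list of linked entity IDs (None if unlinked).
    kg_triples: set of (subject, predicate, object) triples in the reference KG."""
    by_pair = defaultdict(set)                     # (s, o) -> known predicates
    for s, p, o in kg_triples:
        by_pair[(s, o)].add(p)

    observed = defaultdict(set)                    # (col_i, col_j) -> predicates
    for row in table:
        for i, s in enumerate(row):
            for j, o in enumerate(row):
                if i != j and s is not None and o is not None:
                    observed[(i, j)] |= by_pair.get((s, o), set())

    candidates = set()                             # postulate novel triples
    for (i, j), preds in observed.items():
        for row in table:
            s, o = row[i], row[j]
            if s is None or o is None:
                continue
            for p in preds:
                if (s, p, o) not in kg_triples:
                    candidates.add((s, p, o))
    return candidates
```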
Due to the importance of linear algebra and matrix operations in data analytics, there has been a renewed interest in developing query languages that combine both standard relational operations and linear algebra operations. We survey aspects of the matrix query language MATLANG and extensions thereof, and connect matrix query languages to classical query languages and arithmetic circuits.
Tracking the historical events that led to the interweaving of data and knowledge.
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures.
In this second part, we describe the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them in both theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures.
In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey.
A sentiment lexicon is a list of expressions annotated according to affect categories such as positive, negative, anger and fear. Lexicons are widely used in sentiment classification of tweets, especially when labeled messages are scarce. Sentiment lexicons are prone to obsolescence due to: 1) the arrival of new sentiment-conveying expressions such as #trumpwall and #PrayForParis and 2) temporal changes in sentiment patterns of words (e.g., a scandal associated with an entity). In this paper, we propose a methodology for automatically inducing continuously updated sentiment lexicons from Twitter streams by training incremental word sentiment classifiers from time-evolving distributional word vectors. We experiment with various sketching techniques for efficiently building incremental word context matrices and study how the lexicon adapts to drastic changes in the sentiment pattern. Change is simulated by randomly picking some words from a testing partition of words and swapping their context with the context of words exhibiting the opposite sentiment. Our experimental results show that our approach allows for successful tracking of the sentiment of words over time even when drastic change is induced.
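A much-simplified sketch of the incremental setting (the actual sketching techniques and classifiers studied in the paper are not reproduced): word-context counts are updated per incoming tweet over a fixed hashed context space, and a classifier trained on a seed lexicon periodically re-labels the remaining words. The seed lexicon, window size, and classifier are assumptions made for illustration.

```python
# Simplified incremental word-context sketch (hashing trick, fixed dimension).
import numpy as np

DIM = 1024                                  # hashed context dimensionality

class IncrementalWordVectors:
    def __init__(self):
        self.vecs = {}                      # word -> running context-count vector

    def update(self, tokens, window=2):
        for i, w in enumerate(tokens):
            v = self.vecs.setdefault(w, np.zeros(DIM))
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    v[hash(tokens[j]) % DIM] += 1.0

def relabel(vectors, seed_lexicon, clf):
    """Retrain on seed-lexicon words and score every other word's current vector."""
    seed = [w for w in seed_lexicon if w in vectors.vecs]
    clf.fit(np.vstack([vectors.vecs[w] for w in seed]),
            [seed_lexicon[w] for w in seed])
    rest = [w for w in vectors.vecs if w not in seed_lexicon]
    if not rest:
        return {}
    return dict(zip(rest, clf.predict(np.vstack([vectors.vecs[w] for w in rest]))))
```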
Due to its huge impact on the overall quality of service (QoS) of wireless networks, both academic and industrial research have actively focused on analyzing the received signal strength in areas of particular interest. In this paper, we propose improving signal-strength aggregation, with a special focus on Mobile Crowdsourcing scenarios, by avoiding common issues related to the mishandling of log-scaled signal values and by proposing a novel aggregation method based on interpolation. Our paper presents two clear contributions. First, we discuss the misuse of log-scaled signal-strength values, which is a persistent problem within the mobile computing community. We present the physical and mathematical formalities on how signal-strength values must be handled in a scientific environment. Second, we present a solution to the difficulties of aggregating signal strength in Mobile Crowdsourcing scenarios, such as a low number of measurements and nonuniformity in spatial distribution. Our proposed method obtained consistently lower Root Mean Squared Error (RMSE) values than other commonly used methods at estimating the expected value of signal strength over an area. Both contributions of this paper are important for several recent pieces of research that characterize signal strength for an area of interest.
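The point about log-scaled values can be made concrete with a small sketch using the standard dBm/mW conversions (the interpolation-based aggregation method itself is not reproduced): averaging dBm readings directly is not equivalent to averaging the underlying power.

```python
# Averaging signal strength: dBm values are log-scaled, so they should be
# converted to linear power (mW), averaged, and converted back to dBm.
import math

def mean_signal_dbm(readings_dbm):
    powers_mw = [10 ** (dbm / 10.0) for dbm in readings_dbm]   # dBm -> mW
    return 10.0 * math.log10(sum(powers_mw) / len(powers_mw))  # mW -> dBm

readings = [-60.0, -90.0]
print(mean_signal_dbm(readings))   # about -63 dBm, not the naive -75 dBm
```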
We address the problem of storing and analyzing large datasets of passenger trips over public transportation networks that are of interest to network administrators trying to balance transportation offers (e.g., frequency of vehicles) according to the historical demand. We exploit the fact that all passenger trips made within the same vehicle share the same trajectories to reduce their redundancy and provide a representation, based on well-known compact data structures, that not only reduces the space requirements of the original passenger trajectories but also efficiently supports querying. Our solution uses two complementary representations: T-Matrices, which excels at querying the aggregated network load, and TTCTR, which represents all passenger trips and aims at counting the trips following a given pattern (i.e., how many passengers started/ended a trip at a given location or moved from a given location to another). In addition, we propose XCTR, a variant of TTCTR, which efficiently answers a wider range of queries at the cost of a moderate performance loss for some queries and some space overhead. Overall, our representation can handle a dataset of ten million trips within approximately 65% of its original size while supporting a wide range of queries in the order of microseconds.
We concentrate on ontology-mediated queries (OMQs) expressed using guarded Datalog∃ and conjunctive queries. Guarded Datalog∃ is a rule-based knowledge representation formalism inspired by the guarded fragment of first-order logic, while conjunctive queries represent a prominent database query language that lies at the core of relational calculus (i.e., first-order queries). For such guarded OMQs we discuss three main algorithmic tasks: query evaluation, query containment, and first-order rewritability. The first one is the task of computing the answer to an OMQ over an input database. The second one is the task of checking whether the answer to an OMQ is contained in the answer of some other OMQ on every input database. The third one asks whether an OMQ can be equivalently rewritten as a first-order query. For query evaluation, we explain how classical results on the satisfiability problem for the guarded fragment of first-order logic can be applied. For query containment, we discuss how tree automata techniques can be used. Finally, for first-order rewritability, we explain how techniques based on a more sophisticated automata model, known as cost automata, can be exploited.
The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, current recommendation methods based on graph embeddings have shown state-of-the-art performance. These methods commonly encode latent rating patterns and content features. Different from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Our approach has the advantage of providing explanations which leverage aspect-based opinions given by users about recommended items. Furthermore, we also provide examples of the applicability of recommendations utilizing aspect opinions as explanations in a visualization dashboard, which allows obtaining information about the most and least liked aspects of similar users obtained from the embeddings of an input graph.
While women are underrepresented in politics, recent improvements in women’s representation in legislative and executive bodies have spurred academic interest in the effects of electing women on a wide array of outcomes. Effects on bureaucracies, however, have received less attention. Do women mayors reform local bureaucracies differently than their men counterparts? We take advantage of rich administrative data from Chile to explore the effects of having a woman mayor on the size and gender composition of municipal bureaucracies. Using a regression discontinuity design in close electoral races, we find that women mayors reduce the size of local bureaucracies while simultaneously increasing the share of women public employees. Our findings thus show that women mayors’ approach to bureaucratic reform once in office differs from that of their men counterparts, and contribute to existing research on the consequences of electing women.
Electrolytic refining is the last step of pyrometallurgical copper production. Here, smelted copper is converted into high-quality cathodes through electrolysis. Cathodes that do not meet the physical quality standards are rejected and further reprocessed or sold at a minimum profit. Prediction of cathodic rejection is therefore of utmost importance to accurately forecast the economic production of the electrorefining cycle. Several attempts have been made to estimate the outcomes of this process, mostly based on physical models of the underlying electrochemical reactions. However, they do not stand up to the complexity of real operations. Data-driven methods, such as deep learning, allow modeling complex non-linear processes by learning representations directly from the data. We study the use of several recurrent neural network models to estimate the cathodic rejection of a cathodic cycle, using a series of operational measurements throughout the process. We provide an ARMAX model as a benchmark. Basic recurrent neural network models are analyzed first: a vanilla RNN and an LSTM model provide an initial approach. These are further composed into an Encoder-Decoder model that uses an attention mechanism to selectively weight the input steps that provide the most information upon inference. This model obtains a 5.45% relative error, improving on the proposed benchmark by 81.4%. Finally, we study the attention mechanism’s output to distinguish the most relevant electrorefining process steps. We identify the initial state as critical in predicting cathodic rejection. This information can be used as an input for decision support systems or control strategies to reduce cathodic rejection and improve the profitability of electrolytic refining.
Ranking items or people is a fundamental operation at the basis of several processes and services, not all of them happening online. Ranking is required for different tasks, including search, personalization, recommendation, and filtering. While traditionally ranking has been aimed solely at maximizing some global utility function, recently the awareness of potential discrimination against some of the elements to rank has captured the attention of researchers, who have thus started devising ranking systems that are non-discriminatory or fair for the items being ranked. So far, researchers have mostly focused on group fairness, which is usually expressed in the form of constraints on the fraction of elements from some protected groups that should be included in the top-k positions, for any relevant k. These constraints are needed in order to correct implicit societal biases existing in the input data and reflected in the relevance or fitness scores computed.
In this article, we tackle the problem of selecting a subset of k individuals from a pool of n >> k candidates, maximizing global utility (i.e., selecting the “best” candidates) while respecting given group-fairness criteria. In particular, to tackle this Fair Top-k Ranking problem, we adopt a ranked group-fairness definition which extends the standard notion of group fairness based on protected groups, by ensuring that the proportion of protected candidates in every prefix of the top-k ranking remains statistically above, or indistinguishable from, a given minimum threshold. Our notion of utility requires, intuitively, that every individual included in the top-k should be more qualified than every candidate not included, and that for every pair of candidates in the top-k, the more qualified candidate should be ranked above.
The main contribution of this paper is an algorithm for producing a fair top-k ranking that can be used when more than one protected group is present, which means that a statistical test based on a multinomial distribution needs to be used instead of one based on a binomial distribution, as the original FA*IR algorithm does. This poses important technical challenges and increases both the space and time complexity of the re-ranking algorithm. Our experimental assessment on real-world datasets shows that our approach yields small distortions with respect to rankings that maximize utility without considering our fairness criteria.
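The ranked group-fairness test can be illustrated in its simplest, single-protected-group (binomial) form; the paper's actual contribution is the multinomial extension to several protected groups, which this sketch does not implement.

```python
# Simplified ranked group-fairness check, binomial case (one protected group).
# The multinomial test for several groups is not reproduced here.
from scipy.stats import binom

def fair_prefixes(ranking_is_protected, p_min, alpha=0.1):
    """ranking_is_protected: list of booleans, top position first.
    Checks, for every prefix of length k, that the number of protected
    candidates is not significantly below a Binomial(k, p_min) expectation."""
    protected_so_far = 0
    for k, is_prot in enumerate(ranking_is_protected, start=1):
        protected_so_far += int(is_prot)
        # probability of seeing this few (or fewer) protected candidates
        if binom.cdf(protected_so_far, k, p_min) < alpha:
            return False, k      # prefix of length k violates the constraint
    return True, None
```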
Substance abuse and mental health issues are severe conditions that affect millions. Signs of certain conditions have been traced on social media through the analysis of posts. In this paper we analyze textual cues that characterize and differentiate Reddit posts related to depression, eating disorders, suicidal ideation, and alcoholism, along with control posts. We also generate enhanced word embeddings for binary and multi-class classification tasks dedicated to the detection of these types of posts. Our enhancement method to generate word embeddings focuses on identifying terms that are predictive for a class and aims to move their vector representations close to each other while moving them away from the vectors of terms that are predictive for other classes. Variations of the embeddings are defined and evaluated through predictive tasks, a cosine similarity-based method, and a visual approach. We generate predictive models using variations of our enhanced representations with statistical and deep learning approaches. We also propose a method that leverages the properties of the enhanced embeddings in order to build features for predictive models. Results show that variations of our enhanced representations outperform, in Recall, Accuracy, and F1-Score, the embeddings learned with Word2vec, DistilBERT, and GloVe’s fine-tuned pre-learned embeddings, as well as other methods based on domain-adapted embeddings. The approach presented has the potential to be used on similar binary or multi-class classification tasks that deal with small domain-specific textual corpora.
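A rough, hypothetical sketch of the enhancement idea (the paper's term-selection procedure and exact update rule differ): vectors of terms predictive for a class are pulled toward that class's centroid and pushed away from the centroids of the other classes.

```python
# Illustrative embedding-enhancement sketch: pull predictive terms toward their
# class centroid and push them away from other classes' centroids. The term
# selection, step sizes, and centroid-based update are assumptions.
import numpy as np

def enhance(embeddings, predictive_terms, alpha=0.1, beta=0.05):
    """embeddings: dict word -> float vector; predictive_terms: dict class -> [words]."""
    centroids = {c: np.mean([embeddings[w] for w in ws], axis=0)
                 for c, ws in predictive_terms.items()}
    for c, ws in predictive_terms.items():
        others = [centroids[o] for o in centroids if o != c]
        for w in ws:
            v = embeddings[w]
            v += alpha * (centroids[c] - v)          # move toward own class
            for oc in others:
                v -= beta * (oc - v)                 # move away from other classes
    return embeddings
```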
The Lempel-Ziv 78 (LZ78) and Lempel-Ziv-Welch (LZW) text factorizations are popular, not only for bare compression but also for building compressed data structures on top of them. Their regular factor structure makes them computable within space bounded by the compressed output size. In this article, we carry out the first thorough study of low-memory LZ78 and LZW text factorization algorithms, introducing more efficient alternatives to the classical methods, as well as new techniques that can run within less memory space than that necessary to hold the compressed file. Our results build on hash-based representations of tries that may have independent interest.
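For reference, the factor structure these algorithms exploit is the classical LZ78 parse, sketched below with a plain Python dictionary standing in for the trie; the article's point is computing this parse within much less working memory, which this sketch does not attempt.

```python
# Classical LZ78 parsing sketch (a dict stands in for the trie).
def lz78_factorize(text):
    """Return the LZ78 parse as a list of (phrase_id, next_char) pairs."""
    dictionary = {"": 0}                      # phrase -> id; 0 is the empty phrase
    factors = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                     # keep extending the known phrase
        else:
            factors.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                               # leftover phrase at end of text
        factors.append((dictionary[current], ""))
    return factors

print(lz78_factorize("ababcbababaa"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (2, 'a'), (5, 'b'), (1, 'a')]
```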
In this work we model the Internet as a physical–logical interdependent network composed of the logical Internet network (the Autonomous System level network), the physical Internet network (the Internet backbone), and their interactions. We have tested the effect of adding physical links on the Internet’s robustness against both physical random attacks and localized attacks. We add links using strategies that are simple enough to be used when information about the physical network is incomplete or not accurate enough to use more complex strategies. To measure the effect of adding links to the physical network, our tests keep the logical network and the set of interlinks constant. We tested four physical link addition strategies: random addition, distance addition, local hubs addition, and degree addition, over three different physical network models: Gabriel graphs, k-nearest neighbors, and relative neighborhood graphs, and two extreme space shapes based on the geography of real countries: a long and narrow space with a width-to-length ratio of (1:25), and a square space with a (1:1) width-to-length ratio. Our results show that there are High Damage Localized Attacks (HDLA): localized attacks that cause the failure of more than half of the logical network after removing less than 9% of the physical nodes. Some HDLA can even result in total failure. We found that HDLA are caused by the failure of “bridge nodes” in the logical network. Our results show that adding links to the physical network improves the robustness against localized attacks and physical random attacks. Adding physical links also decreases the damage caused by HDLA, but does not fully prevent them. We found that the degree and random addition strategies improve the Internet’s robustness the most, while distance addition is the most cost-efficient link addition strategy in terms of robustness improvement. We also found that the high robustness and low cost efficiency of the random strategy are related to the length of the links added, highlighting the importance of simple features such as the length of the added links for the robustness of physical–logical interdependent networks. Our findings suggest that, given cost constraints, it may be better to add more physical links using distance addition than to add fewer physical links using the degree or random link addition strategies, and that more cost-efficient versions of the degree strategy could be obtained by simply limiting the length of the links added by the strategy.
Even though many countries in Latin America have adopted FOI Laws, there are significant differences in the institutional design of FOI oversight institutions. Most explanations highlight the role of political competition in motivating political actors to design strong de jure FOI oversight institutions. The design of FOI oversight institutions in Chile, Peru and Uruguay, however, cannot fully be explained by political competition. We show how isomorphic pressures help explain variation in the de jure strength of the FOI oversight institutions. Our findings highlight the importance of considering domestic constraints on the diffusion of one-size-fits-all models. To analyze each case, we conducted a systematic process-tracing analysis. Our in-depth analysis allowed us to assess different theories concerning the specific institutional design of FOI oversight institutions.
An increasing number of researchers use software images to capture the requirements and code dependencies needed to carry out computational experiments. Software images preserve the computational environment required to execute a scientific experiment and have become a crucial asset for reproducibility. However, software images are usually not properly documented and described, making it challenging for scientists to find, reuse and understand them. In this paper, we propose a framework for automatically describing software images in a machine-readable manner by (i) creating a vocabulary to describe software images; (ii) developing an annotation framework designed to automatically document the underlying environment of software images; and (iii) creating DockerPedia, a Knowledge Graph with over 150,000 annotated software images, automatically described using our framework. We illustrate the usefulness of our approach in finding images with specific software dependencies, comparing similar software images, addressing versioning problems when running computational experiments, and flagging problems with vulnerable software dependencies.
Divisive politics have hit many Latin American countries hard in recent years, fueled by numerous underlying fissures and issues including economic inequality and exclusion, corruption, ideological differences, high levels of violence, and chronically weak state capacity. The coronavirus pandemic has only intensified these pressures. Latin America thus enters 2021 shadowed by an ominous sense that democracy is under extraordinary strain.
To help shine a light on these troubled waters and chart the risks ahead, this collection of essays by a notable set of regional experts examines recent developments in six key countries: Bolivia, Brazil, Chile, Colombia, Mexico, and Peru. Taken together, the different country accounts present a sobering picture, though not an unrelievedly negative one. Divisions are deep, economic troubles are widespread, and the pandemic continues to devastate the lives of countless people in the region. The risks for democracy are serious, ranging from the rupture of basic democratic structures to the potential emergence of new illiberal political figures and forces. Remedial steps are possible, but they will be challenging to carry out. The collection seeks to help engaged actors and observers throughout the region and beyond better understand the troubling dynamics of rising political division and formulate effective responses.
The Domain Name System (DNS) is a critical component of Internet infrastructure, as almost every activity on the Internet starts with a DNS query. Given its importance, there is increasing concern over its vulnerability to attacks and failures, as they can negatively affect all Internet-based resources. Thus, detecting these events is crucial to preserve the correct functioning of all DNS components, such as high-volume name servers for top-level domains (TLD). This article presents a near real-time Anomaly Detection Based on Prediction (AD-BoP) method, providing a useful and easily explainable methodology to effectively detect DNS anomalies. AD-BoP is based on the prediction of expected DNS traffic statistics, and could be especially helpful for TLD registry operators to preserve their services’ reliability. After an exhaustive analysis, AD-BoP is shown to improve the current state-of-the-art for anomaly detection in authoritative TLD name servers.
The Sistema de Evaluación de Impacto Ambiental (Environmental Impact Assessment System—SEIA) evaluates all projects potentially harmful to human health and the environment in Chile. Since its establishment, many projects approved by the SEIA have been contested by organized communities, especially in the energy sector. The question guiding our research is whether socio-environmental conflicts affect the evaluation times and the approval rates of projects under assessment. Using a novel database comprising all energy projects assessed by the SEIA, we analyzed 380 energy projects that entered the SEIA review process between 2012 and 2017 and matched these projects with protest events. Using linear and logit regression, we find no association between the occurrence of protests aimed at specific projects and the probability of project approval. We do, however, find that projects associated with the occurrence of protest events experience significantly longer review times. To assess the robustness of this finding, we compare two run-of-river plants proposed in Mapuche territory in Chile’s La Araucanía region. We discuss the broader implications of these findings for sustainable environmental decision making.
Large-scale pre-trained language models have shown remarkable results in diverse NLP applications. Unfortunately, these performance gains have been accompanied by a significant increase in computation time and model size, stressing the need to develop new or complementary strategies to increase the efficiency of these models. In this paper we propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models. DACT-BERT adds an adaptive computational mechanism to BERT’s regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time. By doing this, the model learns to combine the most appropriate intermediate representations for the task at hand. Our experiments demonstrate that our approach, when compared to the baselines, excels in a reduced computational regime and is competitive in other, less restrictive ones.
Deep learning, one of the fastest-growing branches of artificial intelligence, has become one of the most relevant research and development areas of the last years, especially since 2012, when a neural network surpassed the most advanced image classification techniques of the time. This spectacular development has not been alien to the world of the arts, as recent advances in generative networks have made possible the artificial creation of high-quality content such as images, movies or music. We believe that these novel generative models propose a great challenge to our current understanding of computational creativity. If a robot can now create music that an expert cannot distinguish from music composed by a human, or create novel musical entities that were not known at training time, or exhibit conceptual leaps, does it mean that the machine is then creative? We believe that the emergence of these generative models clearly signals that much more research needs to be done in this area. We would like to contribute to this debate with two case studies of our own: TimbreNet, a variational auto-encoder network trained to generate audio-based musical chords, and StyleGAN Pianorolls, a generative adversarial network capable of creating short musical excerpts, despite the fact that it was trained with images and not musical data. We discuss and assess these generative models in terms of their creativity and we show that they are in practice capable of learning musical concepts that are not obvious based on the training data, and we hypothesize that these deep models, based on our current understanding of creativity in robots and machines, can be considered, in fact, creative.
The tension between health and economic considerations regarding COVID-19 has resulted in a framing contest, in which proponents and adversaries of strong containment measures hold oppositional frames about the pandemic. This study examines the effects of competing news frames on social media users’ policy preferences and the moderation of framing effects played by melodramatic news treatment. Results from a pre-registered online survey experiment in Chile (N = 518) show that participants exposed to Facebook posts with an economic frame were significantly less supportive of measures that restrict mobility (e.g., quarantines) than participants in the control group. Contrary to expectations, exposure to a public health frame also reduced support for stay-at-home orders, and the presence of melodramatic features had no significant impact on users’ preferences. Other variables, however, did alter these framing effects, such as fear of COVID-19 and frequency of social media news use. These findings paint a rather complex picture of framing effects during the pandemic in a digital media environment.
Compressing real-world graphs has many benefits such as improving or enabling the visualization in small memory devices, graph query processing, community search, and mining algorithms. This work proposes a novel compact representation for real sparse and clustered undirected graphs. The approach lists all the maximal cliques by using a fast algorithm and defines a clique graph based on its maximal cliques. Further, the method defines a fast and effective heuristic for finding a clique graph partition that avoids the construction of the clique graph. Finally, this partition is used to define a compact representation of the input graph. The experimental evaluation shows that this approach is competitive with the state-of-the-art methods in terms of compression efficiency and access times for neighbor queries, and that it recovers all the maximal cliques faster than using the original graph. Moreover, the approach makes it possible to query maximal cliques, which is useful for community detection.
We address the task of automatically generating a medical report from chest X-rays. Many authors have proposed deep learning models to solve this task, but they focus mainly on improving NLP metrics, such as BLEU and CIDEr, which are not suitable to measure clinical correctness in clinical reports. In this work, we propose CNN-TRG, a Template-based Report Generation model that detects a set of abnormalities and verbalizes them via fixed sentences, which is much simpler than other state-of-the-art NLG methods and achieves better results in medical correctness metrics.
We benchmark our model on the IU X-ray and MIMIC-CXR datasets against naive baselines as well as deep learning-based models, by employing the CheXpert labeler and MIRQI as clinical correctness evaluations, and NLP metrics as a secondary evaluation. We also provide further evidence indicating that traditional NLP metrics are not suitable for this task by presenting their lack of robustness in multiple cases. We show that slightly altering a template-based model can increase NLP metrics considerably while maintaining high clinical performance. Our work contributes a simple but effective approach for chest X-ray report generation, as well as support for a model evaluation focused primarily on clinical correctness metrics and secondarily on NLP metrics.
This study examines how newsroom work in the United States has changed in response to some of the latest developments in the news media environment. Using nationally representative survey data, we explore what professional routines American journalists have adopted to avoid spreading or being accused of publishing misinformation. Findings suggest that journalists have added new or intensified practices to increase accountability and transparency. In addition, role conceptions, perception of fake news, and responsibility for social media audiences impact the adoption of such practices. Journalists are more likely to embrace transparency than accountability, suggesting the emergence of new journalistic norms in today’s newsrooms.
Eating disorders are psychological conditions characterized by unhealthy eating habits. Anorexia nervosa (AN) is defined as the belief of being overweight despite being dangerously underweight. The psychological signs involve emotional and behavioral issues. There is evidence that signs and symptoms can manifest on social media, wherein both harmful and beneficial content is shared daily.
Political parties with activists are in decline due to various external shocks. Societal changes, like the emergence of new communication technologies, have diminished the role and number of activists, while party elites increasingly can make do without grassroots activists. However, recent scholarship concerning different democracies has shown how activism still matters for representation. This book contributes to this literature by analyzing the unique case of the Uruguayan Frente Amplio (FA), the only mass-organic, institutionalized leftist party in Latin America. Using thick description, systematic process tracing, and survey research, this case study highlights the value of an organization-centered approach for understanding parties’ role in democracy. Within the FA, organizational rules grant activists a significant voice, which imbues activists’ participation with a strong sense of efficacy. This book is an excellent resource for scholars and students of Latin America and comparative politics who are interested in political parties and the challenges confronting new democracies.
Alternatives to recurrent neural networks, in particular architectures based on self-attention, are gaining momentum for processing input sequences. In spite of their relevance, the computational properties of such networks have not yet been fully explored. We study the computational power of the Transformer, one of the most paradigmatic architectures exemplifying self-attention. We show that the Transformer with hard attention is Turing complete exclusively based on its capacity to compute and access internal dense representations of the data. Our study also reveals some minimal sets of elements needed to obtain this completeness result.
By replying to Kurt Weyland’s (2020) comparative study of populism, we revisit optimistic perspectives on the health of American democracy in light of existing evidence. Relying on a set-theoretical approach, Weyland concludes that populists succeed in subverting democracy only when institutional weakness and conjunctural misfortune are observed jointly in a polity, thereby conferring on the United States immunity to democratic reversal. We challenge this conclusion on two grounds. First, we argue that the focus on institutional dynamics neglects the impact of the structural conditions in which institutions are embedded, such as inequality, racial cleavages, and changing political attitudes among the public. Second, we claim that endogeneity, coding errors, and the (mis)use of Boolean algebra raise questions about the accuracy of the analysis and its conclusions. Although we are skeptical of crisp-set Qualitative Comparative Analysis as an adequate modeling choice, we replicate the original analysis and find that the paths toward democratic backsliding and continuity are both potentially compatible with the United States.
The aim of this paper is to identify, given certain democratic normative standards regarding deliberation, some pros as well as cons of possible online deliberation designs due to variations in two key design dimensions: namely, asynchronicity and anonymity. In particular, we consider one crucial aspect of deliberative argumentation: namely, its reciprocity, which puts interaction centre stage to capture the back-and-forth of reasons. More precisely, we focus on two essential features of the deliberative interaction: namely, its listening widely and listening carefully. We conclude that one sort of online deliberation that combines the two design features of anonymity and asynchronicity is likely to better promote the reciprocity required for democratic deliberation than both natural and designed offline deliberations (such as the designed deliberation in Deliberative Polling) and online simulations of them.
As the number of vehicles and devices equipped with GPS technology has grown explosively, an urgent need has arisen for time- and space-efficient data structures to represent their trajectories. The most commonly desired queries are the following: queries about an object’s trajectory, range queries, and nearest neighbor queries. In this paper, we consider that the objects can move freely and we present a new compressed data structure for storing their trajectories, based on a combination of logs and snapshots, with the logs storing sequences of the objects’ relative movements and the snapshots storing their absolute positions sampled at regular time intervals. We call our data structure ContaCT because it provides Constant- time access to Compressed Trajectories. Its logs are based on a compact partial-sums data structure that returns cumulative displacement in constant time, and allows us to compute in constant time any object’s position at any instant, enabling a speedup when processing several other queries. We have compared ContaCT experimentally with another compact data structure for trajectories, called GraCT, and with a classic spatio-temporal index, the MVR-tree. Our results show that ContaCT outperforms the MVR-tree by orders of magnitude in space and also outperforms the compressed representation in time performance.
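The constant-time access idea behind the logs can be illustrated with plain, uncompressed prefix sums of the relative movements; ContaCT obtains the same O(1) query over a compact partial-sums structure instead, which this sketch does not reproduce.

```python
# Illustrative (uncompressed) version of the log idea: storing cumulative
# displacements gives constant-time access to an object's position at any
# sampled instant.
class TrajectoryLog:
    def __init__(self, start_xy, moves):
        """moves: list of (dx, dy) relative displacements, one per time step."""
        x0, y0 = start_xy
        self.xs = [x0]
        self.ys = [y0]
        for dx, dy in moves:                 # prefix sums of the displacements
            self.xs.append(self.xs[-1] + dx)
            self.ys.append(self.ys[-1] + dy)

    def position(self, t):
        """Position after t time steps, in constant time."""
        return self.xs[t], self.ys[t]

log = TrajectoryLog((0, 0), [(1, 0), (0, 2), (-1, 1)])
print(log.position(3))   # (0, 3)
```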
Compiler correctness, in its simplest form, is defined as the inclusion of the set of traces of the compiled program in the set of traces of the original program. This is equivalent to the preservation of all trace properties. Here, traces collect, for instance, the externally observable events of each execution. However, this definition requires the set of traces of the source and target languages to be the same, which is not the case when the languages are far apart or when observations are fine-grained. To overcome this issue, we study a generalized compiler correctness definition, which uses source and target traces drawn from potentially different sets and connected by an arbitrary relation. We set out to understand what guarantees this generalized compiler correctness definition gives us when instantiated with a non-trivial relation on traces. When this trace relation is not equality, it is no longer possible to preserve the trace properties of the source program unchanged. Instead, we provide a generic characterization of the target trace property ensured by correctly compiling a program that satisfies a given source property, and dually, of the source trace property one is required to show to obtain a certain target property for the compiled code. We show that this view on compiler correctness can naturally account for undefined behavior, resource exhaustion, different source and target values, side channels, and various abstraction mismatches. Finally, we show that the same generalization also applies to many definitions of secure compilation, which characterize the protection of a compiled program linked against adversarial code.
Decision support systems have become increasingly popular in the domain of agriculture. With the development of automated machine learning, agricultural experts are now able to train, evaluate and make predictions using cutting edge machine learning (ML) models without the need for much ML knowledge. Although this automated approach has led to successful results in many scenarios, in certain cases (e.g., when few labeled datasets are available) choosing among different models with similar performance metrics is a difficult task. Furthermore, these systems do not commonly allow users to incorporate their domain knowledge that could facilitate the task of model selection, and to gain insight into the prediction system for eventual decision making. To address these issues, in this paper we present AHMoSe, a visual support system that allows domain experts to better understand, diagnose and compare different regression models, primarily by enriching model-agnostic explanations with domain knowledge. To validate AHMoSe, we describe a use case scenario in the viticulture domain, grape quality prediction, where the system enables users to diagnose and select prediction models that perform better. We also discuss feedback concerning the design of the tool from both ML and viticulture experts.
In digital archaeology, a large research area is concerned with the computer-aided analysis of 3D captured ancient pottery objects. A key aspect thereby is the analysis of motifs and patterns that were painted on these objects’ surfaces. In particular, the automatic identification and segmentation of repetitive patterns is an important task serving different applications such as documentation, analysis and retrieval. Such patterns typically contain distinctive geometric features and often appear in repetitive ornaments or friezes, thus exhibiting a significant amount of symmetry and structure. At the same time, they can occur at varying sizes, orientations and irregular placements, posing a particular challenge for the detection of similarities. A key prerequisite to develop and evaluate new detection approaches for such repetitive patterns is the availability of an expressive dataset of 3D models, defining ground truth sets of similar patterns occurring on their surfaces. Unfortunately, such a dataset has not been available so far for this particular problem. We present an annotated dataset of 82 different 3D models of painted ancient Peruvian vessels, exhibiting different levels of repetitiveness in their surface patterns. To serve the evaluation of detection techniques for similar patterns, our dataset was labeled by archaeologists who identified clearly definable pattern classes. Given those classes, we manually annotated their respective occurrences on the mesh surfaces. Along with the data, we introduce an evaluation benchmark that can rank different recognition techniques for repetitive patterns based on the mean average precision of correctly segmented 3D mesh faces. An evaluation of different incremental sampling-based detection approaches, as well as a domain-specific technique, demonstrates the applicability of our benchmark. With this benchmark we especially want to address the geometry processing community, and expect it will induce novel approaches for pattern analysis based on geometric reasoning like 2D shape and symmetry analysis. This can enable novel research approaches in the Digital Humanities and related fields, based on digitized 3D Cultural Heritage artifacts. Alongside the source code for our evaluation scripts, we provide our annotation tools for the public to extend the benchmark and further increase its variety.
In this work, we study two simple yet general complexity classes, based on logspace Turing machines, that provide a unifying framework for efficient query evaluation in areas such as information extraction and graph databases, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting, and uniform generation of solutions, and show that they have several desirable properties in this respect.
Both complexity classes are defined in terms of non-deterministic logspace transducers (NL-transducers). For the first class, we consider the case of unambiguous NL-transducers, and we prove constant-delay enumeration and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL-transducers, and we obtain polynomial-delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm (with preprocessing) for uniform generation. Remarkably, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n (given in unary) accepted by a non-deterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether it admits an FPRAS. In this work, we solve this open problem and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
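To fix ideas, the following minimal Python sketch states the #NFA quantity by exhaustive enumeration; the NFA encoding and all names below are our own, and the loop is exponential in n, which is exactly why the existence of an FPRAS is the interesting question.

# Illustrative only: exact brute-force counting of the #NFA quantity, i.e., the
# number of strings of length n accepted by an NFA. Exponential in n.
from itertools import product

class Nfa:
    def __init__(self, states, alphabet, delta, start, accepting):
        self.states = states          # iterable of states
        self.alphabet = alphabet      # iterable of symbols
        self.delta = delta            # dict: (state, symbol) -> set of states
        self.start = start            # single initial state
        self.accepting = set(accepting)

    def accepts(self, word):
        current = {self.start}
        for symbol in word:
            current = set().union(*(self.delta.get((q, symbol), set()) for q in current))
            if not current:
                return False
        return bool(current & self.accepting)

def count_accepted(nfa, n):
    """Count strings of length n accepted by nfa (exponential-time baseline)."""
    return sum(1 for w in product(nfa.alphabet, repeat=n) if nfa.accepts(w))

# Example: NFA over {a, b} accepting strings that contain "ab" as a substring.
delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
         ("q1", "b"): {"q2"}, ("q2", "a"): {"q2"}, ("q2", "b"): {"q2"}}
nfa = Nfa({"q0", "q1", "q2"}, ("a", "b"), delta, "q0", {"q2"})
print(count_accepted(nfa, 4))  # number of length-4 strings containing "ab"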
Bolivia’s long-standing sociopolitical divisions were inflamed by a disputed election, as well as by the coronavirus pandemic and its economic fallout. Now, the new government must restore citizens’ trust or risk continued unrest.
In order to reduce the spread of COVID-19, many universities and colleges have closed their campus and implemented what researchers call ‘emergency online education’. This means that many faculty members are teaching in front of computer screens while students are staying at home and taking their courses remotely. Unfortunately, this leaves students without some advantages of residential education such as study spaces, face-to-face counseling, and recreational facilities. In the case of engineering students, this has also left them without access to maker spaces, laboratories, and field trips (among other activities that enrich their learning experience).
To understand how the consequences of this pandemic have affected students’ well-being, some researchers have implemented cross-sectional survey studies. These types of studies are frequently used to measure stakeholders’ needs for support services as they relate to courses, programs or involvement in institutional planning. So far, there is a growing body of knowledge regarding factors that have affected students’ mental health, along with scales to measure students’ anxiety levels. However, the pandemic has come with confusing and changing information, making it more difficult for educational institutions to implement timely support strategies to maintain some sense of well-being among their students. Given the close relationship between student well-being and learning outcomes, more studies are needed to not only understand factors that might negatively affect students’ learning experiences, but also examine interventions that might positively impact students’ resilience.
This paper presents a Work-In-Progress (WIP) study that was carried out in a large engineering school in Latin America. Like many schools in other countries, this school shifted to online education during 2020. In order to monitor students’ needs in this remote learning context, a cross-sectional survey study was conducted to evaluate their use of different types of support interventions that have been implemented since the pandemic started. Specifically, this paper presents the perceived benefits of having implemented a one-week mid-semester break to reduce stress during the first academic period. During the week after the break, we applied an online anonymous survey to a convenience sample of 994 engineering students from different admission cohorts and majors. Findings reveal not only how many hours students declared spending on studying, resting, and doing recreational activities during that break, but also the percentage of students who perceived that this break was beneficial to their overall well-being. Future work will focus on assessing other types of support interventions that were implemented throughout that year, besides providing recommendations to monitor and support engineering students in different educational settings.
Conjunctive queries are one of the most common classes of queries used in database systems, and among the best studied in the literature. A seminal result of Grohe, Schwentick, and Segoufin (STOC 2001) demonstrates that for every class G of graphs, the evaluation of all conjunctive queries whose underlying graph is in G is tractable if, and only if, G has bounded treewidth. In this work, we extend this characterization to the counting problem for conjunctive queries. Specifically, for every class C of conjunctive queries with bounded treewidth, we introduce the first fully polynomial-time randomized approximation scheme (FPRAS) for counting answers to a query in C, and the first polynomial-time algorithm for sampling answers uniformly from a query in C. As a corollary, it follows that for every class G of graphs, the counting problem for conjunctive queries whose underlying graph is in G admits an FPRAS if, and only if, G has bounded treewidth (unless BPP is different from P). In fact, our FPRAS is more general, and also applies to conjunctive queries with bounded hypertree width, as well as unions of such queries.
The key ingredient in our proof is the resolution of a fundamental counting problem from automata theory. Specifically, we demonstrate the first FPRAS and polynomial time sampler for the set of trees of size n accepted by a tree automaton, which improves the prior quasi-polynomial time randomized approximation scheme (QPRAS) and sampling algorithm of Gore, Jerrum, Kannan, Sweedyk, and Mahaney ’97. We demonstrate how this algorithm can be used to obtain an FPRAS for many open problems, such as counting solutions to constraint satisfaction problems (CSP) with bounded hypertree width, counting the number of error threads in programs with nested call subroutines, and counting valid assignments to structured DNNF circuits.
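As a point of reference, the counting problem addressed above can be stated with a naive exact counter that enumerates all variable assignments; the query and database encodings and names below are our own assumptions, and this brute force runs in time |adom|^(#variables), in contrast with the FPRAS and uniform sampler obtained in the paper for bounded (hyper)tree width.

# Illustrative only: naive exact counting of answers to a conjunctive query.
from itertools import product

def count_answers(atoms, db, variables):
    """atoms: list of (relation_name, tuple_of_variables); db: dict name -> set of tuples."""
    adom = {c for rel in db.values() for t in rel for c in t}  # active domain
    count = 0
    for values in product(adom, repeat=len(variables)):
        nu = dict(zip(variables, values))
        if all(tuple(nu[v] for v in vs) in db[rel] for rel, vs in atoms):
            count += 1
    return count

# Example: count answers to Q(x, y, z) <- E(x, y), E(y, z) (paths of length 2).
db = {"E": {(1, 2), (2, 3), (2, 4), (3, 1)}}
atoms = [("E", ("x", "y")), ("E", ("y", "z"))]
print(count_answers(atoms, db, ("x", "y", "z")))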
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. The most successful methods for video captioning have a strong dependency on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation, which depends only on the video, and a syntactic representation, which depends only on Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We test our proposed architecture on two widely used video captioning datasets, the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset, obtaining state-of-the-art results.
Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, it has recently been claimed that the SHAP-score can be computed in polynomial time over the class of decision trees. In this paper, we provide a proof of a stronger result over Boolean models: the SHAP-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits, also known as tractable Boolean circuits, generalize a wide range of Boolean circuit and binary decision diagram classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). We also establish the computational limits of the notion of SHAP-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the SHAP-score intractable (namely, #P-hard).
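For reference, the SHAP-score mentioned above is an instantiation of the Shapley value over features. Writing F for the set of features, e for the entity whose classification is being explained, and \phi_{e,S}(M) for the expected output of the model M over entities that agree with e on the features in S (the notation here is ours, with the expectation taken over the underlying feature distribution, e.g., a product distribution), the score of a feature x is

\[
\mathrm{SHAP}(M,e,x) \;=\; \sum_{S \subseteq F \setminus \{x\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \Big( \phi_{e,\,S\cup\{x\}}(M) - \phi_{e,\,S}(M) \Big).
\]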
Consistent answers to a query over a possibly inconsistent database are answers that are simultaneously retrieved from every possible repair of the database. Repairs are consistent instances that minimally differ from the original inconsistent instance. It has been shown before that database repairs can be specified as the stable models of a disjunctive logic program. In this paper we show how to use repair programs to transform the problem of consistent query answering into a problem of reasoning w.r.t. a theory written in second-order predicate logic. We also investigate how a first-order theory can be obtained instead by applying second-order quantifier elimination techniques.
There are some recent approaches and results about the use of answer-set programming for specifying counterfactual interventions on entities under classification, and reasoning about them. These approaches are flexible and modular in that they allow the seamless addition of domain knowledge. Reasoning is enabled by query answering from the answer-set program. The programs can be used to specify and compute responsibility-based numerical scores as attributive explanations for classification results.
Graphs have become the best way we know of representing knowledge. The computing community has investigated and developed the support for managing graphs by means of digital technology. Graph databases and knowledge graphs surface as the most successful solutions to this program. This tutorial will provide a conceptual map of the data management tasks underlying these developments, paying particular attention to data models and query languages for graphs.
LocWeb and TempWeb 2021 were the eleventh events in their workshop series and took place co-located on 12th April 2021 in conjunction with The Web Conference WWW 2021. They were intended to be held in Ljubljana, Slovenia as a potentially hybrid event, but due to the pandemic, were fully moved online.
LocWeb and TempWeb were held as one colocated session with a merged programme and shared topics, to explore similarities and introduce attendees to the two related and complementary areas. LocWeb 2021 explored the intersection of location-based analytics and Web architecture with a focus on Web-scale services and location-aware information access. TempWeb 2021 discussed temporal analytics at Web scale with experts from science and industry.
We report on a community effort between industry and academia to shape the future of property graph constraints. The standardization of a property graph query language is currently underway through the ISO Graph Query Language (GQL) project. Our position is that this project should pay close attention to schemas and constraints, and should focus next on key constraints. The main purposes of keys are enforcing data integrity and allowing the referencing and identifying of objects. Motivated by use cases from our industry partners, we argue that key constraints should be able to have different modes, which are combinations of basic restrictions that require the key to be exclusive, mandatory, and singleton. Moreover, keys should be applicable to nodes, edges, and properties, since these can all represent valid real-life entities. Our result is PG-Keys, a flexible and powerful framework for defining key constraints, which fulfills the above goals. PG-Keys is a design by the Linked Data Benchmark Council’s Property Graph Schema Working Group, consisting of members from industry, academia, and the ISO GQL standards group, intending to bring the best of all worlds to property graph practitioners. PG-Keys aims to guide the evolution of the standardization efforts towards making systems more useful, powerful, and expressive.
Among all Internet Quality of Service (QoS) indicators, Round-trip time (RTT), jitter and packet loss have been thoroughly studied due to their great impact on the overall network’s performance and the Quality of Experience (QoE) perceived by the users. Considering that, we managed to generate a real-world dataset with a comprehensive contextualization of these important quality indicators by passively monitoring the network in user-space. To generate this dataset, we first developed a novel Periodic Passive Ping (PePa Ping) methodology for Android devices. Contrary to other works, PePa Ping periodically obtains RTT, jitter, and number of lost packets of all TCP connections. This passive approach relies on the implementation of a local VPN server residing inside the client device to manage all Internet traffic and obtain QoS information of the connections established. The collected QoS indicators are provided directly by the Linux kernel, and therefore, they are exceptionally close to real QoS values experienced by users’ devices. Additionally, the PePa Ping application continuously measured other indicators related to each individual network flow, the state of the device, and the state of the Internet connection (either WiFi or Mobile). With all the collected information, each network flow can be precisely linked to a set of environmental data that provides a comprehensive contextualization of each individual connection.
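As a rough illustration of how a jitter indicator can be derived from successive RTT samples (this is not the PePa Ping implementation, whose values come directly from the Linux kernel), a smoothed estimator in the spirit of RFC 3550 can be sketched as follows.

# Illustrative sketch: smoothed jitter from successive RTT samples, following
# the spirit of RFC 3550's interarrival jitter estimator J += (|D| - J) / 16.
def smoothed_jitter(rtt_samples_ms):
    """Return the running jitter estimate (ms) over a sequence of RTT samples."""
    jitter = 0.0
    previous = None
    for rtt in rtt_samples_ms:
        if previous is not None:
            jitter += (abs(rtt - previous) - jitter) / 16.0
        previous = rtt
    return jitter

print(smoothed_jitter([42.1, 40.8, 55.0, 41.3, 43.7]))  # samples in milliseconds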
In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.
The field of natural language understanding has experienced exponential progress in the last few years, with impressive results in several tasks. This success has motivated researchers to study the underlying knowledge encoded by these models. Despite this, attempts to understand their semantic capabilities have not been successful, often leading to inconclusive or contradictory findings across different works. Via a probing classifier, we extract the underlying knowledge graph of nine of the most influential language models of the last years, including word embeddings, text generators, and context encoders. This probe is based on concept relatedness, grounded on WordNet. Our results reveal that all the models encode this knowledge, but suffer from several inaccuracies. Furthermore, we show that the different architectures and training strategies lead to different model biases. We conduct a systematic evaluation to discover specific factors that explain why some concepts are challenging. We hope our insights will motivate the development of models that capture concepts more precisely.
Recent machine-learning approaches to deterministic search and domain-independent planning employ policy learning to speed up search. Unfortunately, when attempting to solve a search problem by successively applying a policy, no guarantees can be given on solution quality. The problem of how to effectively use a learned policy within a bounded-suboptimal search algorithm remains largely an open question. In this paper, we propose various ways in which such policies can be integrated into Focal Search, assuming that the policy is a neural network classifier. Furthermore, we provide mathematical foundations for some of the resulting algorithms. To evaluate the resulting algorithms over a number of policies with varying accuracy, we use synthetic policies, which can be generated for a target accuracy for problems whose search space can be held in memory. We evaluate our focal search variants over three benchmark domains using our synthetic approach, and on the 15-puzzle using a neural network learned from 1.5 million examples. We observe that Discrepancy Focal Search, which we show expands the node that maximizes an approximation of the probability that its corresponding path is a prefix of an optimal path, obtains, in general, the best results in terms of runtime and solution quality.
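The synthetic-policy idea can be illustrated with a minimal sketch: given a table of optimal actions, build a stochastic policy that matches a target accuracy. The uniform-error model and all names below are assumptions, not the paper's exact construction.

# Minimal sketch of a synthetic policy with target accuracy p: return the
# optimal action with probability p, otherwise a uniformly random other action.
import random

def make_synthetic_policy(optimal_action, num_actions, accuracy, seed=0):
    rng = random.Random(seed)

    def policy(state):
        best = optimal_action[state]           # e.g., taken from a precomputed table
        if rng.random() < accuracy or num_actions == 1:
            return best
        others = [a for a in range(num_actions) if a != best]
        return rng.choice(others)

    return policy

policy = make_synthetic_policy({"s0": 2, "s1": 0}, num_actions=4, accuracy=0.8)
print(policy("s0"))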
In this talk I will present two recent examples of my research on explainability problems over machine learning (ML) models. In rough terms, these explainability problems deal with specific queries one poses over an ML model in order to obtain meaningful justifications for its results. Both of the examples I will present deal with “local” and “post-hoc” explainability queries. Here “local” means that we intend to explain the output of the ML model for a particular input, while “post-hoc” refers to the fact that the explanation is obtained after the model is trained. In the process I will also establish connections with problems studied in data management, with the intention of suggesting new possibilities for cross-fertilization between data management and ML.
The first example I will present refers to computing explanations with scores based on Shapley values, in particular with the recently proposed, and already influential, SHAP-score. This score provides a measure of how different features in the input contribute to the output of the ML model. We provide a detailed analysis of the complexity of this problem for different classes of Boolean circuits. In particular, we show that the problem of computing SHAP-scores is tractable as long as the circuit is deterministic and decomposable, but becomes computationally hard if any of these restrictions is lifted. The tractability part of this result provides a generalization of a recent result stating that, for Boolean hierarchical conjunctive queries, the Shapley-value of the contribution of a tuple in the database to the final result can be computed in polynomial time.
The second example I will present refers to the comparison of different ML models in terms of important families of (local and post-hoc) explainability queries. For the models, I will consider multi-layer perceptrons and binary decision diagrams. The main object of study will be the computational complexity of the aforementioned queries over such models. The obtained results provide an interesting theoretical counterpart to conventional wisdom's claims on interpretability. This work also suggests the need for developing query languages that support the process of retrieving explanations from ML models, and for obtaining general tractability results for such languages over specific classes of models.
Over the years, population protocols with the goal of reaching consensus have been studied in great depth. However, many systems in the real world do not result in all agents eventually reaching consensus, but rather in the opposite: they converge to a state of rich diversity. Consider, for example, task allocation in ants. If eventually all ants perform the same task, then the colony will perish (lack of food, no brood care, etc.). Thus, it is vital for the survival of the colony to have a diverse set of tasks and enough ants working on each task. What complicates matters is that ants need to switch tasks periodically to adjust to the needs of the colony, e.g., when too many foragers fall victim to other ant colonies. A further difficulty is that not all tasks are equally important, and the colony may need to maintain certain proportions in its distribution of tasks. How can ants keep a healthy and balanced allocation of tasks?
To answer this question, we propose a simple population protocol for n agents on a complete graph and an arbitrary initial distribution of k colours (tasks). In this protocol we assume that each colour i has an associated weight (importance or value) w_i ≥ 1. Denoting by w the sum of the weights of the different colours, we show that the protocol converges in O(w^2 n log n) rounds to a configuration where the number of agents supporting each colour i is concentrated on the fair share w_i n / w, and will stay concentrated for a large number of rounds, w.h.p.
That is, every task is eventually performed by a number of agents proportional to its weight w_i. Our protocol has many interesting properties: agents do not need to know the other colours and weights in the system, and our protocol requires very little memory per agent. Furthermore, the protocol guarantees fairness, meaning that over a long period each agent holds each colour a number of times roughly proportional to the weight of that colour. Finally, our protocol also fulfils sustainability, meaning that no colour ever vanishes. All of these properties still hold when an adversary adds agents or colours.
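To make the fair-share target w_i·n/w concrete, the following naive simulation checks concentration around it. The update rule used here (agents resample directly from the global weight distribution) is deliberately not the paper's protocol: it assumes knowledge of all colours and weights, which the protocol above explicitly avoids needing.

# Naive simulation illustrating the fair-share target w_i * n / w. The update
# rule (resampling from the global weight distribution) is NOT the paper's
# protocol; it is only a baseline to visualise the target configuration.
import random
from collections import Counter

def simulate(n, weights, rounds, seed=0):
    rng = random.Random(seed)
    colours = list(weights)
    agents = [rng.choice(colours) for _ in range(n)]   # arbitrary initial distribution
    for _ in range(rounds):
        i = rng.randrange(n)
        agents[i] = rng.choices(colours, weights=[weights[c] for c in colours])[0]
    return Counter(agents)

weights = {"forage": 3, "brood_care": 2, "nest_repair": 1}
counts = simulate(n=600, weights=weights, rounds=20000)
fair = {c: 600 * w / sum(weights.values()) for c, w in weights.items()}
print(counts, fair)   # counts should concentrate near the fair shares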
Reinforcement learning allows learning very accurate heuristics for hard combinatorial puzzles like the 15-puzzle, the 24-puzzle, and Rubik’s cube. In this paper, we empirically investigate how to exploit these learned heuristics in the context of (deterministic) heuristic search with bounded suboptimality guarantees, using the learned heuristics for the 15- and 24-puzzle of DeepCubeA. We show that Focal Search (FS), in its most straightforward form, that is, using the learned heuristic to sort the focal list, has poor performance when compared to Focal Discrepancy Search (FDS), a version of FS that we propose which uses a discrepancy function to sort the focal list. This is interesting because the best-performing algorithm does not use the heuristic values themselves, but only the ranking they induce among the successors of a node. In addition, we show that FDS is competitive with the satisficing search algorithms Weighted A* and Greedy Best-First Search.
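A compact sketch of the Focal Search skeleton underlying both variants is given below: OPEN is ordered by f = g + h, FOCAL contains the OPEN nodes with f ≤ w·f_min, and the next expansion minimizes a secondary key over FOCAL; plugging in a discrepancy-based secondary key yields the FDS variant discussed above. The toy graph and all names are assumptions.

# Sketch of bounded-suboptimal Focal Search with a pluggable secondary key.
import heapq
import itertools

def focal_search(start, goal, successors, h, secondary, w=1.5):
    """Returns (cost, path) with cost <= w * optimal (sketch, no re-expansion handling)."""
    tick = itertools.count()                 # tie-breaker so heap entries stay comparable
    open_heap = [(h(start), next(tick), start, 0, [start])]
    best_g = {start: 0}
    while open_heap:
        f_min = open_heap[0][0]
        focal = [e for e in open_heap if e[0] <= w * f_min]   # FOCAL list
        entry = min(focal, key=lambda e: secondary(e[2]))     # e.g., discrepancy of node
        open_heap.remove(entry)
        heapq.heapify(open_heap)
        f, _, node, g, path = entry
        if node == goal:
            return g, path
        for succ, cost in successors(node):
            g2 = g + cost
            if g2 < best_g.get(succ, float("inf")):
                best_g[succ] = g2
                heapq.heappush(open_heap, (g2 + h(succ), next(tick), succ, g2, path + [succ]))
    return None

# Toy example: a small graph, a zero (admissible) heuristic, and a hypothetical
# per-node discrepancy value used as the secondary key.
graph = {"A": [("B", 1), ("C", 1)], "B": [("D", 1)], "C": [("D", 3)], "D": []}
disc = {"A": 0, "B": 1, "C": 0, "D": 0}
print(focal_search("A", "D", lambda n: graph[n], lambda n: 0, lambda n: disc[n], w=2.0))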
Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.
As an essential high-level task within the topic of video understanding, automatically describing a video in natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as the existence of gaps in current semantic representations and the limited expressiveness of the generated captions. To deal with these limitations, in this paper we present a new architecture that we call Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate allows the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.
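A minimal numpy sketch of the adaptive-gating idea (not the actual AVSSN gate, whose parametrisation is richer) mixes visual and semantic features according to the decoder's hidden state.

# Illustrative gate: beta in (0,1) decides how much visual vs. semantic
# information enters the fused context used by the decoder.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate(hidden, visual, semantic, W, b):
    """hidden/visual/semantic: 1-D feature vectors; W, b: learned gate parameters."""
    gate_input = np.concatenate([hidden, visual, semantic])
    beta = sigmoid(W @ gate_input + b)            # scalar in (0, 1)
    return beta * visual + (1.0 - beta) * semantic

rng = np.random.default_rng(0)
h, v, s = rng.normal(size=8), rng.normal(size=16), rng.normal(size=16)
W, b = rng.normal(size=8 + 16 + 16), 0.0
print(adaptive_gate(h, v, s, W, b).shape)        # fused context: (16,)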
We describe how answer-set programs can be used to declaratively specify counterfactual interventions on entities under classification, and reason about them. In particular, they can be used to define and compute responsibility scores as attribution-based explanations for outcomes from classification models. The approach allows for the inclusion of domain knowledge and supports query answering. A detailed example with a naive-Bayes classifier is presented.
We present an RDF Data Cube – integrated from numerous sources on the Web – that describes countries in terms of general variables (e.g., GDP, population density) and COVID-19 variables. On top of this data cube, we develop a system that computes and visualises correlations between these variables, providing insights into the factors that correlate with COVID-19 cases, deaths, etc., on an international level.
Representing a static set of integers S, |S| = n, from a finite universe U = [1..u] is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of rank and select on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures. Although there is an information-theoretic lower bound of B(n,u) = lg (u choose n) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where the elements of S contain runs of ℓ > 1 consecutive elements, one that occurs in many practical situations. Let C(n) denote the class of (u choose n) distinct sets of n elements over the universe [1..u]. Let also C(n)_g ⊂ C(n) contain the sets whose n elements are arranged in g ≤ n runs of ℓ_i ≥ 1 consecutive elements from U, for i = 1, …, g, and let C(n)_{g,r} ⊂ C(n)_g contain all sets that consist of g runs such that r ≤ g of them have at least 2 elements.
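For readers unfamiliar with the two operations, the following plain-Python sketch fixes the semantics of rank and select over the characteristic bit-vector of S, using a naive uncompressed representation; the compressed structures discussed above support the same operations within run-aware space bounds.

# Naive bit-vector with rank/select, only to fix the semantics of the operations.
def build_bitvector(S, u):
    return [1 if i in S else 0 for i in range(1, u + 1)]

def rank(B, i):
    """Number of 1s in B[1..i] (1-based)."""
    return sum(B[:i])

def select(B, j):
    """Position (1-based) of the j-th 1 in B, or None if there are fewer than j 1s."""
    count = 0
    for pos, bit in enumerate(B, start=1):
        count += bit
        if bit and count == j:
            return pos
    return None

S, u = {2, 3, 4, 9, 10}, 12          # two runs of consecutive elements
B = build_bitvector(S, u)
print(rank(B, 9), select(B, 4))      # -> 4, 9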
The topological model for spatial objects identifies common boundaries between regions, explicitly storing adjacency relations, which not only improves the efficiency of topology-related queries, but also provides advantages such as avoiding data duplication and facilitating data consistency. Recently, a compact representation of the topological model based on planar graph embeddings was proposed. In this article, we provide an elegant generalization of such a representation to support hierarchies of vector objects, which better fits the multi-granular nature of spatial data, such as the political and administrative partition of a country. This representation adds little space on top of the succinct base representation of each granularity, while efficiently answering new topology-related queries between objects not necessarily at the same level of granularity.
We propose an end-to-end deep learning model for translating free-form natural language instructions into a high-level plan for behavioral robot navigation. We use attention models to connect information from both the user instructions and a topological representation of the environment. We evaluate our model’s performance on a new dataset containing 10,050 pairs of navigation instructions. Our model significantly outperforms baseline approaches. Furthermore, our results suggest that it is possible to leverage the environment map as a relevant knowledge base to facilitate the translation of free-form navigation instructions.
In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
Many of the interactions between users on social networks are controversial, especially in polarized environments. In effect, rather than producing a space for deliberation, these environments foster the emergence of users that disqualify the position of others. On news sites, comments on the news are characterized by such interactions. This is detrimental to the construction of a deliberative and democratic climate, stressing the need for automatic tools that can provide an early detection of polarization and controversy. We introduce GENE (graph generation conditioned on named entities), a representation of user networks conditioned on the named entities (personalities, brands, organizations) which users comment upon. GENE models the leaning that each user has concerning entities mentioned in the news, and is able to segment the user network according to users’ polarity. Using the segmented network, we study the performance of two controversy indices: the existing Random Walks Controversy (RWC) and another one we introduce, Relative Closeness Controversy (RCC). These indices measure the interaction between the network’s poles, providing a metric to quantify the emergence of controversy. To evaluate the performance of GENE, we model the network of users of a popular news site in Chile, collecting data over an observation window of more than three years. A large-scale evaluation using GENE, on thousands of news items, allows us to conclude that over 60% of user comments have a predictable polarity. This predictability of the user interaction scenario allows both controversy indices to detect controversy successfully. In particular, our proposed RCC index shows satisfactory performance in the early detection of controversies using partial information collected during the first hours of the news event, with a sensitivity to the target class exceeding 90%.
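The random-walk intuition behind RWC-style indices can be sketched as follows; this simplified Monte Carlo version (with a toy graph and our own names) differs in its details from the exact RWC and RCC definitions evaluated in the paper.

# Simplified random-walk controversy sketch: walks from each side are absorbed
# at designated high-degree nodes, and the index contrasts same-side versus
# cross-side absorption probabilities.
import random

def absorption_probs(adj, start_side, absorbing_x, absorbing_y, walks=2000, seed=0):
    rng = random.Random(seed)
    hits_x = hits_y = 0
    for _ in range(walks):
        node = rng.choice(start_side)
        while node not in absorbing_x and node not in absorbing_y:
            node = rng.choice(adj[node])
        if node in absorbing_x:
            hits_x += 1
        else:
            hits_y += 1
    total = hits_x + hits_y
    return hits_x / total, hits_y / total

def rwc_style_index(adj, side_x, side_y, absorbing_x, absorbing_y):
    p_xx, p_xy = absorption_probs(adj, side_x, absorbing_x, absorbing_y)
    p_yx, p_yy = absorption_probs(adj, side_y, absorbing_x, absorbing_y)
    return p_xx * p_yy - p_xy * p_yx

# Toy polarized graph: two triangles {0,1,2} and {3,4,5} joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(rwc_style_index(adj, [0, 1, 2], [3, 4, 5], absorbing_x={0}, absorbing_y={5}))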
Works on knowledge graphs and graph-based data management often focus either on graph query languages or on frameworks for graph analytics, and there has been little work on combining the two approaches. However, many real-world tasks conceptually involve combinations of them: a graph query can be used to select the appropriate data, which is then enriched with analytics, and then possibly filtered or combined again with other data by means of a query language. In this paper we propose a language that is well-suited for both graph querying and analytical tasks. We propose a minimalistic extension of SPARQL to allow for expressing analytical tasks over existing SPARQL infrastructure; in particular, we propose to extend SPARQL with recursive features, and provide a formal syntax and semantics for our language. We show that this language can express key analytical tasks on graphs (in fact, it is Turing complete). Moreover, queries in this language can also be compiled into sequences of iterations of SPARQL update statements. We show how procedures in our language can be implemented over off-the-shelf SPARQL engines, with a specialised client that can leverage database operations to improve the performance of queries. Results for our implementation show that procedures for popular analytics currently run in seconds or minutes for selective sub-graphs (our target use-case).
We investigate global measures of vertex similarity for knowledge graphs. While vertex similarity has been explored in the context of directed, unlabelled graphs, measures based on recursive algorithms or learning frameworks can be costly to compute, assume labelled data, and/or provide poorly-interpretable results. Knowledge graphs further imply unique challenges for vertex similarity in terms of scale and diversity. We thus propose and explore global measures of vertex similarity for knowledge graphs that (i) are unsupervised, (ii) offer explanations of similarity results, (iii) take into consideration edge labels, and (iv) are robust in terms of redundant or interdependent information. Given that these measures can still be costly to compute precisely, we propose an approximation strategy that enables computation at scale. We compare our measures with a recursive measure (SimRank) for computing vertex similarity over subsets of Wikidata.
We explore solutions for representing archives of versioned RDF data using the SPARQL standard and off-the-shelf engines. We consider six representations of RDF archives based on named graphs, and describe how input queries can be automatically rewritten to return solutions for a particular version, or solutions that change between versions. We evaluate these alternatives over an archive of 8 weekly versions of Wikidata and 146 queries using Virtuoso as the SPARQL engine.
Given a Wikidata claim, we explore automated methods for locating references that support that claim. Our goal is to assist human editors in referencing claims, and thus increase the ratio of referenced claims in Wikidata. As an initial approach, we mine links from the references section of English Wikipedia articles, download and index their content, and use standard relevance-based measures to find supporting documents. We consider various forms of search phrasings, as well as different scopes of search. We evaluate our methods in terms of the coverage of reference documents collected from Wikipedia. We also develop a gold standard of sample items for evaluating the relevance of suggestions. Our results in general reveal that the coverage of Wikipedia reference documents for claims is quite low, but where a reference document is available, we can often suggest it within the first few results.
We propose laconic classification as a novel way to understand and compare the performance of diverse image classifiers. The goal in this setting is to minimise the amount of information (aka. entropy) required in individual test images to maintain correct classification. Given a classifier and a test image, we compute an approximate minimal-entropy positive image for which the classifier provides a correct classification, becoming incorrect upon any further reduction. The notion of entropy offers a unifying metric that allows us to combine and compare the effects of various types of reductions (e.g., crop, colour reduction, resolution reduction) on classification performance, in turn generalising similar methods explored in previous works. Proposing two complementary frameworks for computing the minimal-entropy positive images of both human and machine classifiers, in experiments over the ILSVRC test-set, we find that machine classifiers are more sensitive entropy-wise to reduced resolution (compared with cropping or colour reduction for machines, and compared with humans under reduced resolution), supporting recent results suggesting a texture bias in the ILSVRC-trained models used. We also find, in the evaluated setting, that humans classify the minimal-entropy positive images of machine models with higher precision than machines classify those of humans.
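The search for a minimal-entropy positive image can be sketched as follows, assuming a single reduction type (grey-level quantisation) and a stand-in classifier; the frameworks above combine several reduction types and real classifiers.

# Sketch: apply progressively stronger reductions while the classifier stays
# correct, and report the entropy of the last correctly classified version.
import numpy as np

def image_entropy(img, bins=256):
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())          # bits per pixel

def minimal_entropy_positive(img, label, classify, levels=(256, 64, 16, 4, 2)):
    best = img
    for k in levels:
        reduced = (img // (256 // k)) * (256 // k)   # quantise to k grey levels
        if classify(reduced) != label:
            break
        best = reduced
    return best, image_entropy(best)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))
dummy_classifier = lambda x: "cat" if x.mean() > 100 else "dog"   # stand-in model
print(minimal_entropy_positive(img, dummy_classifier(img), dummy_classifier)[1])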
In these lecture notes, we provide an overview of some of the high-level research directions and open questions relating to knowledge graphs. We discuss six high-level concepts relating to knowledge graphs: data models, queries, ontologies, rules, embeddings and graph neural networks. While traditionally these concepts have been explored by different communities in the context of graphs, more recent works have begun to look at how they relate to one another, and how they can be unified. In fact, at a more foundational level, we can find some surprising relations between the different concepts. The research questions we explore mostly involve combinations of these concepts.
In this chapter, we provide a detailed primer on the second version of the Web Ontology Language (OWL 2) standard. We first motivate the need for such a standard, discussing the role and importance of ontologies on the Web. We then describe how ontology languages, which themselves can be formally defined through model theory, can subsequently be used to formally define ontologies. Thereafter we discuss the OWL vocabulary used to define the semantics of classes, properties, individuals, and datatypes within ontologies. We cover some of the main reasoning tasks for ontologies and the applications in which they are used. We discuss how these core reasoning tasks are undecidable for the full OWL (2) language and outline the sub-languages (aka. profiles) proposed by the standard that allow for more efficient reasoning procedures. We conclude by reflecting on the importance of having expressive ontologies on the Web of Data, and discuss open challenges.
This book concisely brings together the key standards and best practices relating to modelling, querying, validating and linking machine-readable data and semantics on the Web. Alongside practical examples and formal definitions, the book shows how these standards contribute to – and have been used thus far on – the “Web of Data”: a machine readable evolution of the Web marked by increased automation, enabling powerful Web applications capable of discovering, cross-referencing, and organising data from numerous websites in a matter of seconds. The book is divided into nine chapters, the first of which highlights the fundamental shortcomings of the current Web that illustrate the need for increased machine readability. The next chapter outlines the core concepts of the “Web of Data”, discussing use-cases on the Web where they have already been deployed. “Resource Description Framework (RDF)” describes the graph-structured data model proposed by the Semantic Web community as a common data model for the Web. The chapter on “RDF Schema (RDFS) and Semantics” presents a lightweight ontology language used to define an initial semantics for RDF graphs. In turn, the chapter “Web Ontology Language (OWL)” elaborates on a much more expressive ontology language built upon RDFS. In “SPARQL Query Language” a language for querying and updating RDF graphs is described. “Shape Constraints and Expressions (SHACL/ShEx)” introduces two languages for describing the expected structure of – and expressing constraints over – RDF graphs for the purposes of validation. “Linked Data” discusses the principles and best practices by which interlinked (RDF) data can be published on the Web, and how they have been adopted. The final chapter highlights open problems and concludes with a general discussion on the future of the Web of Data. The book is intended for students, researchers and advanced practitioners interested in learning more about the Web of Data, and about closely related topics such as the Semantic Web, Knowledge Graphs, Linked Data, Graph Databases, Ontologies, etc. Offering a range of accessible examples and exercises, it can be used as a textbook for students and other newcomers to the field. It can also serve as a reference handbook for researchers and developers, as it offers up-to-date details on key standards (RDF, RDFS, OWL, SPARQL, SHACL, ShEx, RDB2RDF, LDP), along with formal definitions and references to further literature. The associated website webofdatabook.org offers a wealth of complementary material, including solutions to the exercises, slides for classes, interactive examples, and a section for comments and questions.
This chapter provides a detailed introduction to the SPARQL Protocol and RDF Query Language (SPARQL 1.1): the standard query language for RDF. After some initial motivation, we delve into the features of the query language, illustrated with concrete examples. We then formally define the semantics of these query features. We next discuss how federated queries can be used to evaluate queries over multiple remote sources on the Web. We detail the SPARQL Update language, which allows for modifying the data indexed by a SPARQL query service. We introduce SPARQL Entailment Profiles, which allow for query results to consider entailments, including support for RDF, RDFS and OWL semantics. We further discuss the HTTP-based protocol by which requests can be issued to a SPARQL service over the Web, as well as the SPARQL Service Description vocabulary, which can be used to describe and advertise the features supported by such services. We conclude by discussing the importance of SPARQL for the Web of Data, the key research directions that are currently being explored, as well as open challenges.
In this chapter, we introduce two languages for describing shapes and constraints for RDF graphs, namely the Shapes Constraint Language (SHACL) and the Shape Expressions Language (ShEx 2.1). Both languages allow for defining constraints over RDF graphs in terms of what data are expected, what data are obligatory, what data are allowed, and what data are disallowed. This in turn allows RDF graphs to be validated with respect to the specified constraints. We first look at SHACL, describing the SHACL-Core fragment and the constraints it allows. We then discuss how SHACL-SPARQL allows for further constraints to be expressed using SPARQL query syntax. Turning to ShEx, we describe its syntaxes, and how it differs from SHACL. We outline and provide a semantics for an abstract shapes syntax that generalises SHACL and ShEx. We conclude with a general discussion of the role of shapes languages on the Web of Data, as well as open challenges.
This chapter provides a detailed primer for the Resource Description Framework (RDF 1.1) standard, proposed as a common data model for publishing and exchanging structured data on the Web. We first motivate the need for a data model like RDF. We then describe the types of terms used in RDF: the basic building blocks of the framework. We discuss how these terms can be combined to make coherent statements in the form of RDF triples, and how triples form graphs and datasets. Thereafter we discuss the RDF vocabulary: a built-in set of terms used for modeling more complex data, such as complex relations and ordered lists. Finally, we give an overview of the different syntaxes by which RDF can be serialized and communicated.
This chapter presents an in-depth primer for the RDF Schema (RDFS 1.1) standard, which is primarily used to define a lightweight semantics for the classes and properties used in RDF graphs. After an initial motivation and overview, we discuss the RDFS vocabulary, and how it can be used to define sub-classes, sub-properties, domains and ranges, amongst other types of definitions. We then describe in detail how the semantics of RDF(S) can be formalised in a model-theoretic way, discussing key concepts such as interpretations, models, satisfiability and entailment. We introduce various semantics for RDF(S), including the simple semantics, D semantics, RDF semantics, and the RDFS semantics. We conclude the chapter by discussing how rules can be used to support entailment under such semantics.
This chapter motivates, introduces and describes Linked Data, which centres around a concise set of principles by which data can be published and interlinked on the Web, and by which a Web of Data can ultimately be formed. We first discuss the core Linked Data principles, which espouse the use of HTTP IRIs to identify the entities described in data, returning a machine-readable description of the entity (typically RDF) when its corresponding IRI is looked up on the Web. We then discuss some further best practices for publishing data conformant with the Linked Data principles in a way that enhances interoperability. We discuss the Linking Open Data (LOD) project founded on the idea of publishing Open Data on the Web in a standard, machine-readable fashion using Linked Data; we describe the most prominent datasets and vocabularies that have resulted from this initiative. We then discuss tools and techniques for converting legacy data to RDF, discovering links, and hosting Linked Data. We subsequently discuss the Linked Data Platform: a standard that outlines the protocols and resources needed to build a new generation of read–write Linked Data applications. We conclude the chapter with a discussion of open challenges yet to be addressed in the context of Linked Data.
In this work, we focus on the Visual Question Answering (VQA) task, where a model must answer a question based on an image, and the VQA-Explanations task, where an explanation is produced to support the answer. We introduce an interpretable model capable of pointing out and consuming information from a novel Knowledge Base (KB) composed of real-world relationships between objects, along with labels mined from available region descriptions and object annotations. Furthermore, this model provides visual and textual explanations to complement the KB visualization. The use of a KB brings two important consequences: enhanced predictions and improved interpretability. We achieve this by introducing a mechanism that can extract relevant information from this KB, and can point out the relations best suited for predicting the answer. A supervised attention map is generated over the KB to select the relevant relationships from it for each question-image pair. Moreover, we add image attention supervision on the explanations module to generate better visual and textual explanations. We quantitatively show that the predicted answers improve when using the KB; similarly, explanations improve with this and when adding image attention supervision. Also, we qualitatively show that the KB attention helps to improve interpretability and enhance explanations. Overall, the results support the benefits of having multiple tasks to enhance the interpretability and performance of the model.
The Entity Linking (EL) task involves linking mentions of entities in a text with their identifier in a Knowledge Base (KB) such as Wikipedia, BabelNet, DBpedia, Freebase, Wikidata, YAGO, etc. Numerous techniques have been proposed to address this task down through the years. However, not all works adopt the same convention regarding the entities that the EL task should target; for example, while some EL works target common entities like “interview” appearing in the KB, others only target named entities like “Michael Jackson”. The lack of consensus on this issue (and others) complicates research on the EL task; for example, how can the performance of EL systems be evaluated and compared when systems may target different types of entities? In this work, we first design a questionnaire to understand what kinds of mentions and links the EL research community believes should be targeted by the task. Based on these results we propose a fine-grained categorization scheme for EL that distinguishes different types of mentions and links. We propose a vocabulary extension that allows such categories to be expressed in EL benchmark datasets. We then relabel (subsets of) three popular EL datasets according to our novel categorization scheme, where we additionally discuss a tool used to semi-automate the labeling process. We next present the performance results of five EL systems for individual categories. We further extend EL systems with Word Sense Disambiguation and Coreference Resolution components, creating initial versions of what we call Fine-Grained Entity Linking (FEL) systems, measuring the impact on performance per category. Finally, we propose a configurable performance measure based on fuzzy sets that can be adapted for different application scenarios. Our results highlight a lack of consensus on the goals of the EL task, show that the evaluated systems do indeed target different entities, and further reveal some open challenges for the (F)EL task regarding more complex forms of reference for entities.
This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable. Our method can be used in conjunction with many networks; in particular, we study its application to the widely known MAC architecture, obtaining a significant reduction in the number of recurrent steps needed to achieve similar accuracies, thereby improving its performance-to-computation ratio. Furthermore, we show that by increasing the maximum number of steps used, we surpass the accuracy of even our best non-adaptive MAC on the CLEVR dataset, demonstrating that our approach is able to control the number of steps without significant loss of performance. Additional advantages provided by our approach include considerably improving interpretability by discarding useless steps and providing more insights into the underlying reasoning process. Finally, we present adaptive computation as an equivalent to an ensemble of models, similar to a mixture-of-experts formulation. Both the code and the configuration files for our experiments are made available to support further research in this area.
“Los gobiernos socialdemócratas en Chile” in “¿Giro a la izquierda o viraje al centro? Argentina y el Cono Sur, entre la continuidad y el cambio”.
Technical Report: “PLAN ESTRATÉGICO DE LA LOGÍSTICA URBANO-PORTUARIA Mejores ciudades portuarias en el Área Metropolitana de Concepción”.
We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.
CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models which consists in using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function, which induces, on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude fewer parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.
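For reference, the prototypical group-sparse penalty behind such structural regularizers is the group-Lasso norm. Writing W_g for the parameters of group g in a given layer and G for the set of groups (the notation is ours, and the actual CompactNets regularizer is richer than this), it takes the form

\[
\Omega(W) \;=\; \lambda \sum_{g \in G} \sqrt{|g|}\, \lVert W_g \rVert_2 ,
\]

which drives entire groups W_g to zero, so that each layer ends up using only a small number of the patterns offered by the layer below.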
Societies and industries are rapidly changing due to the adoption of artificial intelligence (AI) and will face deep transformations in upcoming years. In this scenario, it becomes critical for under-represented communities in technology, in particular developing countries like Latin America, to foster initiatives that are committed to developing tools for the local adoption of AI. Latin America, as well as many non-English speaking regions, face several problems for the adoption of AI technology, including the lack of diverse and representative resources for automated learning tasks. A highly problematic area in this regard is natural language processing (NLP), which is strongly dependent on labeled datasets for learning. However, most state-of-the-art NLP resources are allocated to English. Therefore, creating efficient NLP tools for diverse languages requires an important investment of time and financial resources. To deal with such issues, our group has worked toward creating language-agnostic approaches as well as adapting and improving existing NLP techniques to local problems. In addition, we have focused on producing new state-of-the-art NLP publicly available data and models in Spanish. Next, we briefly present some of them.
More than two decades have passed since the establishment of the initial cornerstones of the Semantic Web. Since its inception, opinions have remained divided regarding the past, present and potential future impact of the Semantic Web. In this paper – and in light of the results of over two decades of development on both the Semantic Web and related technologies – we reflect on the current status of the Semantic Web, the impact it has had thus far, and future challenges. We first review some of the external criticism of this vision that has been put forward by various authors; we draw together the individual critiques, arguing both for and against each point based on the current state of adoption. We then present the results of a questionnaire that we have posed to the Semantic Web mailing list in order to understand respondents’ perspective(s) regarding the degree to which the original Semantic Web vision has been realised, the impact it can potentially have on the Web (and other settings), its success stories thus far, as well as the degree to which they agree with the aforementioned critiques of the Semantic Web in terms of both its current state and future feasibility. We conclude by reflecting on future challenges and opportunities in the area.
“Gobernanza en ciudades portuarias: Aprendizajes desde el Área Metropolitana de Concepción” (Governance in Port Cities: Lessons from the Concepción Metropolitan Area).
This section analyzes the new challenges, analytical approaches, and methodologies for the study of different political elites. Drawing on the existing literature and a review of Latin American cases, these articles reflect on various research perspectives that take governmental, legislative, and ministerial elites as the dependent variables or phenomena to be explained.
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
News media strongly influence how we picture public affairs across the world, playing a significant and sometimes controversial role in determining which topics are at the centre of public attention and action. Setting the Agenda, first published in 2004, has become the go-to textbook on this crucial topic. In this timely third edition, Maxwell McCombs – a pioneer of agenda-setting research – and Sebastián Valenzuela – a senior scholar of agenda setting in Latin America – have expanded and updated the book for a new generation of students. In describing the media’s influence on what we think about and how we think about it, Setting the Agenda also examines the sources of media agendas, the psychological explanation for their impact on the public agenda, and their consequences for attitudes, opinions and behaviours. New to this edition is a discussion of agenda setting in the widened media landscape, including a full chapter on network agenda setting and a lengthened presentation on agenda melding. The book also contains expanded material on social media and the role of agenda setting beyond the realm of public affairs, as well as a foreword from Donald L. Shaw and David H. Weaver, the co-founders of agenda-setting theory. This exciting new edition is an invaluable source for students of media, communications and politics, as well as those interested in the role of news in shaping and directing public opinion.
We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2ExpTime-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest.
Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well-understood, the dual notion of overapproximations (i.e., approximations that return complete – but not necessarily sound – answers), and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations has been open. This article establishes a connection between overapproximations and existential pebble games that allows for studying such problems systematically. Building on this connection, it is shown that the evaluation and identification problem for overapproximations can be solved in polynomial time. While the general existence problem remains open, the problem is shown to be decidable in 2EXPTIME over the class of acyclic CQs and in PTIME for Boolean CQs over binary schemata. Additionally, we propose a more liberal notion of overapproximations to remedy the known shortcoming that queries might not have an overapproximation, and study how queries can be overapproximated in the presence of tuple-generating and equality-generating dependencies. The techniques are then extended to symmetric difference approximations and used to provide several complexity results for the identification, existence, and evaluation problem for this type of approximation.
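As a minimal illustration of the notion of overapproximation (an example of ours, not taken from the article): the Boolean triangle query q = ∃x,y,z (R(x,y) ∧ R(y,z) ∧ R(z,x)) is cyclic, but the acyclic path query q' = ∃x,y,z,w (R(x,y) ∧ R(y,z) ∧ R(z,w)) is an overapproximation of it, since the path maps homomorphically into the triangle and therefore every database satisfying q also satisfies q'. The converse fails (a database consisting of a single simple path satisfies q' but not q), so the answers of q' are complete but not necessarily sound with respect to q, while being computable in polynomial time.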
Ontology-mediated querying and querying in the presence of constraints are two key database problems where tuple-generating dependencies (TGDs) play a central role. In ontology-mediated querying, TGDs can formalize the ontology and thus derive additional facts from the given data, while in querying in the presence of constraints, they restrict the set of admissible databases. In this work, we study the limits of efficient query evaluation in the context of the above two problems, focusing on guarded and frontier-guarded TGDs and on UCQs as the actual queries. We show that a class of ontology-mediated queries (OMQs) based on guarded TGDs can be evaluated in FPT iff the OMQs in the class are equivalent to OMQs in which the actual query has bounded treewidth, up to some reasonable assumptions. For querying in the presence of constraints, we consider classes of constraint-query specifications (CQSs) that bundle a set of constraints with an actual query. We show a dichotomy result for CQSs based on guarded TGDs that parallels the one for OMQs except that, additionally, FPT coincides with PTime combined complexity. The proof is based on a novel connection between OMQ and CQS evaluation. Using a direct proof, we also show a similar dichotomy result, again up to some reasonable assumptions, for CQSs based on frontier-guarded TGDs with a bounded number of atoms in TGD heads. Our results on CQSs can be viewed as extensions of Grohe’s well-known characterization of the tractable classes of CQs (without constraints). Like Grohe’s characterization, all the above results assume that the arity of relation symbols is bounded by a constant. We also study the associated meta problems, i.e., whether a given OMQ or CQS is equivalent to one in which the actual query has bounded treewidth.
This work deals with the problem of semantic optimization of the central class of conjunctive queries (CQs). Since CQ evaluation is NP-complete, a long line of research has focussed on identifying fragments of CQs that can be efficiently evaluated. One of the most general restrictions corresponds to generalized hypertreewidth bounded by a fixed constant k ≥ 1; the associated fragment is denoted GHWk. A CQ is semantically in GHWk if it is equivalent to a CQ in GHWk. The problem of checking whether a CQ is semantically in GHWk has been studied in the constraint-free case, and it has been shown to be NP-complete. However, in case the database is subject to constraints such as tuple-generating dependencies (TGDs) that can express, e.g., inclusion dependencies, or equality-generating dependencies (EGDs) that capture, e.g., key dependencies, a CQ may turn out to be semantically in GHWk under the constraints, while not being semantically in GHWk without the constraints. This opens avenues to new query optimization techniques. In this article, we initiate and develop the theory of semantic optimization of CQs under constraints. More precisely, we study the following natural problem: Given a CQ and a set of constraints, is the query semantically in GHWk, for a fixed k ≥ 1, under the constraints, or, in other words, is the query equivalent to one that belongs to GHWk over all those databases that satisfy the constraints? We show that, contrary to what one might expect, decidability of CQ containment is a necessary but not a sufficient condition for the decidability of the problem in question. In particular, we show that checking whether a CQ is semantically in GHW1 is undecidable in the presence of full TGDs (i.e., Datalog rules) or EGDs. In view of the above negative results, we focus on the main classes of TGDs for which CQ containment is decidable and that do not capture the class of full TGDs, i.e., guarded, non-recursive, and sticky sets of TGDs, and show that the problem in question is decidable, while its complexity coincides with the complexity of CQ containment. We also consider key dependencies over unary and binary relations, and we show that the problem in question is decidable in elementary time. Furthermore, we investigate whether being semantically in GHWk alleviates the cost of query evaluation. Finally, in case a CQ is not semantically in GHWk, we discuss how it can be approximated via a CQ that falls in GHWk in an optimal way. Such approximations might help finding “quick” answers to the input query when exact evaluation is intractable.
Article “Chile’s new interdisciplinary institute for foundational research on data” in Communications of the ACM.
This study proposes a three-way interaction model that examines how (1) partisan selective exposure to political information on social media, (2) information processing, and (3) ideology influenced support for Hillary Clinton and Donald Trump for president. Findings indicate that processing election information systematically affected support for Clinton among those who were exposed to diverse information; otherwise, heuristics were the main cue to process political information. Conservatives supporting Trump relied on heuristic processing and avoided information that challenged their beliefs. Liberals, in contrast, were more likely to systematically process election information, but the effect was significant only for those who exposed themselves to diverse information. As such, systematic processing might not make a difference in highly polarized environments, where strong partisans are unlikely to engage with different viewpoints and expose themselves to diverse information.
This study places the “cognitive elaboration model” on news gathering and political behavior within the dual-processing “elaboration likelihood model” to derive hypotheses about the effects of incidental news exposure and tests them using two-wave panel data. Results indicate incidental news exposure predicts online participation but not offline participation – underlining the importance of differentiating between political behaviors in the two environments. The key finding, however, is that news elaboration mediates the positive relationship between incidental exposure and political participation, which is theorized as taking place through the peripheral route of elaboration – as opposed to intentional exposure, which engages the central route.
This article presents an innovation project that aims to contribute, from an ICT perspective, to the needs of the health sector, specifically regarding interoperability and the generation of information from distributed sources. To this end, the technological objective is to develop a standardized interoperable repository and intelligent applications, at the prototype level, that are feasible to deploy at scale in order to contribute to the timely care of patients through information generated by processing radiological images and the reports associated with clinical records.
Text classification is a fairly well-explored task that has allowed dealing with a considerable number of problems. However, one of its main difficulties is conducting a learning process on data with class imbalance, i.e., datasets with only a few examples in some classes, which often represent the most interesting cases for the task. In this context, text classifiers overfit some particular classes, showing poor performance. To address this problem, we propose a scheme that combines the outputs of different classifiers, coding them in the encoder of a transformer. The encoder is also fed a BERT encoding of each example, so it learns a joint representation of the text and the outputs of the classifiers. These encodings are used to train a new text classifier. Since the transformer is a highly complex model, we introduce a data augmentation technique that allows the representation learning task to be driven without over-fitting the encoding to a particular class. The data augmentation technique also allows for producing a balanced dataset. The combination of both methods, representation learning and data augmentation, improves the performance of the trained classifiers. Results on benchmark data for two text classification tasks (stance classification and online harassment detection) show that the proposed scheme outperforms all of its direct competitors.
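To make the combination scheme concrete, the following sketch (ours, with assumed tensor shapes and hyperparameters, not the authors' code) embeds each baseline's output distribution as one token, prepends a projected BERT encoding of the text as another token, and lets a Transformer encoder produce the joint representation that feeds a final classifier.

    import torch
    import torch.nn as nn

    class CombinerEncoder(nn.Module):
        """Hypothetical sketch: encode baseline outputs together with a BERT text encoding."""
        def __init__(self, n_baselines, n_classes, bert_dim=768, d_model=128, nhead=4, nlayers=2):
            super().__init__()
            self.baseline_proj = nn.Linear(n_classes, d_model)    # one token per baseline output distribution
            self.text_proj = nn.Linear(bert_dim, d_model)         # one token for the BERT encoding of the text
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, baseline_probs, bert_encoding):
            # baseline_probs: (batch, n_baselines, n_classes); bert_encoding: (batch, bert_dim)
            tokens = torch.cat([self.text_proj(bert_encoding).unsqueeze(1),
                                self.baseline_proj(baseline_probs)], dim=1)
            joint = self.encoder(tokens)                          # joint representation of text and classifier outputs
            return self.classifier(joint[:, 0])                   # predict from the text token

    # Toy usage with random tensors standing in for real baseline outputs and BERT encodings.
    model = CombinerEncoder(n_baselines=3, n_classes=4)
    logits = model(torch.softmax(torch.randn(8, 3, 4), dim=-1), torch.randn(8, 768))
    print(logits.shape)   # torch.Size([8, 4])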
Respondent-Driven sampling (RDS) is a sampling method devised to overcome challenges with sampling hard-to-reach human populations. The sampling starts with a limited number of individuals who are asked to recruit a small number of their contacts. Every surveyed individual is subsequently given the same opportunity to recruit additional members of the target population until a pre-established sample size is achieved. The recruitment process consequently implies that the survey respondents are responsible for deciding who enters the study. Most RDS prevalence estimators assume that participants select among their contacts completely at random. The main objective of this work is to correct the inference for departure from this assumption, such as systematic recruitment based on the characteristics of the individuals or based on the nature of relationships. To accomplish this, we introduce three forms of non-random recruitment, provide estimators for these recruitment behaviors and extend three estimators and their associated variance procedures. The proposed methodology is assessed through a simulation study capturing various sampling and network features. Finally, the proposed methods are applied to a public health setting.
We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query q, we consider the following two problems: given as input an incomplete database D, (a) return the number of completions of D that satisfy q; or (b) return the number of valuations of the nulls of D yielding a completion that satisfies q. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when q is a self-join-free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in D (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations: for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption. Moreover, we find that both (1) and (2) can reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial-time randomized approximation scheme (FPRAS), in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes.
We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query q, we consider the following two problems: given as input an incomplete database D, (a) return the number of completions of D that satisfy q; or (b) return the number of valuations of the nulls of D yielding a completion that satisfies q. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when q is a self-join-free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in D (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations (for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption). Moreover, we find that both (1) and (2) reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial randomized approximation scheme, in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes.
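A toy brute-force computation (ours, not from the paper) makes the two counting problems concrete: the number of satisfying valuations can exceed the number of satisfying completions, because different valuations may yield the same completion.

    from itertools import product

    # Incomplete database: a unary relation R with one constant and two labeled nulls (each null occurs once).
    R = ["a", "n1", "n2"]
    nulls = ["n1", "n2"]
    domain = ["a", "b"]                       # assume the same finite domain for every null

    def completion(valuation):
        return frozenset(valuation.get(v, v) for v in R)

    def q(db):                                # Boolean query: does R contain the value "b"?
        return "b" in db

    valuations = [dict(zip(nulls, vals)) for vals in product(domain, repeat=len(nulls))]
    sat_valuations = [v for v in valuations if q(completion(v))]
    sat_completions = {completion(v) for v in sat_valuations}

    print(len(sat_valuations))                # 3 valuations yield a completion that satisfies q
    print(len(sat_completions))               # but they all collapse to 1 distinct satisfying completion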
Scores based on Shapley values are widely used for providing explanations to classification results over machine learning models. A prime example of this is the influential SHAP-score, a version of the Shapley value that can help explain the result of a learned model on a specific entity by assigning a score to every feature. While in general computing Shapley values is a computationally intractable problem, it has recently been claimed that the SHAP-score can be computed in polynomial time over the class of decision trees. In this paper, we provide a proof of a stronger result over Boolean models: the SHAP-score can be computed in polynomial time over deterministic and decomposable Boolean circuits. Such circuits, also known as tractable Boolean circuits, generalize a wide range of Boolean circuits and binary decision diagrams classes, including binary decision trees, Ordered Binary Decision Diagrams (OBDDs) and Free Binary Decision Diagrams (FBDDs). We also establish the computational limits of the notion of SHAP-score by observing that, under a mild condition, computing it over a class of Boolean models is always polynomially as hard as the model counting problem for that class. This implies that both determinism and decomposability are essential properties for the circuits that we consider, as removing one or the other renders the problem of computing the SHAP-score intractable (namely, #P-hard).
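For readers unfamiliar with the score itself, the following brute-force sketch (ours; the toy model and the uniform product distribution over Boolean features are assumptions, not the paper's circuits) computes the SHAP-score of each feature for one entity by enumerating all feature subsets. It is exponential and only meant to fix the definition; the paper's contribution is showing when this can be done in polynomial time.

    from itertools import product, combinations
    from math import factorial

    features = ["x1", "x2", "x3"]
    entity = {"x1": 1, "x2": 0, "x3": 1}
    f = lambda x: int(x["x1"] and (x["x2"] or x["x3"]))     # toy Boolean classifier, not a circuit from the paper

    def expected_value(fixed):
        """E[f] over uniform, independent Boolean features, with the features in `fixed` frozen to the entity's values."""
        free = [g for g in features if g not in fixed]
        total = 0.0
        for bits in product([0, 1], repeat=len(free)):
            x = {g: entity[g] for g in fixed}
            x.update(dict(zip(free, bits)))
            total += f(x)
        return total / 2 ** len(free)

    def shap_score(i):
        """Exact Shapley value of feature i for this entity, by enumerating all subsets of the other features."""
        others = [g for g in features if g != i]
        n, score = len(features), 0.0
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                score += weight * (expected_value(set(S) | {i}) - expected_value(set(S)))
        return score

    for g in features:
        print(g, round(shap_score(g), 4))     # the three scores sum to f(entity) - E[f]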
Descriptive Complexity has been very successful in characterizing complexity classes of decision problems in terms of the properties definable in some logics. However, descriptive complexity for counting complexity classes, such as FP and #P, has not been systematically studied, and it is not as developed as its decision counterpart. In this paper, we propose a framework based on Weighted Logics to address this issue. Specifically, by focusing on the natural numbers we obtain a logic called Quantitative Second Order Logics (QSO), and show how some of its fragments can be used to capture fundamental counting complexity classes such as FP, #P and FPSPACE, among others. We also use QSO to define a hierarchy inside #P, identifying counting complexity classes with good closure and approximation properties, and which admit natural complete problems. Finally, we add recursion to QSO, and show how this extension naturally captures lower counting complexity classes such as #L.
This study takes advantage of a forceful legal 12-hour deprivation of access to WhatsApp messaging service nationwide in Brazil on 18 December 2015. Right after the blockage, we ran a survey to capture the reaction of Brazilians that were cut off from the App, aiming to understand which factors point to the capacity to successfully circumvent the blockage. Anxiety, digital skills and gender were found to be related to success, while isolation and age were not. Furthermore, a cluster analysis of 306 respondents that attempted to bypass the blockage identified four groups that summarize the reaction patterns in face of the blockage: the deprived, the challengers, the addicted and the elite. We discuss the possible implications of the findings for the field.
Despite the growing scholarship on investigative journalism in Latin America, very few studies have addressed collaboration across newsrooms in the region. By analyzing the responses of 251 journalists who work for investigative units in Latin American news outlets, this study explores a) the reasons why Latin American journalists are increasingly seeking to participate in national and transnational collaborative enterprises, b) the challenges they identify, and c) the role digital technologies are playing in this trend of transnational collaboration. Using mixed methods, we found that collaborations occur to enhance the impact of investigative projects, to reach larger audiences, and to achieve a big-picture coverage. We also found that safety is an important motivation to work in conjunction with other newsrooms—by collaborating, journalists are able to strengthen security measures and challenge censorship. Yet, coordinating teams—especially at the transnational level—remains the biggest challenge to overcome. Digital technologies are significantly related to reporters’ likelihood of collaborating, but these technologies require other reporting skills to be useful for investigative journalism. Implications for research and practice are discussed.
This study observes two relevant issues in today’s media ecosystem: incivility in online news comments and media bias during election periods. By analyzing 84 stories and 4670 comments published during the 2017 presidential election in Chile, we observed the extent to which news commenters addressed political figures using uncivil discourse, and the extent to which incivility and media bias were related in comments discussing the election. Results indicate incivility in comment sections of Chilean news outlets is higher than that found in the Global North, and the levels of uncivil speech are even higher when the conversation mentions female politicians, especially former president Michelle Bachelet. We also found a relationship between media bias and user bias—stories positively biased toward current president Sebastián Piñera were associated with more positive comments about him. Implications and future research are discussed.
In this work, we provide some insights and develop some ideas, with few technical details, about the role of explanations in Data Quality in the context of data-based machine learning models (ML). In this direction, there are, as expected, roles for causality and explainable artificial intelligence. The latter area not only sheds light on the models, but also on the data that support model construction. There is also room for defining, identifying, and explaining errors in data, in particular in ML, and also for suggesting repair actions. More generally, explanations can be used as a basis for defining dirty data in the context of ML, and for measuring or quantifying it. We think of dirtiness as relative to the ML task at hand, e.g., classification.
We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain.
We propose answer-set programs that specify and compute counterfactual interventions as a basis for causality-based explanations to decisions produced by classification models. They can be applied with black-box models and models that can be specified as logic programs, such as rule-based classifiers. The main focus is on the specification and computation of maximum responsibility causal explanations. The use of additional semantic knowledge is investigated.
Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreed-upon theoretical framework for dealing with JSON. Therefore, in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language that allows us to navigate through JSON documents, study the complexity of basic computational tasks associated with this language, and compare its expressive power with practical languages for managing JSON data.
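As a rough illustration of the kind of navigation such a language supports (the step syntax below is hypothetical, ours, and not the language defined in the paper), a few lines of Python suffice to evaluate key, index, and descendant steps over a JSON document.

    import json

    def navigate(value, steps):
        """Evaluate a sequence of navigation steps over a JSON value.
        Steps (hypothetical syntax): ("key", k), ("index", i), ("descendants",)."""
        current = [value]
        for step in steps:
            nxt = []
            for v in current:
                if step[0] == "key" and isinstance(v, dict) and step[1] in v:
                    nxt.append(v[step[1]])
                elif step[0] == "index" and isinstance(v, list) and step[1] < len(v):
                    nxt.append(v[step[1]])
                elif step[0] == "descendants":
                    stack = [v]                       # depth-first collection of all nested values
                    while stack:
                        w = stack.pop()
                        nxt.append(w)
                        if isinstance(w, dict):
                            stack.extend(w.values())
                        elif isinstance(w, list):
                            stack.extend(w)
            current = nxt
        return current

    doc = json.loads('{"name": "store", "items": [{"id": 1, "tags": ["a"]}, {"id": 2}]}')
    print(navigate(doc, [("key", "items"), ("index", 0), ("key", "id")]))   # [1]
    print(len(navigate(doc, [("descendants",)])))                           # all nested values, including the root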
We investigate gradual variations on the Calculus of Inductive Construction (CIC) for swifter prototyping with imprecise types and terms. We observe, with a no-go theorem, a crucial tradeoff between graduality and the key properties of normalization and closure of universes under dependent product that CIC enjoys. Beyond this Fire Triangle of Graduality, we explore the gradualization of CIC with three different compromises, each relaxing one edge of the Fire Triangle. We develop a parametrized presentation of Gradual CIC (GCIC) that encompasses all three variations, and develop their metatheory. We first present a bidirectional elaboration of GCIC to a dependently-typed cast calculus, CastCIC, which elucidates the interrelation between typing, conversion, and the gradual guarantees. We use a syntactic model of CastCIC to inform the design of a safe, confluent reduction, and establish, when applicable, normalization. We study the static and dynamic gradual guarantees as well as the stronger notion of graduality with embedding-projection pairs formulated by New and Ahmed, using appropriate semantic model constructions. This work informs and paves the way towards the development of malleable proof assistants and dependently-typed programming languages.
Article “Three Success Stories About Compact Data Structures” in Communications of the ACM, Latin America Region Special Section: Hot Topics.
Every year physicians face an increasing demand of image-based diagnosis from patients, a problem that can be addressed with recent artificial intelligence methods. In this context, we survey works in the area of automatic report generation from medical images, with emphasis on methods using deep neural networks, with respect to: (1) Datasets, (2) Architecture Design, (3) Explainability and (4) Evaluation Metrics. Our survey identifies interesting developments, but also remaining challenges. Among them, the current evaluation of generated reports is especially weak, since it mostly relies on traditional Natural Language Processing (NLP) metrics, which do not accurately capture medical correctness.
The video game industry has adopted recommendation systems to boost users’ interest, with a focus on game sales. Other exciting applications within video games are those that help the player make decisions that maximize their playing experience, a desirable feature in real-time strategy video games such as Multiplayer Online Battle Arena (MOBA) games like DotA and LoL. Among these tasks, the recommendation of items is challenging, given both the contextual nature of the game and its dependence on the formation of each team. Existing works on this topic do not take advantage of all the available contextual match data and dismiss potentially valuable information. To address this problem we develop TTIR, a contextual recommender model derived from the Transformer neural architecture that suggests a set of items to every team member, based on the contexts of teams and roles that describe the match. TTIR outperforms several approaches and provides interpretable recommendations through visualization of attention weights. Our evaluation indicates that both the Transformer architecture and the contextual information are essential to get the best results for this item recommendation task. Furthermore, a preliminary user survey indicates the usefulness of attention weights for explaining recommendations, as well as ideas for future work. The code and dataset are available at https://github.com/ojedaf/IC-TIR-Lol .
Knowledge Graphs can be considered as fulfilling an early vision in Computer Science of creating intelligent systems that integrate knowledge and data at large scale. Stemming from scientific advancements in research areas such as the Semantic Web, Databases, Knowledge Representation, NLP, and Machine Learning, among others, Knowledge Graphs have rapidly gained popularity in academia and industry in recent years. The integration of such disparate disciplines and techniques gives Knowledge Graphs their richness, but also presents a challenge to practitioners and theoreticians: knowing how current advances developed from early techniques in order, on the one hand, to take full advantage of them and, on the other, to avoid reinventing the wheel. This tutorial provides a historical context on the roots of Knowledge Graphs grounded in the advancements of Logic, Data, and the combination thereof.
As the adoption of knowledge graphs grows, more and more non-experts users need to be able to explore and query such graphs. These users are not typically familiar with graph query languages such as SPARQL, and may not be familiar with the knowledge graph’s structure. In this extended abstract, we provide a summary of our work on a language and visual interface — called RDF Explorer — that help non-expert users to navigate and query knowledge graphs. A usability study over Wikidata shows that users successfully complete more tasks with RDF Explorer than with the existing Wikidata Query Helper interface.
Detection of positive selection signatures in populations around the world is helping to uncover recent human evolutionary history as well as the genetic basis of diseases. Most human evolutionary genomic studies have been performed in European, African, and Asian populations. However, populations with Native American ancestry have been largely underrepresented. Here, we used a genome-wide local ancestry enrichment approach complemented with neutral simulations to identify postadmixture adaptations undergone by admixed Chileans through gene flow from Europeans into local Native Americans. The top significant hits (P = 2.4×10^−7) are variants in a region on chromosome 12 comprising multiple regulatory elements. This region includes rs12821256, which regulates the expression of KITLG, a well-known gene involved in lighter hair and skin pigmentation in Europeans as well as in thermogenesis. Another variant from that region is associated with the long noncoding RNA RP11-13A1.1, which has been specifically involved in the innate immune response against infectious pathogens. Our results suggest that these genes were relevant for adaptation in Chileans following the Columbian exchange.
We present the first version of the ALeRCE (Automatic Learning for the Rapid Classification of Events) broker light curve classifier. ALeRCE is currently processing the Zwicky Transient Facility (ZTF) alert stream, in preparation for the Vera C. Rubin Observatory. The ALeRCE light curve classifier uses variability features computed from the ZTF alert stream, and colors obtained from AllWISE and ZTF photometry. We apply a Balanced Random Forest algorithm with a two-level scheme, where the top level classifies each source as periodic, stochastic, or transient, and the bottom level further resolves each of these hierarchical classes, amongst 15 total classes. This classifier corresponds to the first attempt to classify multiple classes of stochastic variables (including core- and host-dominated active galactic nuclei, blazars, young stellar objects, and cataclysmic variables) in addition to different classes of periodic and transient sources, using real data. We created a labeled set using various public catalogs (such as the Catalina Surveys and Gaia DR2 variable stars catalogs, and the Million Quasars catalog), and we classify all objects with ≥6 g-band or ≥6 r-band detections in ZTF (868,371 sources as of 2020/06/09), providing updated classifications for sources with new alerts every day. For the top level we obtain macro-averaged precision and recall scores of 0.96 and 0.99, respectively, and for the bottom level we obtain macro-averaged precision and recall scores of 0.57 and 0.76, respectively.
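The two-level scheme can be pictured with a schematic sketch (ours, on synthetic features; it uses scikit-learn's plain RandomForestClassifier in place of the Balanced Random Forest used by ALeRCE): a top-level model separates periodic, stochastic, and transient sources, and a per-branch model resolves the subclass.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))                                  # stand-in for variability/color features
    top = rng.choice(["periodic", "stochastic", "transient"], size=300)
    sub = np.array([t + "_" + str(rng.integers(5)) for t in top])  # toy subclasses inside each top-level class

    top_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, top)
    bottom_clfs = {t: RandomForestClassifier(n_estimators=50, random_state=0).fit(X[top == t], sub[top == t])
                   for t in np.unique(top)}

    def classify(x):
        t = top_clf.predict(x.reshape(1, -1))[0]            # level 1: periodic / stochastic / transient
        return bottom_clfs[t].predict(x.reshape(1, -1))[0]  # level 2: resolve the class within that branch

    print(classify(X[0]))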
This paper presents the methods that participated in the SHREC’20 contest on retrieval of surface patches with similar geometric reliefs and the analysis of their performance over the benchmark created for this challenge. The goal of the contest is to verify the possibility of retrieving 3D models based only on the reliefs that are present on their surface, and to compare methods that are suitable for this task. This problem is related to many real-world applications, such as the classification of cultural heritage goods or the analysis of different materials. To address this challenge, it is necessary to characterize the local “geometric pattern” information, possibly disregarding model size and bending. Seven groups participated in this contest and twenty runs were submitted for evaluation. The performance of the methods reveals that good results are achieved with a number of techniques that use different approaches.
As an essential high-level task within the topic of video understanding, automatically describing a video with natural language has recently gained attention as a fundamental challenge in computer vision. Previous models for video captioning have several limitations, such as gaps in current semantic representations and the limited expressiveness of the generated captions. To deal with these limitations, in this paper we present a new architecture that we call the Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text generation process. The adaptive gate allows the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. Besides, the model is capable of learning to improve the expressiveness of generated captions by attending to their length, using a sentence-length-related loss function. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance with several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.
Memes are becoming a useful source of data for analyzing behavior on social media. However, a problem to tackle is how to correctly identify a meme. As the number of memes published every day on social media is huge, there is a need for automatic methods for classifying and searching in large meme datasets. This paper proposes and compares several methods for automatically classifying images as memes. Also, we propose a method that allows us to implement a system for retrieving memes from a dataset using a textual query. We experimentally evaluate the methods using a large dataset of memes collected from Twitter users in Chile, which was annotated by a group of experts. Though some of the evaluated methods are effective, there is still room for improvement.
The analysis of network robustness tackles the problem of studying how a complex network behaves under adverse scenarios, such as failures or attacks. In particular, the analysis of interdependent networks’ robustness focuses on the specific case of the robustness of interacting networks and their emergent behaviors. This survey systematically reviews the literature on frameworks that analyze the robustness of interdependent networks, published between 2005 and 2017. The review shows that there exists a broad range of interdependent network models, robustness metrics, and studies that can be used to understand the behavior of different systems under failure or attack. Regarding models, we found a focus on systems where a node in one layer interacts with exactly one node in another layer. In studies, we observed a focus on network percolation. Among the metrics, we observed a focus on measures that count network elements. Finally, for the networks used to test the frameworks, we found that the focus was on synthetic models rather than on the analysis of real network systems. This review suggests opportunities in network research, such as the study of robustness in interdependent networks with multiple interactions and/or spatially embedded networks, and the use of interdependent network models in realistic network scenarios.
Adolescence is a key period in the development of individuals’ news habits, but little is known about the processes involved in news media socialization. This study proposes an integrated model in which the influence of family communication on adolescents’ motivations and behaviors in relation to news consumption occurs through the development of personality traits related to information processing (namely, need for cognition and need to evaluate). Structural equation modeling of data from a representative survey of 2,273 adolescents, aged 13 to 17, provides support for the theorized model, such that concept-oriented communication within families is associated with news exposure indirectly, via personality traits and motivations. Thus, the study provides an initial assessment of one way children are socialized to become news enthusiasts or news avoiders. It also provides empirical evidence that information-processing traits are influenced by family communication patterns, confirming what hitherto was theoretical speculation.
We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix computation in suffix arrays. Our data structures yield relevant space-time tradeoffs in real-world dictionaries. We focus on two domains where string dictionaries are extensively used and efficient compression is required: URL collections, a key element in Web graphs and applications such as Web mining; and collections of URIs and literals, the basic components of RDF datasets. Our experiments show that our data structures achieve better compression than the state-of-the-art alternatives while providing very competitive query times.
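A plain (non-hierarchical) front-coding pass, shown below as a sketch of ours, conveys the core idea of exploiting shared prefixes in a sorted dictionary; the hierarchical arrangement and the longest-common-prefix machinery of the paper are not reproduced.

    def front_code(sorted_strings):
        """Encode a sorted string list as (lcp-with-previous, remaining-suffix) pairs."""
        encoded, prev = [], ""
        for s in sorted_strings:
            lcp = 0
            while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
                lcp += 1
            encoded.append((lcp, s[lcp:]))
            prev = s
        return encoded

    def decode(encoded):
        out, prev = [], ""
        for lcp, suffix in encoded:
            s = prev[:lcp] + suffix
            out.append(s)
            prev = s
        return out

    urls = sorted(["http://a.cl/", "http://a.cl/x", "http://a.cl/y", "http://b.cl/"])
    codes = front_code(urls)
    print(codes)                   # (0, 'http://a.cl/'), (12, 'x'), (12, 'y'), (7, 'b.cl/')
    assert decode(codes) == urls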
The analysis of painted pottery is instrumental for understanding ancient Greek society and the human behavior of past cultures in archaeology. A key part of this analysis is the discovery of cross references to establish links and correspondences. However, due to the vast amount of documented images and 3D scans of pottery objects in today’s domain repositories, manual search is very time consuming. Computer-aided retrieval methods are of increasing importance. Mostly, current retrieval systems for this kind of cultural heritage data only allow searching for pottery of similar vessel shape. However, in many cases important similarity cues are given by motifs painted on these vessels. We present an interactive retrieval system that makes use of this information to allow for a motif-driven search in cultural heritage repositories. We address the problem of unsupervised motif extraction for preprocessing and the shape-based similarity search for Greek painted pottery. Our experimental evaluation on relevant repository data demonstrates the effectiveness of our approach on examples of different motifs of interest.
Covering a set of points with k orthogonal bounding boxes is useful for implementing spatio-temporal index structures that are built from a given dataset. In this work, we deal with the problem of covering a set of points with k axis-parallel boxes, under the restriction that the total area enclosed by the boxes must be minimized. To achieve this, we present a novel algorithm that, using dynamic programming techniques, finds the optimal solution for covering a set of points with k bounding boxes such that the total sum of the areas of the boxes is minimum. This is compared with the process of generating k bounding boxes every l units of distance, achieving an improvement of about 50% in the amount of unnecessary area covered.
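The dynamic-programming idea can be sketched as follows, under the simplifying assumption (ours, not necessarily the paper's formulation) that each box covers a contiguous block of the points sorted by x-coordinate.

    from functools import lru_cache

    def min_total_area(points, k):
        """Cover all points with at most k axis-aligned boxes of minimum total area,
        assuming each box covers a contiguous block of the x-sorted points (simplifying assumption)."""
        pts = sorted(points)
        n = len(pts)

        def area(i, j):                    # area of the bounding box of pts[i..j]
            xs = [p[0] for p in pts[i:j + 1]]
            ys = [p[1] for p in pts[i:j + 1]]
            return (max(xs) - min(xs)) * (max(ys) - min(ys))

        @lru_cache(maxsize=None)
        def dp(i, boxes):                  # minimum total area to cover pts[i..n-1] with `boxes` boxes
            if i == n:
                return 0
            if boxes == 0:
                return float("inf")
            return min(area(i, j) + dp(j + 1, boxes - 1) for j in range(i, n))

        return dp(0, k)

    print(min_total_area([(0, 0), (1, 1), (2, 0), (10, 10), (11, 11)], 2))   # 3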
We introduce an efficient approach that aims at supporting content-based queries on the Chilean Virtual Observatory. In particular, we are interested in retrieving relevant information from virtual-observatory tables, which introduces several challenges that make the information-retrieval process harder. We define an algorithm that uses a compressed data structure to obtain the count of the number of occurrences of a string query within each column of a table. This kind of query has been used in the literature for faceted and semantic search, as well as for retrieving information from web tables, in order to improve search effectiveness. We show that, using only 15%-25% of the space of a table, our approach contains the table data (and hence the table can be deleted) and is able to answer queries efficiently, in a few milliseconds.
For a sorted set S from a universe [1..u] received under the streaming model (i.e., elements are received one at a time, in sorted order), such that at a given time it contains n elements {x_1, ..., x_n}, and whose characteristic bit vector is C_S = 0^{σ_1} 1···1 0^{σ_2} 1···1 ··· 0^{σ_g} 1···1 (i.e., the set elements are actually arranged in g ≤ n intervals of size ≥ 1), we propose a compact data structure that answers operations select and rank in Θ(lg(g / lg g)) worst-case time, and append in O(1) amortized time, using 2g lg((u−n)/g) + g lg(n/g) + o(g lg lg g) bits of space. The structure is suitable in cases where g ≤ n/2.
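A functional (non-compact) sketch of the same interface, written by us for illustration only, stores the runs explicitly and answers rank and select by binary search; it matches the semantics of the operations but not the space or time bounds above.

    import bisect

    class RunSet:
        """Sorted set stored as maximal runs [start..end]; append assumes elements arrive in increasing order."""
        def __init__(self):
            self.starts, self.ends = [], []
            self.prefix = [0]                              # prefix[i] = number of elements in runs 0..i-1

        def append(self, x):
            if self.ends and x == self.ends[-1] + 1:       # x extends the last run
                self.ends[-1] = x
                self.prefix[-1] += 1
            else:                                          # x opens a new run
                self.starts.append(x)
                self.ends.append(x)
                self.prefix.append(self.prefix[-1] + 1)

        def rank(self, x):                                 # number of set elements <= x
            j = bisect.bisect_right(self.starts, x) - 1
            if j < 0:
                return 0
            return self.prefix[j] + min(x, self.ends[j]) - self.starts[j] + 1

        def select(self, k):                               # k-th smallest element (1-based)
            j = bisect.bisect_left(self.prefix, k)
            return self.starts[j - 1] + (k - self.prefix[j - 1] - 1)

    s = RunSet()
    for x in [3, 4, 5, 9, 10]:
        s.append(x)
    print(s.rank(7), s.select(5))                          # 3 10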
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with an unambiguous identifier in a Knowledge Base. While much work has been done on the topic, we first present the results of a survey that reveal a lack of consensus in the community regarding what forms of mentions in a text and what forms of links the EL task should consider. We argue that no one definition of the Entity Linking task fits all, and rather propose a fine-grained categorization of different types of entity mentions and links. We then re-annotate three EL benchmark datasets – ACE2004, KORE50, and VoxEL – with respect to these categories. We propose a fuzzy recall metric to address the lack of consensus and conclude with fine-grained evaluation results comparing a selection of online EL systems.
CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This might lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons, and overfitting to data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models that consists of using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function, which induces, on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude fewer parameters than common CNN-based approaches. We show that CompactNets are able to outperform other group-sparse-based approaches in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.
https://doi.org/10.1016/j.cviu.2019.102841
The article presents SHACL2SPARQL, a tool that validates an RDF graph stored as a SPARQL endpoint against possibly recursive SHACL constraints. It is based on the algorithm proposed in [3]. This implementation improves upon the original algorithm with a wider range of natively supported constraint operators, SPARQL query optimization techniques, and a mechanism to explain invalid targets.
We address the problem of estimating when the results of an input SPARQL query over dynamic RDF datasets will change. We evaluate a framework that extracts features from the query and/or from past versions of the target dataset and inputs them into binary classifiers to predict whether or not the results for a query will change at a fixed point in the near future. For this evaluation, we create a gold standard based on 23 versions of Wikidata and a curated collection of 221 SPARQL queries. Our results show that the quality of predictions possible using (only) features based on the query structure and lightweight statistics of the predicate dynamics – though capable of beating a random baseline – is not competitive with results obtained using (more costly to derive) knowledge of the complete historical changes in the query results.
Link: http://ceur-ws.org/Vol-2496/paper1.pdf
Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists in distributing the data across several computers that partially solve the undertaken problem; these partial answers are then integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that works with disjoint partitions of data. Our algorithm produces a global clustering solution that achieves competitive performance relative to centralized approaches. The algorithm works effectively with high-dimensional data, making it advisable for document clustering tasks. Experimental results over five datasets show that our proposal is competitive in terms of quality performance measures when compared to state-of-the-art methods.
Stance classification is the task of automatically identifying a user’s position on a specific topic. Classifying stance may help to understand how people react to a piece of target information, a task that is of interest in areas such as advertising campaigns, brand analytics, and fake news detection, among others. The rise of social media has brought the classification of stance in online social networks into the focus of this task. A number of methods have been designed for this purpose, showing that the problem is hard and challenging. In this work, we explore how to use self-attention models for stance classification. Instead of using attention mechanisms to learn directly from the text, we use self-attention to combine different baselines’ outputs. For a given post, we use the transformer architecture to encode each baseline output, exploiting relationships between baselines and posts. The transformer then learns how to combine the outputs of these methods, reaching a consistently better classification than the ones provided by the baselines. We conclude that self-attention models are helpful for learning from baselines’ outputs in a stance classification task.
Let f : [1..n] → [1..n] be a function, and ℓ : [1..n] → [1..σ] indicate a label assigned to each element of the domain. We design several compact data structures that answer various kinds of summary queries on the labels of paths in f. For example, we can find either the minimum label in f^k(i) for a given i and any k ≥ 0 in a given range [k_1..k_2], or the minimum label in f^{−k}(i) for a given i and k > 0, using n lg n + n lg σ + o(n lg n) bits and time O(α(n)), the inverse Ackermann function. Within similar space we can count, in time O(lg n / lg lg n), the number of labels within a range, and report each element with such labels in O(lg n / lg lg n) additional time. Several other tradeoffs and possible queries are considered, such as selection, top-r queries and τ-majorities. Finally, we consider queries that allow us to navigate on the graph of the function, such as the nearest common successor of two elements, or the nearest successor or predecessor of an element within a range of labels.
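As a reference for what one of these queries computes, the brute-force routine below (ours) returns the minimum label among f^k(i) for k in [k_1..k_2] by simply walking the function graph; the compact structures of the paper answer this without the walk.

    def min_label_on_path(f, labels, i, k1, k2):
        """Minimum label among f^k(i) for k in [k1..k2], computed by walking the function graph.
        f and labels are 0-indexed lists standing in for f:[1..n]->[1..n] and l:[1..n]->[1..sigma]."""
        x = i
        for _ in range(k1):            # advance to f^{k1}(i)
            x = f[x]
        best = labels[x]
        for _ in range(k2 - k1):       # continue up to f^{k2}(i), tracking the minimum label
            x = f[x]
            best = min(best, labels[x])
        return best

    f = [1, 2, 0, 2]                   # a cycle 0 -> 1 -> 2 -> 0, plus node 3 feeding into it
    labels = [4, 1, 3, 2]
    print(min_label_on_path(f, labels, 3, 1, 3))   # visits 2, 0, 1 with labels 3, 4, 1 -> minimum 1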
SHACL (Shapes Constraint Language) is a specification for describing and validating RDF graphs that has recently become a W3C recommendation. While the language is gaining traction in the industry, algorithms for SHACL constraint validation are still at an early stage. A first challenge comes from the fact that RDF graphs are often exposed as SPARQL endpoints, and therefore only accessible via queries. Another difficulty is the absence of guidelines about the way recursive constraints should be handled. In this paper, we provide algorithms for validating a graph against a SHACL schema, which can be executed over a SPARQL endpoint. We first investigate the possibility of validating a graph through a single query for non-recursive constraints. Then for the recursive case, since the problem has been shown to be NP-hard, we propose a strategy that consists in evaluating a small number of SPARQL queries over the endpoint, and using the answers to build a set of propositional formulas that are passed to a SAT solver. Finally, we show that the process can be optimized when dealing with recursive but tractable fragments of SHACL, without the need for an external solver. We also present a proof-of-concept evaluation of this last approach.
Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they would become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version – BTC-2019 – containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level-domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to gain insights into the content of BTC-2019. We discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.
LINK: https://doi.org/10.1007/978-3-030-30796-7_11
RESOURCE DOI: https://doi.org/10.5281/zenodo.2634588
Despite the growing popularity of knowledge graphs for managing diverse data at large scale, users who wish to pose expressive queries against such graphs are often expected to know (i) how to formulate queries in a language such as SPARQL, and (ii) how entities of interest are described in the graph. In this paper we propose a language that relaxes these expectations; the language’s operators are based on an interactive graph-based exploration that allows non-expert users to simultaneously navigate and query knowledge graphs; we compare the expressivity of this language with SPARQL. We then discuss an implementation of this language that we call RDF Explorer and discuss various desirable properties it has, such as avoiding interactions that lead to empty results. Through a user study over the Wikidata knowledge-graph, we show that users successfully complete more tasks with RDF Explorer than with the existing Wikidata Query Helper, while a usability questionnaire demonstrates that users generally prefer our tool and self-report lower levels of frustration and mental effort.
Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms. In this paper, we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.
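The flavor of the leapfrog idea can be conveyed with a simplified single-attribute intersection (a sketch of ours, not the Jena implementation): each sorted list repeatedly seeks to the largest key currently seen, so the join skips over large gaps instead of scanning them.

    from bisect import bisect_left

    def leapfrog_intersect(sorted_lists):
        """Intersect sorted, duplicate-free lists by repeatedly seeking every list to the largest key seen so far."""
        if any(not lst for lst in sorted_lists):
            return []
        positions = [0] * len(sorted_lists)
        out = []
        while True:
            keys = [lst[pos] for lst, pos in zip(sorted_lists, positions)]
            lo, hi = min(keys), max(keys)
            if lo == hi:
                out.append(hi)             # all lists currently agree on this key: emit it
                hi += 1                    # and make everyone seek past it
            for i, lst in enumerate(sorted_lists):     # "leapfrog": jump to the first key >= hi
                positions[i] = bisect_left(lst, hi, positions[i])
                if positions[i] == len(lst):
                    return out

    # Sorted subject ids matching three triple patterns; their intersection answers the basic graph pattern.
    print(leapfrog_intersect([[1, 3, 4, 7, 9, 12], [3, 4, 5, 9, 15], [0, 3, 9, 21]]))   # [3, 9]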
DOI https://doi.org/10.1007/978-3-030-30793-6_15
For those of us who regularly conduct public opinion research outside of the United States and Europe, it is customary to have to explain whether our findings are “real,” that is, generalizable relationships that advance theory, or some kind of contextual artifact. Infamous Reviewer 2 will ask for an explanation of how context might be affecting the relationships that we are describing, and while it might be irritating to do so, in this case, Reviewer 2 is right. The issue of course is not having to explain how contexts matter, but instead why scholarship examining the US, or certain western countries, is not consistently subject to the same task. In this piece, we advocate for contextualizing public opinion research in all cases, because of course relationships among variables are always context/historic dependent. Rather than being a theoretical shortcoming, we argue that this becomes a theoretical strength: being able to identify the conditions under which our proposed relationships hold and those in which they do not. So, for example, rather than just saying that news consumption is positively related with political participation, as scholars including ourselves have been doing for years, we need to make explicit the news construction conditions under which this is the case, the participatory repertoire being considered, and normative implications of our claims. To engage contextualization, cross-national, cross cultural, cross group and historical comparisons are particularly useful. To build our case for the increasing need for contextualization in political communication research, we will first examine some early comparative research, then we will show some current problematic comparisons, and finally will end with some concluding remarks of the challenges for the field that lie ahead and the benefits of a contextual approach.
https://doi.org/10.1080/10584609.2019.1670897
We show that the topological model, a semantically rich standard to represent GIS data, can be encoded succinctly while efficiently answering a number of topology-related queries. We build on recent succinct planar graph representations so as to encode a model with m edges within 4m + o(m) bits and answer various queries relating nodes, edges, and faces in o(log log m) time, or any time in ω(log m) for a few complex ones.
This paper proposes a practical implementation of an alphabet-partitioning compressed data structure, which represents a string within compressed space and supports the fundamental operations rank and select efficiently. We show experimental results that indicate that our implementation outperforms the current realizations of the alphabet-partitioning approach (which is one of the most efficient approaches in practice). In particular, the time for operation select can be reduced by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, like the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), and the distributed-computation processing of rank and select operations. As far as we know, this is the first study about the support of rank/select operations on a distributed-computing environment.
Representing a static set of integers S, |S| = n, from a finite universe U = [1..u] is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of rank and select on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures.
Although there is an information-theoretic lower bound of B(n, u) = lg(u choose n) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where the elements of S contain non-trivial runs of consecutive elements, one that occurs in many practical situations.
Let C_n denote the class of the (u choose n) distinct sets of n elements over the universe [1..u]. Let also C_n^g ⊂ C_n contain the sets whose n elements are arranged in g ≤ n runs of ℓ_i ≥ 1 consecutive elements from U, for i = 1, …, g, and let C_n^{g,r} ⊂ C_n^g contain all sets that consist of g runs, such that r ≤ g of them have at least 2 elements.
- We introduce new compressibility measures for sets, including L1 = lg|C_n^g| = lg(u−n+1 choose g) + lg(n−1 choose g−1) and L2 = lg|C_n^{g,r}| = lg(u−n+1 choose g) + lg(n−g−1 choose r−1) + lg(g choose r). We show that L2 ≤ L1 ≤ B(n, u).
- We give data structures that use space close to the bounds L1 and L2 and support rank and select in O(1) time.
- We provide additional measures involving entropy-coding run lengths and gaps between items, data structures to support these measures, and show experimentally that these approaches are promising for real-world datasets.
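To make the rank/select interface on run-encoded sets concrete, here is a minimal, deliberately non-succinct sketch in which a set is stored as sorted runs (start, length) and both operations are answered by binary search. It only illustrates the semantics of rank and select on such sets; the structures above achieve O(1) time within space close to L1 and L2, which this toy code does not attempt.

```python
from bisect import bisect_right
from itertools import accumulate

class RunSet:
    """A set of integers stored as maximal runs [start, start + length - 1]."""
    def __init__(self, runs):                       # runs: sorted, non-overlapping (start, length)
        self.starts = [s for s, _ in runs]
        self.lengths = [l for _, l in runs]
        # prefix[i] = number of elements contained in runs 0..i-1
        self.prefix = [0] + list(accumulate(self.lengths))

    def rank(self, x):
        """Number of elements <= x."""
        i = bisect_right(self.starts, x) - 1        # last run starting at or before x
        if i < 0:
            return 0
        inside = min(x - self.starts[i] + 1, self.lengths[i])
        return self.prefix[i] + inside

    def select(self, j):
        """The j-th smallest element (1-based)."""
        i = bisect_right(self.prefix, j - 1) - 1    # run containing the j-th element
        return self.starts[i] + (j - 1 - self.prefix[i])

S = RunSet([(3, 4), (10, 1), (20, 5)])              # {3,4,5,6, 10, 20,...,24}
print(S.rank(6), S.rank(15), S.select(5), S.select(6))  # 4 5 10 20
```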
The 𝑘²-tree is a successful compact representation of binary relations that exhibit sparseness and/or clustering properties. It can be extended to d dimensions, where it is called a 𝑘ᵈ-tree. The representation boils down to a long bitvector. We show that interpreting the 𝑘ᵈ-tree as a dynamic trie on the Morton codes of the points, instead of as a dynamic representation of the bitvector as done in previous work, yields operation times that are below the lower bound of dynamic bitvectors and offers improved time performance in practice.
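For illustration, a Morton (Z-order) code interleaves the bits of a point's coordinates, so that descending the 𝑘ᵈ-tree corresponds to reading the code top-down as trie symbols; a minimal 2-D sketch of the code computation follows (the dynamic trie itself and the update machinery of the paper are not reproduced).

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of (x, y): bit i of x goes to bit 2i, bit i of y to bit 2i+1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

# Points that are close in the plane tend to share long Morton-code prefixes,
# which is what lets a trie over the codes exploit clustering.
print(bin(morton_code(5, 3)))   # 0b11011  (x = 101 and y = 011 interleaved)
```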
Radical populists in the Andes have combined a populist program and a multicultural agenda. However, while populism centralizes power in the hands of the leader and emphasizes the unity of the people, multiculturalism grants cultural rights that strengthen societal autonomy, generating an inherent tension between these two modes of incorporation. How are populist governments able to combine unity and fragmentation as well as centralization and autonomy? This article develops the concept of populist multiculturalism, focusing on the Movimiento al Socialismo (MAS) in Bolivia, which has supported autonomy rights while simultaneously curtailing their implementation. Specifically, it examines the implementation of indigenous autonomous governments and prior consultation and the relationship between indigenous organizations and the ruling party. The article also extends this concept to Ecuador and Venezuela.
The semantic network of a dictionary is a mathematical structure that represents relationships among words of a language. In this work, we study the evolution of the semantic network of the Spanish dictionary during the last century, from 1925 to 2014. We analysed the permanence and changes of its structural properties, such as size of components, average shortest path length, and degree distribution. We found that global structural properties of the Spanish dictionary network are remarkably stable. In fact, if we remove all the labels from the network, networks from different editions of the Spanish dictionary are practically indistinguishable. On the other hand, local properties, such as the neighbourhood of a single word or the shared neighbourhood between a pair of words, change over the years, offering insights about the evolution of the lexicon. This paper presents preliminary evidence that dictionary networks are an interesting language tool and good proxies to study semantic clouds of words and their evolution in a given language.
LINK: https://elex.link/elex2019/wp-content/uploads/2019/09/eLex_2019_47.pdf
Political parties with activists are in decline due to various external shocks. Societal changes, like the emergence of new technologies of communication, have diminished the role and number of activists, while party elites increasingly can make do without grassroots activists. However, recent scholarship concerning different democracies has shown how activism still matters for representation. This book contributes to this literature by analyzing the unique case of the Uruguayan Frente Amplio (FA), the only mass-organic, institutionalized leftist party in Latin America. Using thick description, systematic process tracing, and survey research, this case study highlights the value of an organization-centered approach for understanding parties’ role in democracy. Within the FA, organizational rules grant activists a significant voice, which imbues activists’ participation with a strong sense of efficacy. This book is an excellent resource for scholars and students of Latin America and comparative politics who are interested in political parties and the challenges confronting new democracies.
To determine whether a document or a sentence expresses a positive or negative sentiment, three main approaches are commonly used: the lexicon-based approach, the corpus-based approach, and a hybrid approach. English has by far the largest number of sentiment analysis studies, while research is more limited for other languages, including Arabic and its dialects. Lexicon-based approaches need annotated sentiment lexicons (containing the valence and intensity of their terms and expressions). Corpus-based sentiment analysis requires annotated sentences. One of the significant problems related to the treatment of Arabic and its dialects is the lack of these resources. In this survey, we present the most recent resources and advances in Arabic sentiment analysis, covering work published mostly between 2015 and 2019. These works are classified by category (survey work or contribution work). For contribution work, we focus on the construction of sentiment lexicons and corpora. We also describe emergent trends related to Arabic sentiment analysis, principally associated with the use of deep learning techniques.
Given an arbitrary set A of two-dimensional points over a totally-ordered domain, a two-sided planar range query consists in finding all points of A within an arbitrary quadrant. In this paper we present a novel data structure that uses linear space in |A| while allowing for two-dimensional orthogonal range queries with logarithmic pre-processing and constant-delay enumeration.
E-Sports has been positioned as an important activity within MOBA (Multiplayer Online Battle Arena) games in recent years. There is existing research on recommender systems in this area, but most of it focuses on the character recommendation problem. However, the recommendation of items is also challenging because of its contextual nature, depending on the other characters. We have developed a framework that suggests items for a character based on the match context. The system aims to help players who have recently started the game, as well as frequent players, to take strategic advantage during a match and to improve their purchasing decision making. By analyzing a dataset of ranked matches through data mining techniques, we capture the purchase dynamics of experienced players and use them to generate recommendations. The results show that our proposed solution yields up to 80% mAP, suggesting that the method leverages context information successfully. These results, together with open issues we mention in the paper, call for further research in the area.
Clustering-based recommender systems bound the search for similar users to small user clusters, providing fast recommendations in large-scale datasets. Groups can then naturally be distributed into different data partitions, scaling up the number of users the recommender system can handle. Unfortunately, as the number of users and items included in a cluster solution increases, the precision of a clustering-based recommender system decreases. We present a novel approach that introduces a cluster-based distance function used for neighborhood computation. In our approach, clusters generated from the training data provide the basis for neighborhood selection. Then, to expand the search for relevant users, we use a novel measure that can exploit the global cluster structure to infer the distances of users outside the cluster. Empirical studies on five widely known benchmark datasets show that our proposal is very competitive in terms of precision, recall, and NDCG. However, the strongest point of our method is its scalability, reaching speedups of 20× in a sequential computing evaluation framework and up to 100× in a parallel architecture. These results show that an efficient implementation of our cluster-based CF method can handle very large datasets while also providing good results in terms of precision, avoiding the high computational costs involved in the application of more sophisticated techniques.
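As a rough illustration of the two ingredients just described (restricting neighbors to the user's own cluster, plus a cheap proxy distance to reach users outside it), the sketch below uses a centroid-based proxy; the actual cluster-outside measure proposed in the paper is more elaborate, and all data here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def neighbors(user_id, ratings, kmeans, n_neighbors=5):
    """Rank candidate neighbors: exact distances inside the user's cluster,
    and a cheap centroid-based proxy distance for users in other clusters."""
    labels, centroids = kmeans.labels_, kmeans.cluster_centers_
    u, u_cluster = ratings[user_id], labels[user_id]
    dist = np.empty(len(ratings))
    for v in range(len(ratings)):
        if labels[v] == u_cluster:
            dist[v] = np.linalg.norm(u - ratings[v])            # in-cluster: exact
        else:
            dist[v] = np.linalg.norm(u - centroids[labels[v]])  # outside: proxy via centroid
    dist[user_id] = np.inf
    return np.argsort(dist)[:n_neighbors]

# Hypothetical dense user-item rating matrix (rows: users).
rng = np.random.default_rng(3)
ratings = rng.integers(0, 6, size=(100, 20)).astype(float)
kmeans = KMeans(n_clusters=5, n_init=10).fit(ratings)
print(neighbors(0, ratings, kmeans))
```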
Deep learning is a branch of machine learning that generates multi-layered representations of data, commonly using artificial neural networks, and has improved the state-of-the-art in various machine learning tasks (e.g., image classification, object detection, speech recognition, and document classification). However, most popular deep learning frameworks such as TensorFlow and PyTorch require users to write code to apply deep learning. We present WekaDeeplearning4j, a Weka package that makes deep learning accessible through a graphical user interface (GUI). The package uses Deeplearning4j as its backend, provides GPU support, and enables GUI-based training of deep neural networks such as convolutional and recurrent neural networks. It also provides pre-processing functionality for image and text data.
Inorganic arsenic (As) is a toxic xenobiotic and carcinogen associated with severe health conditions. The urban population from the Atacama Desert in northern Chile was exposed to extremely high As levels (up to 600 µg/l) in drinking water between 1958 and 1971, leading to increased incidence of urinary bladder cancer (BC), skin cancer, kidney cancer, and coronary thrombosis decades later. Moreover, the Andean Native-American ancestors of the Atacama population had been exposed to elevated As levels in water (∼120 µg/l) for at least 5,000 years, suggesting adaptation to this selective pressure. Here, we performed two genome-wide selection tests—PBSn1 and an ancestry-enrichment test—in an admixed population from Atacama, to identify adaptation signatures to As exposure acquired before and after admixture with Europeans, respectively. The second top variant selected by PBSn1 was associated with LCE4A-C1orf68, a gene that may be involved in the immune barrier of the epithelium during BC. We performed association tests between the top PBSn1 hits and BC occurrence in our population. The strongest association (P = 0.012) was achieved by the LCE4A-C1orf68 variant. The ancestry-enrichment test detected highly significant signals (P = 1.3 × 10⁻⁹) mapping MAK16, a gene with important roles in ribosome biogenesis during the G1 phase of the cell cycle. Our results contribute to a better understanding of the genetic factors involved in adaptation to the pathophysiological consequences of As exposure.
The raster model is commonly used for the representation of images in many domains and is especially useful in Geographic Information Systems (GIS) to store information about continuous variables of the space (elevation, temperature, etc.). Current representations of raster data are usually designed for external memory or, when stored in main memory, lack efficient query capabilities. In this paper, we propose compact representations to efficiently store and query raster datasets in the main memory. We present different representations for binary raster data, general raster data, and time-evolving raster data. We experimentally compare our proposals with traditional storage mechanisms such as linear quadtrees or compressed GeoTIFF files. Results show that our structures are up to 10 times smaller than classical linear quadtrees, and even comparable in space to non-queriable representations of raster data, while efficiently answering a number of typical queries.
Interactive technologies are changing the ways we learn facts, develop attitudes and participate in politics, with the ensuing risk of increasing pre-existing inequalities. Addressing this challenge is the duty of researchers, technology companies, governments and news organizations.
This article deals with fieldwork in challenging research contexts that make preparation for field research particularly difficult. Challenging contexts include generally insecure places, politicized contexts, and unknown settings. Drawing on our experience in the field, we discuss four challenges that are common across these contexts: access, positionality, researcher well-being, and research design and data collection. Bringing together insights from fieldwork with urban elites and in the countryside, this paper describes problems that occurred in both settings and identifies a set of interpersonal skills that helped the authors to tackle the challenges of the field and seize the opportunities it offered. This article posits that recognizing the importance of certain interpersonal skills, namely openness, empathy, humility, and flexibility, precedes the identification of practical tools. Interpersonal skills, instead, concern a general attitude that underlies researchers’ capacity to make informed choices about specific courses of action, preparing fieldworkers to confront problems once they arise.
This paper describes a submission to the Word-in-Context competition for the IJCAI 2019 SemDeep-5 workshop. The task is to determine whether a given focus word is used in the same or different senses in two contexts. We took an ELMo-inspired approach similar to the baseline model in the task description paper, where contextualized representations are obtained for the focus words and a classification is made according to the degree of similarity between these representations. Our model had a few simple differences, notably joint training of the forward and backward LSTMs, a different choice of states for the contextualized representations and a new similarity measure for them. These changes yielded a 3.5% improvement on the ELMo baseline.
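At prediction time, similarity-based WiC systems of this kind reduce to thresholding a similarity between the two contextualized vectors of the focus word. The sketch below uses plain cosine similarity on hypothetical, pre-computed vectors; the LSTM states and the learned similarity measure of the actual submission are not reproduced.

```python
import numpy as np

def same_sense(vec_a, vec_b, threshold=0.6):
    """Predict 'same sense' if the cosine similarity of the two contextualized
    representations of the focus word exceeds a threshold tuned on dev data."""
    cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos >= threshold

# Hypothetical contextualized vectors for "bank" in two different sentences.
v1, v2 = np.random.rand(1024), np.random.rand(1024)
print(same_sense(v1, v2))
```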
Introduction: Deep Neural Networks (DNNs) are a type of machine learning model (Goodfellow et al., 2016) which have reported state-of-the-art results in several tasks in the past years. Despite the impressive results reported by these models in several fields such as computer vision (Krizhevsky et al., 2014), natural language processing (Mikolov et al., 2013) or recommender systems (Covington et al., 2016), one of their biggest drawbacks is their lack of interpretability and transparency. Some of the best performing DNN models have millions of parameters, so making sense of what these models learn is an active research challenge. These algorithms can help to solve and automate difficult and expensive tasks, but their adoption in critical domains, which usually require accountability, depends on making their decisions interpretable by humans. Some large funding programs such as DARPA XAI (Gunning and Aha, 2019) are addressing this problem, providing evidence of its importance. On the other hand, recent legislation such as Europe’s GDPR gives people the right to explainability of automated decisions regarding their private data.
One of the most significant techniques introduced to DNNs in recent years is the so-called attention mechanism (Larochelle and Hinton, 2010). The idea is inspired by our visual system, since humans focus selectively on parts rather than on a whole image, combining information from several fixations to form the full scene (Mnih et al., 2014). This mechanism allows the network to focus on a subset of inputs or parameters when trained on a task. Attention has improved the performance of these models, and it has also given them a chance to be more explainable. Inspecting what the model is paying attention to helps to make the model accountable in tasks such as image classification, document classification or automatic image captioning. Despite this potential, researchers in the area of machine learning usually use the traditional visualization idioms available in software packages, rather than studying all the options for visual encodings to represent models, results or parameters more effectively. We see a chance of using design principles from information visualization in order to improve the way that neural attention models are visually presented.
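As a point of reference for the discussion below, the attention weights one would want to visualize come from a softmax over query-key scores; a minimal numpy sketch of generic scaled dot-product attention (with made-up encoder states, not any model from the use case) is:

```python
import numpy as np

def soft_attention(query, keys, values):
    """Scaled dot-product attention: a softmax over query-key scores yields
    weights that can be inspected/visualized, and a weighted sum of the
    values is returned as the attended representation."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, weights @ values

# Hypothetical encoder states for a 5-token document, dimension 8.
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))
query = rng.normal(size=8)
weights, context = soft_attention(query, states, states)
print(weights.round(2))   # the attention distribution one would visualize
```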
This article focuses on the design space to analyze, inspect and understand what neural attention models are learning. In particular, we aim at contributing to the field of Explainable Artificial Intelligence (XAI), by describing the potential design space as well as informed decisions to take into account when presenting the results of neural networks using the attention mechanism. We also propose some initial ideas with a use case: classification of biomedical documents.
https://observablehq.com/@clpuc/analyzing-the-design-space-for-visualizing-neural-attenti
Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches attempt to reduce the workload of physicians who need to screen and label hundreds or thousands of documents in order to answer specific clinical questions. Previous works have attempted to semi-automate document screening, reporting promising results, but their evaluation is conducted using small datasets, which hinders generalization. Moreover, some recent works have used recently introduced neural language models, but no previous work has compared, for this task, the performance of different language models based on neural word embeddings, which have reported good results in recent years for several NLP tasks. In this work, we evaluate the performance of two popular neural word embeddings (Word2vec and GloVe) in an active learning-based setting for document screening in EBM, with the goal of reducing the number of documents that physicians need to label in order to answer clinical questions. We evaluate these methods on a small public dataset (HealthCLEF 2017) as well as a larger one (Epistemonikos). Our experiments indicate that Word2vec has less variance and better general performance than GloVe when using active learning strategies based on uncertainty sampling.
http://ceur-ws.org/Vol-2414/paper10.pdf
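For reference, the uncertainty-sampling loop used in this kind of screening setting can be sketched as follows. This is a generic scikit-learn illustration, not the paper's pipeline: X would hold document vectors (e.g., averaged Word2vec or GloVe embeddings), y the relevance labels, and the seed must contain examples of both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X, y, seed_idx, rounds=5, batch=10):
    """Iteratively query labels for the documents the current model is least
    certain about (predicted probability closest to 0.5). Here y plays the
    role of the expert who provides the requested labels."""
    labeled = list(seed_idx)                             # must contain both classes
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        probs = clf.predict_proba(X)[:, 1]
        uncertainty = -np.abs(probs - 0.5)               # higher = more uncertain
        uncertainty[labeled] = -np.inf                   # never re-query labeled docs
        labeled.extend(int(i) for i in np.argsort(uncertainty)[-batch:])
    return clf, labeled

# Synthetic stand-in for embedded documents and their relevance labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
seed = [int(np.flatnonzero(y == 0)[0]), int(np.flatnonzero(y == 1)[0])]
clf, labeled = uncertainty_sampling_loop(X, y, seed)
print(len(labeled), "documents labeled instead of", len(y))
```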
Scholars have advanced many theoretical explanations for expecting a negative or positive relationship between individuals’ cross-cutting exposure—either through interpersonal or mediated forms of communication—and their political participation. However, whether cross-cutting exposure is a positive or negative predictor of participation is still an unsettled question. To help fill this gap, we conducted a meta-analysis of 48 empirical studies comprising more than 70,000 participants examining the association between cross-cutting exposure and political participation. The meta-analysis produced two main findings. First, it shows that, over all studies, there is no significant relationship, r = .002, Zr = .002 (95% CI = −.04 to .05). Second, the null relationship cannot be explained by variations in the characteristics of cross-cutting environments (e.g., topic, place, or source of exposure), participation outcomes (e.g., online vs. offline activities), or methods employed (e.g., experiment vs. survey). Taken together, these results should alleviate concerns about negative effects of cross-cutting exposure on political engagement. Implications for future research are discussed.
Online activity involves the consumption and production of event-related content. There are about 500 million Twitter messages published every day, and according to surveys, 59% of its users use the platform as a way to get the news. Its high rate of production of multimodal content (text, images, and videos) necessitates having flexible models to understand the dynamics of the information disseminated on social media. This thesis proposes the creation of context models from user-generated messages on Twitter to discover knowledge as a way to perform high-level quantitative analysis of news events. These models are useful from three perspectives: the spatio-temporal context in which the events develop, the activity of users that react when a high-impact event happens, and the multimodal content that can be exploited to generate a comprehensive summary of the event. Our current work involves the creation of a geopolitical model that relates events and countries, allowing us to discover international relations; the study of what features make an event susceptible to provoke high activity from users, and a characterization that allows us to predict with high precision which events are going to produce high activity. This includes our ongoing work on generating automatic multimodal summaries of events based on the assumption that users describe the non-textual content in their tweets when they express their facts and opinions around events.
Introduction: Memory-bounded search algorithms are typically used when the search space is too large for regular best-first search algorithms like A* to store in memory. There exists a large class of memory-bounded best-first search algorithms including Depth-First Branch-and-Bound (DFBnB), Iterative Deepening A* (IDA*) (Korf 1985), Recursive Best-First Search (RBFS) (Korf 1993), and Simplified Memory-Bounded A* (SMA*) (Russell 1992). Each of these algorithms relies on a different strategy to ensure that it uses only a bounded amount of memory: IDA* bounds the amount of memory used by repeatedly running depth-first searches, increasing the explored depth at each iteration. RBFS uses lower and upper bounds that are tightened over time as it explores the search space while keeping only b · d nodes in memory, where b is the branching factor and d is the depth of the tree. And, finally, SMA* keeps only a bounded number of nodes in memory by pruning the least promising nodes from the OPEN list when it runs out of memory. In this abstract, we summarize an alternative approach to memory-bounded best-first search. It is motivated by real-time heuristic search algorithms (Korf 1990), many of which iterate the following steps until the goal is reached: up to k nodes are expanded, where k is a user-defined bound; the h values of expanded nodes are updated to make them more informed; the agent moves along a path in the search tree just expanded. We propose a general framework that iteratively (1) runs a memory-bounded best-first search algorithm that terminates when k nodes are generated. If no solution is found, (2) it updates the h-values of the generated nodes, and (3) purges the h-values of some nodes from memory. As such, the total number of h-values ever stored by our approach is upper-bounded by a constant. Under certain (reasonable) conditions, our framework is complete and preserves the (sub)optimality guarantees of the given best-first search algorithm in tree-shaped search spaces. The main conceptual difference between our framework and the SMA* algorithm is that ours can be combined with any best-first algorithm with very minor modifications. We present experimental results where we plug into our framework memory-bounded variants of Weighted A* (Pohl 1970). On traveling salesman problems we show that our framework often finds better solutions than DFBnB and Weighted DFBnB (wDFBnB), and in less time, especially in problems with large search spaces.
https://aaai.org/ocs/index.php/SOCS/SOCS19/paper/viewFile/18376/17491
Introduction: Multi-Agent Pathfinding (MAPF) over grids is the problem of finding n non-conflicting paths that lead n agents from a given initial cell to a given goal cell. Sum-of-costs-optimal MAPF, or simply cost-optimal MAPF, in addition minimizes the total number of actions performed by each agent before stopping at the goal. Being a combinatorial problem in nature, a number of compilations from MAPF to Satisfiability (SAT) (Surynek et al. 2016) and Answer Set Programming (ASP) exist (Erdem et al. 2013; Gebser et al. 2018). Here we propose and evaluate a new compilation of MAPF over grids to ASP. Unlike existing compilations we are aware of, both to SAT and to ASP, our encoding is the first that produces a number of clauses that is linear in the number of agents. In addition, the clauses that represent the optimization objective are also written efficiently, and do not depend on the size of the grid. Like makespan-optimal approaches, our algorithm searches for cost-optimal solutions with increasing makespan. When a solution is found, a provably correct upper bound on the maximum makespan at which a true cost-optimal solution exists is computed, and the solver is rerun once more.
https://aaai.org/ocs/index.php/SOCS/SOCS19/paper/viewFile/18374/17489
We study the boundedness problem for unions of conjunctive regular path queries with inverses (UC2RPQs). This is the problem of, given a UC2RPQ, checking whether it is equivalent to a union of conjunctive queries (UCQ). We show the problem to be ExpSpace-complete, thus coinciding with the complexity of containment for UC2RPQs. As a corollary, when a UC2RPQ is bounded, it is equivalent to a UCQ of at most triple-exponential size, and in fact we show that this bound is optimal. We also study better behaved classes of UC2RPQs, namely acyclic UC2RPQs of bounded thickness, and strongly connected UCRPQs, whose boundedness problem is, respectively, PSpace-complete and Pi_2^P-complete. Most upper bounds exploit results on limitedness for distance automata, in particular extending the model with alternation and two-wayness, which may be of independent interest.
Monadic decomposability – the ability to determine whether a formula in a given logical theory can be decomposed into a boolean combination of monadic formulas – is a powerful tool for devising a decision procedure for a given logical theory. In this paper, we revisit a classical decision problem in automata theory: given a regular (a.k.a. synchronized rational) relation, determine whether it is recognizable, i.e., whether it has a monadic decomposition (that is, a representation as a boolean combination of cartesian products of regular languages). Regular relations are expressive formalisms which, using an appropriate string encoding, can capture relations definable in Presburger Arithmetic. In fact, their expressive power coincides with relations definable in a universal automatic structure; equivalently, those definable by finite set interpretations in WS1S (Weak Second Order Theory of One Successor). Determining whether a regular relation is recognizable was known to be decidable (and in exponential time for binary relations), but its precise complexity has hitherto remained open. Our main contribution is to fully settle the complexity of this decision problem by developing new techniques employing infinite Ramsey theory. The complexity for DFA (resp. NFA) representations of regular relations is shown to be NLOGSPACE-complete (resp. PSPACE-complete).
This work introduces a companion reproducible paper with the aim of allowing the exact replication of the methods, experiments, and results discussed in a previous work (Claude et al., 2016). In that parent paper, we proposed many and varied techniques for compressing indexes that exploit the fact that highly repetitive collections are formed mostly of documents that are near-copies of others. More concretely, we describe a replication framework, called uiHRDC (universal indexes for Highly Repetitive Document Collections), that allows our original experimental setup to be easily replicated using various document collections. The corresponding experimentation is carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, note that we also provide uiHRDC as a reproducibility package.
Māori loanwords are widely used in New Zealand English for various social functions by New Zealanders within and outside of the Māori community. Motivated by the lack of linguistic resources for studying how Māori loanwords are used in social media, we present a new corpus of New Zealand English tweets. We collected tweets containing selected Māori words that are likely to be known by New Zealanders who do not speak Māori. Since over 30% of these words turned out to be irrelevant, we manually annotated a sample of our tweets into relevant and irrelevant categories. This data was used to train machine learning models to automatically filter out irrelevant tweets.
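A minimal sketch of the kind of relevance filter this data enables is shown below: a generic scikit-learn text classifier trained on a handful of hypothetical labeled tweets (the corpus, features and models actually used for the published dataset may differ).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical annotated examples: 1 = the tweet genuinely uses the Māori loanword,
# 0 = irrelevant (e.g., a homograph, username, or unrelated usage).
tweets = ["Had a lovely hangi with the whanau last night",
          "Hangi is my gamertag, add me",
          "Kia ora everyone, see you at the marae",
          "kia ora spelled backwards is aro aik"]
labels = [1, 0, 1, 0]

relevance_filter = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
relevance_filter.fit(tweets, labels)
print(relevance_filter.predict(["We cooked a hangi for the school fundraiser"]))
```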
The sheer amount of newsworthy information published by users in social media platforms makes it necessary to have efficient and effective methods to filter and organize content. In this scenario, off-the-shelf methods fail to process large amounts of data, which is usually approached by adding more computational resources. Simple data aggregations can help to cope with space and time constraints, while at the same time improve the effectiveness of certain applications, such as topic detection or summarization. We propose a lightweight representation of newsworthy social media data. The proposed representation leverages microblog features, such as redundancy and re-sharing capabilities, by using surrogate texts from shared URLs and word embeddings. Our representation allows us to achieve comparable clustering results to those obtained by using the complete data, while reducing running time and required memory. This is useful when dealing with noisy and raw user-generated social media data.
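As a rough illustration of the representation idea, the sketch below averages the word embeddings of a surrogate text (e.g., the title of the URL shared in a tweet) and clusters the resulting vectors; the embedding table, titles and clustering settings are hypothetical placeholders, and the paper's representation differs in its details.

```python
import numpy as np
from sklearn.cluster import KMeans

def surrogate_vector(surrogate_text, embeddings, dim=50):
    """Average the word embeddings of the surrogate text (e.g., the title of
    the URL shared in the tweet) instead of the noisy tweet text itself."""
    words = [w for w in surrogate_text.lower().split() if w in embeddings]
    if not words:
        return np.zeros(dim)
    return np.mean([embeddings[w] for w in words], axis=0)

# Hypothetical tiny embedding table and surrogate titles.
rng = np.random.default_rng(1)
vocab = ["earthquake", "chile", "election", "results", "tsunami", "alert"]
embeddings = {w: rng.normal(size=50) for w in vocab}
titles = ["Earthquake hits Chile", "Tsunami alert after earthquake", "Election results in Chile"]
X = np.vstack([surrogate_vector(t, embeddings) for t in titles])
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))
```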
Hate speech is an important problem that is seriously affecting the dynamics and usefulness of online social communities. Large-scale social platforms are currently investing important resources into automatically detecting and classifying hateful content, without much success. On the other hand, the results reported by state-of-the-art systems indicate that supervised approaches achieve almost perfect performance, but only within specific datasets. In this work, we analyze this apparent contradiction between existing literature and actual applications. We study closely the experimental methodology used in prior work and its generalizability to other datasets. Our findings evidence methodological issues, as well as an important dataset bias. As a consequence, performance claims of the current state of the art have become significantly overestimated. The problems that we have found are mostly related to data overfitting and sampling issues. We discuss the implications for current research and re-run experiments to give a more accurate picture of the current state-of-the-art methods.
Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge, mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model to include causal activation relationships among proteins in candidate pathways, using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information for deciding on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.
In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge, which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question of when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or fixed-parameter tractable. We study this question for a range of ontology-mediated query languages based on several important and widely-used DLs, using unions of conjunctive queries as the actual queries. For the DL ELHI⊥, we provide a characterization of the classes of OMQs that are fixed-parameter tractable. For its fragment ELH⊥^dr, which restricts the use of inverse roles, we provide a characterization of the classes of OMQs that are tractable in combined complexity. Both results are in terms of equivalence to OMQs of bounded tree width and rest on a reasonable assumption from parameterized complexity theory. They are similar in spirit to Grohe’s seminal characterization of the tractable classes of conjunctive queries over relational databases. We further study the complexity of the meta problem of deciding whether a given OMQ is equivalent to an OMQ of bounded tree width, providing several completeness results that range from NP to 2ExpTime, depending on the DL used. We also consider the DL-Lite family of DLs, including members that, unlike ELHI⊥, admit functional roles.
A typical reverse engineering problem for a query language L is: given a database D and sets P and N of tuples over D labeled as positive and negative examples, respectively, is there a query q in L that explains P and N, i.e., such that the evaluation of q on D contains all positive examples in P and none of the negative examples in N? Applications of reverse engineering problems include query-by-example, classifier engineering, and the study of the expressive power of query languages. In this talk I will present a family of tests that solve the reverse engineering problem described above for several query languages of interest, e.g., FO, CQ, UCQs, RPQs, CRPQs, etc. We will see that in many cases such tests directly provide optimal bounds for the problem, as well as for the size of the smallest query that explains the given labeled examples. I will also present restrictions that alleviate the complexity of the problem when it is too high. Finally, I will develop the relationship between reverse engineering and a separability problem recently introduced to assist the task of feature engineering with data management tools.
The mechanisms by which users of platforms such as Facebook and Twitter spread misinformation are not well understood. In this study, we argue that the effects of informational uses of social media on political participation are inextricable from its effects on misinformation sharing. That is, political engagement is both a major consequence of using social media for news as well as a key antecedent of sharing misinformation. We test our expectations via a two-wave panel survey of online media users in Chile, a country experiencing information disorders comparable to those of the global North. Analyses of the proposed and alternative causal models with two types of structural equation specifications (fixed effects and autoregressive) support our theoretical model. We close with a discussion on how changes in the way people engage with news and politics – brought about by social media – have produced a new dilemma: how to sustain a citizenry that is enthusiastically politically active, yet not spreading misinformation?
The field of knowledge compilation establishes the tractability of many tasks by studying how to compile them to Boolean circuit classes obeying some requirements such as structuredness, decomposability, and determinism. However, in other settings such as intensional query evaluation on databases, we obtain Boolean circuits that satisfy some width bounds, e.g., they have bounded treewidth or pathwidth. In this work, we give a systematic picture of many circuit classes considered in knowledge compilation and show how they can be systematically connected to width measures, through upper and lower bounds. Our upper bounds show that bounded-treewidth circuits can be constructively converted to d-SDNNFs, in time linear in the circuit size and singly exponential in the treewidth; and that bounded-pathwidth circuits can similarly be converted to uOBDDs. We show matching lower bounds on the compilation of monotone DNF or CNF formulas to structured targets, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable): any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth, and the same holds for uOBDDs (resp., n-OBDDs) when considering pathwidth. Unlike most previous work, our bounds apply to any formula of this class, not just a well-chosen family. Hence, we show that pathwidth and treewidth respectively characterize the efficiency of compiling monotone DNFs to uOBDDs and d-SDNNFs with compilation being singly exponential in the corresponding width parameter. We also show that our lower bounds on CNFs extend to unstructured compilation targets, with an exponential lower bound in the treewidth (resp., pathwidth) when compiling monotone CNFs of constant arity and degree to DNNFs (resp., nFBDDs).
Drawing on social-psychological and political research, we offer a theoretical model that explains how people become exposed to fake news, come to believe in them and then share them with their contacts. Using two waves of a nationally representative sample of Chileans with internet access, we pinpoint the relevant causal factors. Analysis of the panel data indicates that three groups of variables largely explain these phenomena: (1) Personal and psychological factors such as belief in conspiracy theories, trust in others, education and gender; (2) Frequency and specific uses of social media; and (3) Political views and online activism. Importantly, personal and political-psychological factors are more relevant in explaining this behavior than specific uses of social media.
The popularity of microblogging websites makes them important for information dissemination. The diffusion of large volumes of fake or unverified information could emerge and spread, producing damage. Due to the ever-increasing volume of data and the nature of complex diffusion, automatic rumor detection is a very challenging task. Supervised classification and other approaches have been widely used to identify rumors in social media posts. However, despite achieving competitive results, only a few studies have delved into the nature of the problem itself in order to identify key empirical factors that allow defining both the baseline models and their performance. In this work, we learn discriminative features from tweet content and propagation trees by following their sequential propagation structure. To do this, we study the performance of a number of architectures based on recursive neural networks for rumor detection. In addition, to ingest tweets into each network, we study the effect of two different word embedding schemes: GloVe and Google News skip-grams. Results on the Twitter16 dataset show that model performance depends on many empirical factors and that some specific experimental configurations consistently lead to better results.
The Mercalli scale of earthquake damage is based on perceived effects, and so it depends strongly on observers. Recently, we proposed a method for ground shaking intensity estimation based on lexical features extracted from tweets, showing good performance in terms of mean absolute error (MAE). One of the flaws of that method is the detection of the region of interest, i.e., the area of a country where the quake was felt. Our previous results showed enough recall in terms of municipality recovery but a poor performance in terms of accuracy. One of the reasons that helps to explain this effect is the presence of data noise, as many people comment on or confirm a quake in areas where the event was unperceived. This happens because people become aware of an event by watching the news or by word-of-mouth propagation. To alleviate this problem in our earthquake detection system, we study how propagation features behave in a region of interest estimation task. The intuition behind our study is that the patterns that characterize word-of-mouth propagation differ from the patterns that characterize a perceived event. If this intuition is true, we expect to separate both kinds of propagation modes. We do this by computing a number of features to represent propagation trees. Then, we train a learning algorithm using our features on the specific task of region of interest estimation. Our results show that propagation features behave well in this task, outperforming lexical features in terms of accuracy.
Social media is the primary source of information for many people around the world, not only to know about their families and friends but also to read about news and trends in different areas of interest. Fake news or rumors can generate big problems of misinformation, being able to change the mindset of a large group of people concerning a specific topic. Many companies and researchers have put their efforts into detecting these rumors with machine learning algorithms, producing reports on the influence of this “news” on social media (https://www.knightfoundation.org/reports/disinformation-fake-news-and-influence-campaigns-on-twitter). Only a few studies have addressed the detection of rumors in real time, considering the first hours of propagation. In this work, we study the spread of a claim, analyzing different characteristics and how propagation patterns behave in time. Experiments show that rumors have different behaviours that can be used to classify them within the first hours of propagation.
Trajectory patterns characterize similar behaviors among trajectories, which play an important role in applications such as urban planning, traffic congestion control, and studies of animal migration and natural phenomena. In this paper we model trajectories as sequences of line segments that represent the steady movement of an object along time. We use a segment-clustering process to group trajectories’ segments and partial segments based on their temporal and spatial closeness. We then define a trajectory pattern as the aggregation of segment clusters, an aggregation based not only on spatial and temporal sequentiality, but also on the compatibility of trajectories in each segment cluster. The experimental assessment shows the effectiveness of the method.
G-CORE is a query language with two key characteristics: it is closed under graphs and incorporates paths as first-class citizens. Currently G-CORE does not have recursion. In this paper we propose this extension and show how to code classical polynomial graph algorithms with it.
RDF and Property Graph databases are two approaches for data management that are based on modeling, storing and querying graph-like data. In this paper, we present a short study about the interoperability between these approaches. We review the current solutions to the problem, identify their features, and discuss the inherent issues.
People use the news media to learn about the world beyond their family, neighborhood, and workplace. As news consumers, we depend on what television, social media, websites, radio stations, and newspapers decide to inform us about. This is because all news media, whether through journalists or digital algorithms, select, process, and filter information to their users. Over time, the aspects that are prominent in the news media usually become prominent in public opinion. The ability of journalists to influence which issues, aspects of these issues, and persons related to these issues, are perceived as the most salient has come to be called the agenda-setting effect of journalism.
First described by Maxwell McCombs and Donald Shaw in a seminal study conducted during the 1968 elections in the United States, agenda-setting theory has expanded to include several other aspects beyond the transfer of salience of issues from the media agenda to the public agenda. These aspects include: the influence of journalism on the attributes of issues and people that make news; the networks between the different elements in the media and public agendas; the determinants of the news media agenda; the psychological mechanisms that regulate agenda-setting effects; and the consequences of agenda setting on both citizens’ and policymakers’ attitudes and behaviors. As one of the most comprehensive and international theories of journalism studies available, agenda setting continues to evolve in the expanding digital media landscape.
https://doi.org/10.1093/acrefore/9780190228613.013.777
This study seeks to deepen our understanding of the factors that explain individuals’ willingness to self-censor (WtSC)—the proclivity to withhold an opinion from an audience perceived to disagree with that opinion. It does so by testing the “impressionable years” hypothesis, which states that the historical context experienced between the age of 18 and 25 years has a lasting effect on individual dispositions such as WtSC. The study was conducted in Chile, an ideal case to explore possible cohort effects because of the profound political changes experienced there in the past 50 years. Analysis of an original cross-sectional survey shows that—as expected—people who came of age in periods of political repression exhibit significantly higher levels of WtSC later in life compared with those who grew up during less repressive times.
We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and the goal is to find feature queries that allow for a linear separation between the two sets of examples. We focus on conjunctive feature queries, and explore two fundamental problems: (a) deciding whether separating feature queries exist (separability), and (b) generating such queries when they exist. In the approximate versions of these problems, we allow a predefined fraction of the examples to be misclassified. To restrict the complexity of the generated classifiers, we explore various ways of regularizing (i.e., imposing simplicity constraints on) them by limiting their dimension, the number of joins in feature queries, and their generalized hypertree width (ghw). Among other results, we show that the separability problem is tractable in the case of bounded ghw; yet, the generation problem is intractable, simply because the feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without ever explicitly generating the feature queries.
Tensors are one of the most widely used data structures in modern Machine Learning applications. Although they provide a flexible way of storing and accessing data, they often expose too many low-level details, which may result in error-prone code that is difficult to maintain and extend. Abstracting low-level functionalities into high-level operators in the form of a query language is a task in which the Data Management community has extensive experience. It is thus important to understand how such experience can be applied in the design of useful languages for tensor manipulation.
In this short paper we study a matrix and a tensor query language that have been recently proposed in the database literature. We show, by using examples, how these proposals are in line with the practical interest in rethinking tensor abstractions. On the technical side, we compare the two languages in terms of operators that naturally arise in Machine Learning pipelines, such as convolution, matrix inverse, and Einstein summation. We hope our results provide a theoretical kick-off for the discussion on the design of core declarative query languages for tensors.
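For concreteness, the three operators mentioned above can be written in a general-purpose array framework such as NumPy as follows; this is only meant to ground the discussion, and is not the syntax of either of the two query languages studied.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

# Einstein summation: matrix multiplication written as an index expression.
C = np.einsum('ij,jk->ik', A, B)          # same result as A @ B

# A 1-D convolution as an explicit sum over a sliding window.
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])
conv = np.convolve(x, k, mode='valid')

# Matrix inverse of a (square, invertible) matrix.
M = np.array([[2.0, 1.0], [1.0, 2.0]])
M_inv = np.linalg.inv(M)

print(C.shape, conv, M_inv @ M)           # (2, 4), the filtered signal, ~identity
```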
In this article we review the main concepts around database repairs and consistent query answering, with emphasis on tracing back the origin, motivation, and early developments. We also describe some research directions that have spun off from those main concepts and the original line of research. We emphasize, in particular, fruitful and recent connections between repairs and causality in databases.
We present a method to support high-quality visualization recommendations for analytic tasks. Visualization converts large datasets into images that allow viewers to efficiently explore, discover, and validate within their data. Visualization recommenders have been proposed that store past sequences (ordered collections of design choices leading to successful task completion) and match them against an ongoing visualization construction. Based on this matching, a system recommends visualizations that better support the analysts’ tasks. A problem of scalability occurs when many sequences are stored. One solution would be to index the sequence database. However, during matching we require sequences that are similar to the partially constructed visualization, not only those that are identical. We implement a locality-sensitive hashing algorithm that converts visualizations into set representations, then uses Jaccard similarity to store similar sequence nodes in common hash buckets. This allows us to match partial sequences against a database containing tens of thousands of full sequences in less than 100 ms. Experiments show that our algorithm locates 95% or more of the sequences found in an exhaustive search, producing high-quality visualization recommendations.
LINK: https://doi.org/10.20380/GI2019.05
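For intuition, the bucketing step can be emulated with MinHash signatures, which approximate Jaccard similarity and make similar set representations collide in common buckets. The sketch below is a generic MinHash/LSH illustration with made-up "sequences" of design choices, not the system's actual signatures or data.

```python
from collections import defaultdict

def minhash_signature(item_set, hash_seeds):
    """One minimum per seeded hash function: the probability that two sets
    agree on a min-hash equals their Jaccard similarity."""
    return tuple(min(hash((seed, x)) for x in item_set) for seed in hash_seeds)

def lsh_buckets(sequences, num_hashes=8, bands=4):
    """Group sequences (as sets of design choices) whose signatures agree on some band."""
    seeds = list(range(num_hashes))
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for name, items in sequences.items():
        sig = minhash_signature(items, seeds)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)
    return buckets

sequences = {"seq1": {"bar", "x=year", "y=sales", "color=region"},
             "seq2": {"bar", "x=year", "y=sales"},
             "seq3": {"scatter", "x=gdp", "y=life_exp"}}
for key, names in lsh_buckets(sequences).items():
    if len(names) > 1:
        print(names)      # candidate matches to compare against the partial visualization
```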
We study two simple yet general complexity classes, which provide a unifying framework for efficient query evaluation in areas like graph databases and information extraction, among others. We investigate the complexity of three fundamental algorithmic problems for these classes: enumeration, counting and uniform generation of solutions, and show that they have several desirable properties in this respect. Both complexity classes are defined in terms of non-deterministic logarithmic-space transducers (NL transducers). For the first class, we consider the case of unambiguous NL transducers, and we prove constant delay enumeration, and both counting and uniform generation of solutions in polynomial time. For the second class, we consider unrestricted NL transducers, and we obtain polynomial delay enumeration, approximate counting in polynomial time, and polynomial-time randomized algorithms for uniform generation. More specifically, we show that each problem in this second class admits a fully polynomial-time randomized approximation scheme (FPRAS) and a polynomial-time Las Vegas algorithm (with preprocessing) for uniform generation. Remarkably, the key idea to prove these results is to show that the fundamental problem #NFA admits an FPRAS, where #NFA is the problem of counting the number of strings of length n (given in unary) accepted by a non-deterministic finite automaton (NFA). While this problem is known to be #P-complete and, more precisely, SpanL-complete, it was open whether it admits an FPRAS. In this work, we solve this open problem, and obtain as a welcome corollary that every function in SpanL admits an FPRAS.
We propose and investigate a concrete numerical measure of the inconsistency of a database with respect to a set of integrity constraints. It is based on a database repair semantics associated to cardinality-repairs. More specifically, it is shown that the computation of this measure can be intractable in data complexity, but answer-set programs are exhibited that can be used to compute it. Furthermore, it is established that there are polynomial-time deterministic and randomized approximations. The behavior of this measure under small updates is analyzed, obtaining fixed-parameter tractability results. We explore abstract extensions of this measure that appeal to generic classes of database repairs. Inconsistency measures and repairs at the attribute level are investigated as a particular, but relevant and natural case.
User satisfaction is an important aspect to consider in any public transport system, and as such, regular and sound measurements of its levels are fundamental. However, typical evaluation schemes involve costly and time-consuming surveys. As a consequence, their frequency is not enough to characterize the satisfaction of the users properly and in a timely manner. In this paper, we propose a methodology, based on Twitter data, to capture the satisfaction of a large mass of users of public transport, allowing us to improve the characterization and location of their satisfaction level. We analyzed a massive volume of tweets referring to the public transport system in Santiago, Chile (Transantiago) using text mining techniques, such as sentiment analysis and topic modeling, in order to capture and group bus users’ expressions. Results show that, although the level of detail and variety of answers obtained from surveys are higher than those obtained by our method, the number of bus stops and bus services covered by the proposed scheme is larger. Moreover, the proposed methodology can be effectively used to diagnose problems in a timely manner, as it is able to identify and locate trends and issues related to bus operating firms, whereas surveys tend to produce average answers. Based on the consistency and logic of the results, we argue that the proposed methodology can be used as a valuable complement to surveys, as both present different, but compatible, characteristics.
https://doi.org/10.1109/ACCESS.2019.2915107
AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabeled tweets. The package was used by several teams in the shared tasks: EmoInt 2017 and Affect in Tweets SemEval 2018 Task 1.
Due to advances in digitization technology, documentation efforts and digital library systems, increasingly large collections of visual Cultural Heritage (CH) object data become available, offering rich opportunities for domain analysis, e.g., for comparing, tracing and studying objects created over time. In principle, existing shape- and image-based similarity search methods can aid such domain analysis tasks. However, in practice, visual object data are given in different modalities, including 2D, 3D, sketches or conventional drawings like profile sections or unwrappings. In addition, collections may be distributed across different publications and repositories, posing a challenge for implementing encompassing search and analysis systems. We introduce a methodology and system for cross-modal visual search in CH object data. Specifically, we propose a new query modality based on 3D views enhanced by user sketches (3D+sketch). This allows for adding new context to the search, which is useful, e.g., for searching based on incomplete query objects, or for testing hypotheses on the existence of certain shapes in a collection. We present an appropriately designed workflow for constructing query views from incomplete 3D objects enhanced by a user sketch, based on shape completion and texture inpainting. Visual cues additionally help users compare retrieved objects with the query. We apply our method to a set of relevant 3D and view-based CH object data, demonstrating the feasibility of our approach and its potential to support the analysis of domain experts in Archaeology and the field of CH in general.
https://doi.org/10.2312/3dor.20191057
The Bitcoin protocol and its underlying cryptocurrency have started to shape the way we view digital currency, and opened up a large list of new and interesting challenges. Amongst them, we focus on the question of how the price of digital currencies is affected, which is a natural question especially when considering the price rollercoaster we witnessed for bitcoin in 2017-2018. We work under the hypothesis that price is affected by the web footprint of influential people, whom we refer to as crypto-influencers.
In this paper we provide neural models for predicting bitcoin price. We compare what happens when the model is fed only with recent price history versus what happens when it is fed, in addition, with a measure of the positivity or negativity of the sayings of these influencers, measured through a sentiment analysis of their Twitter posts. We show preliminary evidence that Twitter data should indeed help to predict the price of bitcoin, even though the measures we use in this paper have a lot of room for refinement. In particular, we also discuss the challenges of measuring the correct sentiment of these posts, and discuss the work that should help improve our findings even further.
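The kind of comparison described above can be sketched as follows: a small feed-forward regressor is trained on a window of recent prices, with and without an extra sentiment feature, on synthetic data. The window size, network shape, and data are illustrative stand-ins, not the models or datasets used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: daily closing prices and a daily sentiment score in [-1, 1].
prices = np.cumsum(rng.normal(0, 50, 400)) + 10_000
sentiment = rng.uniform(-1, 1, 400)

WINDOW = 7  # days of history fed to the model (illustrative choice)

def build_dataset(use_sentiment):
    X, y = [], []
    for t in range(WINDOW, len(prices) - 1):
        features = list(prices[t - WINDOW:t])
        if use_sentiment:
            features.append(sentiment[t])  # influencer sentiment for day t
        X.append(features)
        y.append(prices[t + 1])            # next-day price to predict
    return np.array(X), np.array(y)

for use_sentiment in (False, True):
    X, y = build_dataset(use_sentiment)
    split = int(0.8 * len(X))
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    model.fit(X[:split], y[:split])
    mae = np.abs(model.predict(X[split:]) - y[split:]).mean()
    print(f"sentiment={use_sentiment}: test MAE = {mae:.1f}")
```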
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with a corresponding unambiguous entry in a Knowledge Base. The evaluation of EL systems relies on the comparison of their results against gold standards. A common format used to represent gold standard datasets is the NLP Interchange Format (NIF), which uses RDF as a data model. However, creating gold standard datasets for EL is a time-consuming and error-prone process. In this paper we propose a tool called NIFify to help manually generate, curate, visualize and validate EL annotations; the resulting tool is useful, for example, in the creation of gold standard datasets. NIFify also serves as a benchmark tool that enables the assessment of EL results. Using the validation features of NIFify, we further explore the quality of popular EL gold standards.
Complex human behaviors related to crime require multiple sources of information to understand them. Social Media is a place where people share opinions and news. This allows events in the physical world, like crimes, to be reflected on Social Media. In this paper we study crimes from the perspective of Social Media, specifically car theft and Twitter. We use data of car theft reports from Twitter and car insurance companies in Chile to perform a temporal analysis. We found that there is an increasing correlation in recent years between the number of car theft reports on Twitter and data collected from insurance companies. We performed yearly, monthly, daily and hourly analyses. Though Twitter is an unstructured and very noisy source, it allows us to estimate the volume of thefts that are reported to insurers. We experimented with a moving average to predict the trend in the number of car thefts reported to insurers using Twitter data, and found that one month is the best time window for prediction.
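A minimal sketch of the moving-average experiment, assuming hypothetical monthly counts of theft reports extracted from Twitter: it simply compares the prediction error of different window sizes, echoing the observation that a one-month window works best.

```python
import pandas as pd

# Hypothetical monthly counts of car-theft reports extracted from Twitter.
twitter_reports = pd.Series(
    [120, 135, 128, 150, 160, 155, 170, 165, 180, 175, 190, 200],
    index=pd.period_range("2017-01", periods=12, freq="M"),
)

# A k-month moving average, shifted by one month so that only past data
# is used to predict the current month.
for window in (1, 2, 3):
    prediction = twitter_reports.rolling(window).mean().shift(1)
    mae = (twitter_reports - prediction).abs().mean()
    print(f"window={window} month(s): MAE = {mae:.1f}")
```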
Recent work suggests that certain places can be more attractive for car theft based on how many people regularly visit them, as well as other factors. In this sense, we must also consider the city or district itself where vehicles are stolen. All cities have different cultural and socioeconomic characteristics that influence car theft patterns. In particular, the distribution of public services and places that attract large crowds could play a key role in the occurrence of car theft. Santiago, a city that displays drastic socioeconomic differences among its districts, presents increasingly high car theft rates. This represents a serious issue for the city, as for any other major city, yet (at least for Santiago) it has not been analyzed in depth using quantitative approaches. In this work, we present a preliminary study of how places that create social interest, such as restaurants, bars, schools, and shopping malls, increase car theft frequency in Santiago. We also study whether some types of places are more attractive than others for this type of crime. To evaluate this, we propose to analyze car theft points (CTP) from insurance companies and their relationship with places of social interest (PSI) extracted from Google Maps, using a proximity-based approach. Our findings show a high correlation between CTP and PSI for all of the social interest categories that we studied in the different districts of Santiago. In particular, our work contributes to the understanding of the social factors that are associated with car theft.
The world of video games has changed considerably over the recent years. Its diversification has dramatically increased the number of users engaged in online communities of this entertainment area, and consequently, the number and types of games available. This context of information overload underpins the development of recommender systems that could leverage the information that the video game platforms collect, hence following the trend of new games coming out every year. In this work we test the potential of state-of-the-art recommender models based on Factorization Machines (FM), deep neural networks (DeepNN), and a mixture of both (DeepFM), chosen for their ability to receive multiple inputs as well as different types of input variables. We evaluate our results measuring the ranking accuracy of the recommendation and the diversity/novelty of a recommendation list. All the algorithms achieve better results than a baseline based on implicit feedback (Alternating Least Squares model). The best-performing algorithm is DeepNN, which indicates that high-order interactions are more important than low-order ones for this recommendation task. We also analyze the effect of the sentiment extracted directly from game reviews, and find that it is not as relevant for recommendation as one might expect. We are the first to study these recommender systems in the context of online video game platforms, reporting novel results that could be used as a baseline in future work.
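For readers unfamiliar with Factorization Machines, the sketch below shows only the second-order FM scoring function (with untrained, randomly initialized weights) applied to a one-hot user/game vector plus an illustrative sentiment feature; it is not the trained FM, DeepNN, or DeepFM models evaluated in the paper.

```python
import numpy as np

class FactorizationMachine:
    """Second-order FM scoring: y(x) = w0 + w·x + sum_{i<j} <v_i, v_j> x_i x_j,
    computed with the O(k·n) identity 0.5 * sum_f ((Vx)_f^2 - (V^2 x^2)_f)."""

    def __init__(self, n_features, n_factors=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w0 = 0.0
        self.w = np.zeros(n_features)
        self.V = rng.normal(0, 0.1, (n_features, n_factors))

    def score(self, x):
        linear = self.w0 + self.w @ x
        interactions = 0.5 * np.sum((x @ self.V) ** 2 - (x ** 2) @ (self.V ** 2))
        return linear + interactions

# One-hot user and game indices plus side information (illustrative layout).
x = np.zeros(20)
x[3] = 1.0   # user id 3
x[12] = 1.0  # game id 2
x[18] = 0.7  # normalized sentiment of the user's reviews (hypothetical feature)
print(FactorizationMachine(n_features=20).score(x))
```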
Graph serialization is very important for the development of graph-oriented applications. In particular, serialization methods are fundamental in graph data management to support database exchange, benchmarking of systems, and data visualization. This paper presents YARS-PG, a data format for serializing property graphs. YARS-PG was designed to be simple, extensible and platform independent, and to support all the features provided by the current database systems based on the property graph data model.
Chile is a country with high levels of digital news consumption but decreasing levels of confidence in journalism and traditional news media outlets. In a place where natural disasters are common, Chilean citizens usually turn to digital and social media to find out more information about how events unfold. By relying on in-depth interviews with reporters who covered the 2014 earthquake in northern Chile, this study examines how Chilean journalists approached a highly politicized natural disaster. Results show that reporters covered the earthquake as a political issue due to editorial prompting, and they used social media as another way to get close to the sources they already know, but not to look for alternative sources. The implications of these findings for media scholars and practitioners relate to the normalization of social media use among journalists, and the influence of a news outlet’s political leaning on journalistic practices.
https://isoj.org/research/exposing-the-president-the-political-angle-of-a-natural-disaster-in-chile/
This article presents WekaCoin, a peer-to-peer cryptocurrency based on a new distributed consensus protocol called Proof-of-Learning. Proof-of-Learning achieves distributed consensus by ranking machine learning systems for a given task. The aim of this protocol is to alleviate the computational waste involved in hashing-based puzzles and to create a public, distributed and verifiable database of state-of-the-art machine learning models and experiments.
Different pattern recognition techniques such as clustering, k-nearest neighbor classification, or instance reduction algorithms require prototypes to represent pattern classes. In many applications, strings are used to encode instances, for example, in contour representations or in biological data such as DNA, RNA, and protein sequences. Median strings have been used as representatives of a set of strings in different domains. Finding the median string is an NP-complete problem for several formulations. Alternatively, heuristic approaches that iteratively refine an initial coarse solution by applying edit operations have been proposed. We propose here a novel algorithm that outperforms the state-of-the-art heuristic approximations to the median string in terms of convergence speed, by estimating the effect of a perturbation in the minimization of the expressions that define the median string. We present comparative experiments to validate these results.
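The following sketch shows the general shape of the iterative-refinement heuristics mentioned above: starting from the set median, it greedily applies single edit operations (insertions, substitutions, deletions) while the sum of edit distances decreases. It is a plain greedy baseline, not the perturbation-estimation algorithm proposed in the paper, and the example strings are invented.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def total_distance(candidate, strings):
    return sum(levenshtein(candidate, s) for s in strings)

def refine_median(strings, alphabet):
    # Start from the set string closest to all others (the "set median").
    best = min(strings, key=lambda s: total_distance(s, strings))
    best_cost = total_distance(best, strings)
    while True:
        neighbours = []
        for i in range(len(best) + 1):
            neighbours += [best[:i] + c + best[i:] for c in alphabet]          # insertions
            if i < len(best):
                neighbours += [best[:i] + c + best[i + 1:] for c in alphabet]  # substitutions
                neighbours.append(best[:i] + best[i + 1:])                     # deletions
        cand = min(neighbours, key=lambda s: total_distance(s, strings))
        cand_cost = total_distance(cand, strings)
        if cand_cost >= best_cost:
            return best, best_cost
        best, best_cost = cand, cand_cost

strings = ["karolin", "kathrin", "kerstin", "carolin"]
median, cost = refine_median(strings, alphabet=set("".join(strings)))
print(median, cost)
```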
Recommender Systems help us deal with information overload by suggesting relevant items based on our personal preferences. Although there is a large body of research in areas such as movies or music, artwork recommendation has received comparatively little attention, despite the continuous growth of the artwork market. Most previous research has relied on ratings and metadata, and a few recent works have exploited visual features extracted with deep neural networks (DNN) to recommend digital art. In this work, we contribute to the area of content-based artwork recommendation of physical paintings by studying the impact of the aforementioned features (artwork metadata, neural visual features), as well as manually-engineered visual features, such as naturalness, brightness and contrast. We implement and evaluate our method using transactional data from UGallery.com, an online artwork store. Our results show that artwork recommendations based on a hybrid combination of artist preference, curated attributes, deep neural visual features and manually-engineered visual features produce the best performance. Moreover, we discuss the trade-off between automatically obtained DNN features and manually-engineered visual features for the purpose of explainability, as well as the impact of user profile size on predictions. Our research informs the development of next-generation content-based artwork recommenders which rely on different types of data, from text to multimedia.
We present D-v2v, a new dynamic (one-pass) variable-to-variable compressor. Variable-to-variable compression aims at using a modeler that gathers variable-length input symbols and a variable-length statistical coder that assigns shorter codewords to the more frequent symbols. In D-v2v, we process the input text word-wise to gather variable-length symbols that can be either terminals (new words) or non-terminals, subsequences of words seen before in the input text. Those input symbols are kept in a vocabulary sorted by frequency; therefore, they can be easily encoded with dense codes. Our D-v2v permits real-time transmission of data, i.e. compression/transmission can begin as soon as data become available. Our experiments show that D-v2v is able to improve on the compression ratios of v2vDC, the state-of-the-art semi-static variable-to-variable compressor, and to almost reach p7zip values. It also achieves competitive performance at both compression and decompression.
The raster model is widely used in Geographic Information Systems to represent data that vary continuously in space, such as temperatures, precipitations, or elevation, among other spatial attributes. In applications like weather forecast systems, not just a single raster, but a sequence of rasters covering the same region at different timestamps, known as a raster time series, needs to be stored and queried. Compact data structures have proven successful in providing space-efficient representations of rasters with query capabilities. Hence, a naive approach to save space is to use such a representation for each raster in a time series. However, in this paper, we show that it is possible to take advantage of the temporal locality that exists in a raster time series to reduce the space necessary to store it while keeping competitive query times for several types of queries.
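A toy illustration of why temporal locality pays off: compressing each raster of a synthetic series independently versus compressing the first raster plus the differences between consecutive rasters. The data and the use of zlib are assumptions for the example; the paper relies on compact data structures with query support rather than a general-purpose compressor.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

# A synthetic raster time series: each raster differs only slightly from the previous one.
base = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
series = [base]
for _ in range(23):
    nxt = series[-1].copy()
    rows = rng.integers(0, 256, size=200)
    cols = rng.integers(0, 256, size=200)
    nxt[rows, cols] = rng.integers(0, 256, size=200, dtype=np.uint8)  # few cells change
    series.append(nxt)

def compressed_size(arrays):
    return sum(len(zlib.compress(a.tobytes())) for a in arrays)

independent = compressed_size(series)
deltas = [series[0]] + [np.subtract(b, a, dtype=np.int16)
                        for a, b in zip(series, series[1:])]
with_locality = compressed_size(deltas)
print(f"each raster compressed independently: {independent} bytes")
print(f"first raster + temporal deltas:       {with_locality} bytes")
```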
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called warded Datalog±, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog± is extended to give a set semantics to duplicates in Datalog± itself. We investigate the properties of the resulting Datalog± programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees. In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. Finally, we gather the main results of this work to present an efficient and declarative framework for CEP.
Recommender systems are frequently evaluated using performance indexes based on variants and extensions of precision-like measures. As these measures are biased toward popular items, a list of recommendations simply must include a few popular items to perform well. To address the popularity bias challenge, new approaches for novelty and diversity evaluation have been proposed. On the one hand, novelty-based approaches model the quality of being new as opposed to that which is already known. Novelty approaches are commonly based on item views or user ratings. On the other hand, diversity approaches model the quality of an item that is composed of different content elements. Diversity measures are commonly rooted in content-based features that characterize the diversity of the content of an item in terms of the presence/absence of a number of predefined nuggets of information. As item contents are also biased to popular contents (e.g., drama in movies or pop in music), diversity-based measures are also popularity biased. To alleviate the effect of popularity bias on diversity measures, we used an evaluation approach based on the degree of novelty of the elements that make up each item. We named this approach content novelty, as it mixes content and diversity approaches in a single and coherent evaluation framework. Experimental results show that our proposal is feasible and useful. Our findings demonstrate that the proposed measures yield consistent and interpretable results, producing insights that reduce the impact of popularity bias in the evaluation of recommender systems.
https://doi.org/10.1007/s10844-019-00548-x
The ubiquitous Internet of Things (IoT) devices nowadays generate various and numerous data from everywhere at any time. Since it is not always necessary to centralize and analyze IoT data cumulatively (e.g., the Monte Carlo analytics and Convergence analytics demonstrated in this article), the traditional implementations of big data analytics (BDA) will suffer from unnecessary and expensive data transmissions as a result of the tight coupling between computing resource management and data processing logic. Inspired by software-defined infrastructure (SDI), we propose the “microservice-oriented platform” to break the environmental monolith and further decouple data processing logics from their underlying resource management in order to facilitate BDA implementations in the IoT environment (which we name “IoBDA”). Given predesigned standard microservices with respect to specific data processing logics, the proposed platform is expected to largely reduce the complexity of IoBDA implementations and to support less experienced practitioners. The potential contributions to the relevant communities include (1) new theories of a microservice-oriented platform on top of SDI and (2) a functional microservice-oriented platform for IoBDA with a group of predesigned microservices.
One of the important roles of the mass media is the setting of agendas in daily life. What is emphasized in the media, whether traditional or digital, has been found to have a profound impact on not only what people think, but the salience of the issues at any given point in time. This chapter reviews the theory behind agenda-setting and the variables that form, shape, and prime the public’s opinions, attitudes, and behaviors. The chapter also examines the new media and how it impacts on agenda-setting theory and research.
There are very few works about explaining content-based recommendations of images in the artistic domain. Current works do not provide a perspective of the many variables involved in the user perception of several aspects of the system such as domain knowledge, relevance, explainability, and trust. In this paper, we aim to fill this gap by studying three interfaces, with different levels of explainability, for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability and relevance. Furthermore, our results show that the observed effects are also dependent on the underlying recommendation algorithm used. We tested two algorithms: Deep Neural Networks (DNN), which has high accuracy, and Attractiveness Visual Features (AVF) with high transparency but lower accuracy. Our results indicate that algorithms should not be studied in isolation, but rather in conjunction with interfaces, since both play a significant role in the perception of explainability and trust for image recommendation. Finally, using the framework by Knijnenburg et al., we provide a comprehensive model which synthesizes the effects between different variables involved in the user experience with explainable visual recommender systems of artistic images.
Conditional Cash Transfers (CCTs) have spread through the developing world in the past two decades. It is often assumed that CCTs enjoy political support in the population precisely because they impose conditions on beneficiaries. This article employs survey experiments in Brazil and Turkey to determine whether, and in what contexts, making government transfers conditional on the behavior of beneficiaries increases political support for the programs. Results show that conditional transfers are only marginally more popular than similar unconditional transfers in nationally representative samples, but that this difference is substantially larger among the better-off and among those primed to think of themselves as different from beneficiaries. These findings imply that conditionalities per se are not as strong a determinant of support for transfers as the literature suggests, but that they can still be helpful in building support for transfers among subsets of the population that are least likely to support them.
Tag clouds have been utilized as a “social” way to find and visualize information, providing both one-click access and a snapshot of the “aboutness” of a tagged collection. While many research projects have explored and compared various tag artifacts using information theory and simulations, fewer studies have been conducted to compare the effectiveness of different tag-based browsing interfaces from the user’s point of view. This research aims to investigate how users utilize tags in image search context and to what extent different organizations of tag browsing interfaces are useful for image search. We conducted two experiments to explore user behavior and performance with three interfaces: two tag-enabled interfaces (the regular and faceted tag-clouds) and a baseline (search-only) interface. Our results demonstrate the value of tags in the image search context, the role of tags in the exploratory search, and the strengths of two kinds of tag organization explored in this paper.
The Modified Mercalli intensity scale (Mercalli scale for short) is a qualitative measure used to express the perceived intensity of an earthquake in terms of damages. Accurate intensity reports are vital to estimate the type of emergency response required for a particular earthquake. In addition, Mercalli scale reports are needed to estimate the possible consequences of strong earthquakes in the future, based on the effects of previous events. Emergency offices and seismological agencies worldwide are in charge of producing Mercalli scale reports for each affected location after an earthquake. However, this task relies heavily on human observers in the affected locations, who are not always available or accurate. Consequently, Mercalli scale reports may take up to hours or even days to be published after an earthquake. We address this problem by proposing a method for early prediction of spatial Mercalli scale reports based on people’s reactions to earthquakes in social networks. By tracking users’ comments about real-time earthquakes, we create a collection of Mercalli scale point estimates at municipality (i.e., state subdivisions) level granularity. We introduce the concept of reinforced Mercalli support, which combines Mercalli scale point estimates with locally supported data (named ‘local support’). We use this concept to provide Mercalli scale estimates for real-world events by providing smooth point estimates using a spatial smoother that incorporates the distribution of municipalities in each affected region. Our method is the first method based on social media that can provide spatial reports of damages in the Mercalli intensity scale. Experimental results show that our method is accurate and provides early spatial Mercalli reports 30 minutes after an earthquake. Furthermore, we show that our method performs well for earthquake spatial detection and maximum intensity prediction tasks. Our findings indicate that social media is a valuable source of spatial information for quickly estimating earthquake damages.
We introduce a compressed data structure for the storage of free trajectories of moving objects that efficiently supports various spatio-temporal queries. Our structure, dubbed GraCT, stores the absolute positions of all the objects at regular time intervals (snapshots) using a k²-tree, which is a space- and time-efficient region quadtree. Positions between snapshots are represented as logs of relative movements and compressed using a grammar-based compressor. The non-terminals of this grammar are enhanced with MBR information to enable fast queries.
The GraCT structure of a dataset occupies less than the raw data compressed with a powerful traditional compressor. Further, instead of requiring full decompression to access the data like a traditional compressor, GraCT supports direct access to object trajectories or to their position at specific time instants, as well as spatial range and nearest-neighbor queries on time instants and/or time intervals.
Compared to traditional methods for storing and indexing spatio-temporal data, GraCT requires two orders of magnitude less space and is competitive in query times. In particular, thanks to its compressed representation, the GraCT structure may reside in main memory in situations where any classical uncompressed index must resort to disk, thereby being one or two orders of magnitude faster.
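A highly simplified sketch of the snapshot-plus-log layout underlying GraCT: absolute positions are kept at periodic snapshots, and intermediate positions are replayed from relative movements. The snapshot period, class names, and plain Python containers are illustrative; the actual structure compresses snapshots with k²-trees and logs with a grammar-based compressor.

```python
SNAPSHOT_PERIOD = 4  # a snapshot every 4 time instants (illustrative)

class TrajectoryStore:
    """Simplified GraCT-like layout: absolute positions at periodic snapshots,
    relative movements (dx, dy) in between."""

    def __init__(self):
        self.snapshots = {}  # (object_id, t) -> (x, y), t multiple of SNAPSHOT_PERIOD
        self.logs = {}       # object_id -> list of (dx, dy), one per time instant

    def add_trajectory(self, obj, positions):
        for t, (x, y) in enumerate(positions):
            if t % SNAPSHOT_PERIOD == 0:
                self.snapshots[(obj, t)] = (x, y)
        self.logs[obj] = [(x2 - x1, y2 - y1)
                          for (x1, y1), (x2, y2) in zip(positions, positions[1:])]

    def position_at(self, obj, t):
        # Start from the nearest previous snapshot and replay the log.
        t0 = (t // SNAPSHOT_PERIOD) * SNAPSHOT_PERIOD
        x, y = self.snapshots[(obj, t0)]
        for dx, dy in self.logs[obj][t0:t]:
            x, y = x + dx, y + dy
        return (x, y)

store = TrajectoryStore()
store.add_trajectory("bus42", [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)])
print(store.position_at("bus42", 5))  # (3, 2)
```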
A key aspect of visual question answering (VQA) models that are interpretable is their ability to ground their answers to relevant regions in the image. Current approaches with this capability rely on supervised learning and human annotated groundings to train attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specific for visual grounding is difficult and expensive. In this work, we demonstrate that we can effectively train a VQA architecture with grounding supervision that can be automatically obtained from available region descriptions and object annotations. We also show that our model trained with this mined supervision generates visual groundings that achieve a higher correlation with respect to manually-annotated groundings, meanwhile achieving state-of-the-art VQA accuracy.
Query processing is an important challenge for spatial databases due to the use of complex data types that represent spatial attributes. In particular, due to the cost of spatial joins, several optimization algorithms based on indexing structures exist. The work in this paper proposes a strategy for semantic query optimization of spatial join queries. The strategy detects queries with empty results and rewrites queries to eliminate unnecessary spatial joins or to replace spatial joins by thematic joins. This is done automatically by analyzing the semantics imposed by the database schema through topological dependencies and topological referential integrity constraints. In this way, the strategy complements current state-of-the-art algorithms for processing spatial join queries. The experimental evaluation with real data sets shows that the optimization strategy can achieve a decrease in the time cost of a join query using indexing structures in a spatial database management system (SDBMS).
People’s perceptions of the built environment influence the way they use and navigate it. Understanding these perceptions may be useful to inform the design, management and planning process of public spaces. Recently, several studies have used data collected at a massive scale and machine learning methods to quantify these perceptions, showing promising results in terms of predictive performance. Nevertheless, most of these models can be of little help in understanding users’ perceptions due to the difficulty associated with identifying the importance of each attribute of landscapes. In this work, we propose a novel approach to quantify perceptions of landscapes through discrete choice models, using semantic segmentations of images of public spaces, generated through machine learning algorithms, as explanatory variables. The proposed models are estimated using the Place Pulse dataset, with over 1.2 million perceptual indicators, and are able to provide useful insights into how users perceive the built environment as a function of its features. The models obtained are used to infer perceptual variables in the city of Santiago, Chile, and show they have a significant correlation with socioeconomic indicators.
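The modelling idea can be sketched as a binary logit over differences of semantic-segmentation features: each image is described by the share of pixels per class, and pairwise "which image is perceived as better?" answers are fitted to recover one interpretable weight per class. The class list, simulated answers, and taste vector below are invented for illustration; they are not the Place Pulse data or the paper's exact specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
CLASSES = ["sky", "greenery", "building", "car", "road"]

# Hypothetical per-image features: the share of pixels of each semantic class,
# as a segmentation model would produce (here random stand-ins).
def random_shares(n):
    raw = rng.random((n, len(CLASSES)))
    return raw / raw.sum(axis=1, keepdims=True)

left, right = random_shares(2000), random_shares(2000)

# Simulated pairwise answers generated from an assumed "true" taste vector.
true_beta = np.array([0.5, 2.0, -0.2, -1.5, -0.3])
utility_diff = (left - right) @ true_beta
choice_left = (utility_diff + rng.logistic(size=2000) > 0).astype(int)

# A binary logit on the feature differences recovers one weight per class,
# which is what makes the perception model interpretable.
model = LogisticRegression(fit_intercept=False).fit(left - right, choice_left)
for name, weight in zip(CLASSES, model.coef_[0]):
    print(f"{name:>9}: {weight:+.2f}")
```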
Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.
Link: https://doi.org/10.1016/j.is.2018.06.010
Link: https://openreview.net/pdf?id=HyGBdo0qFm
European politics at the turn of the 19th century saw a dramatic reduction in the number and diversity of polities as the territorial nation-state emerged as the dominant form of political organization. The transformation had a profound impact on the periphery. The study examines how embracing the principle of territoriality transformed relations between settler societies and indigenous peoples in South America. As this shift coincided with independence from Spain, Creole elites rapidly dismantled the remnants of imperial heteronomy, ending centuries of inter-cultural diplomacy. The study illustrates this shift in the case of the “Southern frontier,” where Spain had maintained a practice of treaty making with the Mapuche people since the mid-17th century. This long-standing practice broke down shortly after Chile gained independence in 1818. What followed was a policy of coercive assimilation through military conquest and forced displacement — a policy that settler societies implemented elsewhere in the 19th century. In contrast to explanations that emphasize the spread of capitalist agriculture and racist ideologies, this study argues that territoriality spelled the end of inter-cultural diplomacy along the “Southern frontier.”
This work presents and studies the efficiency problem of mapping GPU threads onto simplex domains. A non-linear map λ(ω) is formulated based on a block-space enumeration principle that reduces the number of thread-blocks by a factor of approximately 2× and 6× for 2-simplex and 3-simplex domains, respectively, when compared to the standard approach. Performance results show that λ(ω) is competitive and even the fastest map when run on recent GPU architectures such as the Tesla V100, where it reaches a speedup of up to 1.5× in 2-simplex tests. In 3-simplex tests, it reaches speedups of up to 2.3× for small workloads and up to 1.25× for larger ones. The results obtained make λ(ω) a useful GPU optimization technique with applications in parallel problems that define all-pairs, all-triplets or nearest-neighbor interactions in a 2-simplex or 3-simplex domain.
Link: https://doi.org/10.1109/TPDS.2018.2849705
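A common closed-form block-space map for a 2-simplex (lower-triangular) domain is shown below on the CPU for clarity: a linear block index is inverted through the triangular-number formula, so only N(N+1)/2 blocks are needed instead of N². This illustrates the general idea of such maps, not necessarily the exact λ(ω) formulation of the paper, which targets CUDA thread blocks.

```python
import math

def simplex2_map(k):
    """Map a linear block index k to (i, j) in the lower-triangular domain j <= i,
    using the inverse triangular-number formula. A GPU kernel would apply the
    same arithmetic to its block index."""
    i = (math.isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j

N = 5
needed_blocks = N * (N + 1) // 2          # 15 instead of N*N = 25 for the bounding box
pairs = [simplex2_map(k) for k in range(needed_blocks)]
assert sorted(pairs) == [(i, j) for i in range(N) for j in range(i + 1)]
print(pairs)
```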
Graphs are being increasingly adopted as a flexible data model in scenarios (e.g., Google’s Knowledge Graph, Facebook’s Graph API, Wikidata, etc.) where multiple editors are involved in content creation, where the schema is ever changing, where data are incomplete, and where the connectivity of resources plays a key role: scenarios where relational models traditionally struggle. But with this flexibility comes a conceptual cost: it can be difficult to summarise and understand, at a high level, the content that a given graph contains. Hence profiling graphs becomes of increasing importance to extract order, a posteriori, from the chaotic processes by which such graphs are generated. This talk will motivate the use of graphs as a data model, abstract recent trends in graph data management, and then turn to the issue of profiling and summarising graphs: what are the goals of such profiling, the principles by which graphs can be summarised, and the main techniques by which this can/could be achieved? The talk will emphasise the importance of profiling graphs while highlighting a variety of open research questions yet to be tackled.
Jan Van den Bussche, Marcelo Arenas: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018. ACM 2018.
This volume contains the proceedings of PODS 2018, which include a paper for the keynote addressed by Michael Benedikt (University of Oxford), abstracts based on two invited tutorials by Rajeev Raman (University of Leicester) and Arvind Narayanan (Princeton University), and 29 contributions that were selected by the Program Committee for presentation at the symposium.
In addition, this volume also contains papers from our two “Gems of PODS” speakers, Hung Ngo (Relational AI) and Phokion G. Kolaitis (UC Santa Cruz and IBM Research – Almaden). The Gems of PODS is an event, started in 2016, where the goal is to promote understanding of past seminal PODS results to the general audience. The Gems of PODS papers were selected by the Gems of PODS committee consisting of Marcelo Arenas (chair) (Pontificia Universidad Católica de Chile), Tova Milo (Tel Aviv University) and Dan Olteanu (Oxford University).
This year, PODS continued with two submission cycles that were introduced two years ago. The first cycle allowed for the possibility for papers to be revised and resubmitted. For the first cycle, 30 papers were submitted, 7 of which were directly selected for inclusion in the proceedings, and 5 were invited for a resubmission after a revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 4 out of 5 revised papers were selected for the proceedings. For the second cycle, 53 papers were submitted, 18 of which were selected, resulting in 29 papers selected overall from a total number of 83 submissions.
An important task for the Program Committee has been the selection of the PODS 2018 Best Paper Award. The committee selected the paper: “Entity Matching with Active Monotone Classification” by Yufei Tao. On behalf of the committee, we would like to extend our sincere congratulations to the author!
Since 2008, PODS has assigned the ACM PODS Alberto O. Mendelzon Test-of-Time Award to a paper, or a small number of papers, published in the PODS proceedings ten years prior that have had the most impact over the intervening decade. This year’s committee, consisting of Maurizio Lenzerini, Wim Martens and Nicole Schweikardt, selected the following paper: “The Chase Revisited” by Alin Deutsch, Alan Nash and Jeff Remmel.
Recent availability of data about writing processes at keystroke granularity has enabled research on the evolution of document writing. A natural task is to develop systems that can actually show this data, that is, user interfaces that transform the data of the writing process (today a black box) into intelligible forms. Along these lines, we propose a data structure that captures a document’s fine-grained history and an organic visualization that serves as an interface to it. We evaluate a proof-of-concept implementation of the system through a pilot study using documents written by students at a public university. Our results are promising and reveal facets such as the general strategies adopted, local edit density and the hierarchical structure of the final text.
Link: https://doi.org/10.1145/3172944.3173004
In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows us to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate the use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
Link: https://doi.org/10.1145/3178876.3186016
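A minimal sketch of the first step described above, assuming a toy triple set: it extracts each entity's characteristic set (the properties it uses), counts how many entities share each set, and derives set-containment edges. The paper's schema additionally carries per-property cardinalities and an algebra over the resulting lattices, which this sketch omits.

```python
from collections import Counter
from itertools import combinations

# Toy knowledge graph as (subject, property, object) triples.
triples = [
    ("alice", "name", "Alice"), ("alice", "birthPlace", "Berlin"),
    ("bob", "name", "Bob"), ("bob", "birthPlace", "Paris"), ("bob", "spouse", "eve"),
    ("eve", "name", "Eve"), ("eve", "spouse", "bob"),
    ("acme", "name", "ACME"), ("acme", "founded", "1990"),
]

# Characteristic set of an entity = the set of properties it uses.
char_sets = {}
for s, p, _ in triples:
    char_sets.setdefault(s, set()).add(p)

# Annotate each distinct characteristic set with how many entities share it.
lattice_nodes = Counter(frozenset(ps) for ps in char_sets.values())

# Set-containment edges induce the partial order used as a hierarchical schema.
edges = [(a, b) for a, b in combinations(lattice_nodes, 2) if a < b or b < a]

for node, count in lattice_nodes.items():
    print(sorted(node), "->", count, "entity/entities")
print(f"{len(edges)} containment relations")
```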
Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web is still residing in other formats, and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language that allows bringing data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the number of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also run a set of experiments that empirically confirm the optimality of our approach.
Graph data management concerns the research and development of powerful technologies for storing, processing and analyzing large volumes of graph data. This chapter presents an overview of the foundations and systems for graph data management. Specifically, we present a historical overview of the area, study graph database models, characterize essential graph-oriented queries, review graph query languages, and explore the features of current graph data management systems (i.e., graph databases and graph-processing frameworks).
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users’ trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments.
CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu–Tucker-shaped Wavelet-Tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50–70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1–1000 µs) of CTR.
Highlights: -We provide a representation for trips over networks and answer counting-based queries. -We adapt a Compressed Suffix Array to deal with the spatial component of trips. -We use a Wavelet Matrix or a Hu–Tucker-shaped Wavelet Tree for the temporal component. -Experiments show space needs of around 50% when compared with a plain representation. -Experiments show counting-based query times typically within 1–1000 µs.
The aim of this research is to measure, using smart card data, how much public transport users change their behavior over time. To quantify the change in behavior, we split the smart card records of a user into a set of separate time windows. Then, we measure the variation between each pair of time windows. Three algorithms that calculate the variation in users’ mobility are assessed. Using data from a Canadian transport system, we show that measuring the stability of user behavior at an individual level provides new insights for public transport operators, e.g., it can be used to measure users’ adaptability to changes in the transport system.
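One simple way to quantify the variation between two time windows, in the spirit of the algorithms assessed above, is to compare normalized histograms of (boarding stop, hour) pairs with a cosine distance. The profile definition, field names, and sample taps below are assumptions for the example, not the paper's exact algorithms or data.

```python
from collections import Counter
from math import sqrt

def window_profile(records):
    """Normalized histogram of (boarding stop, hour-of-day) pairs in a time window."""
    counts = Counter((r["stop"], r["hour"]) for r in records)
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

def variation(profile_a, profile_b):
    """Cosine distance between two window profiles: 0 = identical habits, 1 = disjoint."""
    keys = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(k, 0) * profile_b.get(k, 0) for k in keys)
    norm = sqrt(sum(v * v for v in profile_a.values())) * \
           sqrt(sum(v * v for v in profile_b.values()))
    return 1 - dot / norm if norm else 1.0

# Hypothetical smart card taps for one user, split into two monthly windows.
march = [{"stop": "S1", "hour": 8}] * 18 + [{"stop": "S7", "hour": 18}] * 17
april = [{"stop": "S1", "hour": 8}] * 10 + [{"stop": "S9", "hour": 9}] * 12
print(variation(window_profile(march), window_profile(april)))
```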
The professionalization of politics and the disappearance of party organizations based on activists seems an inescapable trend. This article shows, by studying the Broad Front of Uruguay as a deviant case, the relevance of organizational rules for explaining the reproduction of party activism. Using data from both an online survey of people differing in their levels of engagement with the Broad Front and in-depth interviews with party activists, we show that those with relatively low levels of engagement—“adherents”—and activists differ in their willingness to cooperate with the party and in the time they devote to party activities. Also, we find that reducing the perceived efficacy of political engagement strongly decreases activists’ self-reported willingness to engage with the party, while this reduction has no effect upon adherents. These findings suggest that the design of organizational rules that grant a political role to grassroots organizers can promote party activism.
DOI https://doi.org/10.1177/0010414018806528
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.
Highlights: -We consider the problem of representing temporal graphs in a compact way. -We can represent temporal graphs with contacts on the same edge that temporally overlap. -We design TGCSA and show how it solves typical temporal queries. -We create a novel representation of Ψ that improves the performance of TGCSA. -We obtain a reasonable space/time tradeoff even on complex temporal queries.
We present a demo of the GraFa faceted search and browsing interface over the Wikidata knowledge graph. We describe the key aspects of the interface, including the types of interactions that the system allows, the ranking schemes employed, and other features to aid usability. We also discuss future plans for improving the system. Online Demo: http://grafa.dcc.uchile.cl/
Link: http://ceur-ws.org/Vol-2180/paper-44.pdf
Docker is the most popular implementation of Operating System virtualization; currently, its online registry service (Docker Hub) stores more than 4.5 million software images. Using that registry, it is possible to download and deploy Docker images as software containers. However, these images only show information about the main software, hiding the dependencies needed to run it. To allow users to track what they deploy into their machines, we developed DockerPedia, a resource that publishes information about the packages within Docker images as Linked Data.
Currently, our resource includes 28% of the most downloaded images from Docker Hub, providing information about their software dependencies and vulnerabilities. This allows users to easily reproduce the environment in which each image was deployed, as well as to check the security of an image without the need to download it.
Link: http://ceur-ws.org/Vol-2180/paper-47.pdf
We demonstrate a system to canonicalise (aka. normalise) SPARQL queries for use-cases such as caching, query log analysis, query minimisation, signing queries, etc. Our canonicalisation method deterministically rewrites a given input query to an equivalent canonical form such that the results for two queries are syntactically (string) equal if and only if they give the same results on any database, modulo variable names. The method is sound and complete for a monotone fragment of SPARQL with selection (equalities), projection, join and union under both set and bag semantics. Considering other SPARQL features (e.g., optional, filter, graph, etc.), the underlying equivalence problem becomes undecidable, for which we currently support a best-effort canonicalisation covering the other SPARQL 1.0 features. We demonstrate a prototype of our canonicalisation framework, provide example rewritings, and discuss limitations, use-cases and future work. Demo link: http://qcan.dcc.uchile.cl
One of the challenges of recent RDF-based applications is managing data quality [1], and several systems already provide RDF validation procedures (e.g., https://www.stardog.com/docs/, https://www.topquadrant.com/technology/shacl/). This created the need for a standardized declarative constraint language for RDF, and for mechanisms to detect violations of such constraints. An important step in this direction is SHACL, or Shapes Constraint Language (https://www.w3.org/TR/shacl/), which became a W3C recommendation in 2017. The SHACL specification however leaves explicitly undefined the validation of recursive constraints. In a previous article [2], we showed that extending the specification’s semantics to accommodate recursion leads to intractability (in the size of the graph) for the so-called “core constraint components” of SHACL. This result holds for stratified constraints already, which may come as a surprise, considering that stratification guarantees tractability in well-studied recursive languages such as Datalog. Our previous work identified a tractable fragment of SHACL’s core components. In this paper, we propose an alternative approach to gain tractability, retaining all SHACL operators, but strengthening the stratification condition traditionally used in logic programming. More exactly, we introduce a syntactic condition on shape constraints called “strict stratification”, which guarantees that graph validation is in PTIME in combined (i.e. graph and constraints) complexity. We also describe a procedure to perform such validation. The current paper is not self-contained, due to space limitations, but all definitions can be found in our previous article [2] or its online extended version [3].
Entity Linking (EL) associates the entities mentioned in a given input text with their corresponding knowledge-base (KB) entries. A recent EL trend is towards multilingual approaches. However, one may ask: are multilingual EL approaches necessary with recent advancements in machine translation? Could we not simply focus on supporting one language in the EL system and translate the input text to that language? We present experiments along these lines comparing multilingual EL systems with their results over machine translated text.
Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined task of providing semantic segmentation for different actor classes simultaneously with their action class remains a challenging but necessary task for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning in the video to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying our approach improves performance for zero-shot recognition, indicating generalizability of our jointly learned feature space.
Link: https://doi.org/10.1007/978-3-030-01225-0_43
Explaining automatic recommendations is an active area of research since it has shown an important effect on users’ acceptance of the items recommended. However, there is a lack of research in explaining content-based recommendations of images based on visual features. In this paper, we aim to fill this gap by testing three different interfaces (one baseline and two novel explanation interfaces) for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability, relevance, and diversity. Furthermore, our experiments show that the results are also dependent on the underlying recommendation algorithm used. We tested the interfaces with two algorithms: Deep Neural Networks (DNN), with high accuracy but with difficult-to-explain features, and the more explainable method based on Attractiveness Visual Features (AVF). The better the accuracy performance (in our case, the DNN method), the stronger the positive effect of the explainable interface. Notably, the explainable features of the AVF method increased the perception of explainability but did not increase the perception of trust, unlike DNN, which improved both dimensions. These results indicate that algorithms in conjunction with interfaces play a significant role in the perception of explainability and trust for image recommendation. We plan to further investigate the relationship between interface explainability and algorithmic performance in recommender systems.
Representing the trajectories of mobile objects is a hot topic, given the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains), where users’ trips can be seen as sequences of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey.
To address data management for transportation systems, we designed a conceptual model that gave us better insight into this data domain and allowed us to define relevant terms and detect sources of redundancy among the data. Then, we designed two compact representations focused on users’ trips (TTCTR) and on vehicle trips (AcumM), respectively. Each approach has its own strengths and is able to answer certain queries efficiently.
We include experimental results over synthetic trips generated from accurate schedules obtained from a real network description (from the bus transportation system of Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with corresponding entities in a given knowledge base. While traditional EL approaches have largely focused on English texts, current trends are towards language-agnostic or otherwise multilingual approaches that can perform EL over texts in many languages. One of the obstacles to ongoing research on multilingual EL is a scarcity of annotated datasets with the same text in different languages. In this work we thus propose VoxEL: a manually-annotated gold standard for multilingual EL featuring the same text expressed in five European languages. We first motivate and describe the VoxEL dataset, using it to compare the behaviour of state-of-the-art (multilingual) EL systems for five different languages, contrasting these results with those obtained using machine translation to English. Overall, our results identify how five state-of-the-art multilingual EL systems compare for various languages, how the results of different languages compare, and further suggest that machine translation of input text to English is now a competitive alternative to dedicated multilingual EL configurations.
Timely detection and accurate description of extreme events, such as natural disasters and other crisis situations, are crucial for emergency management and mitigation. Extreme-event detection is challenging, since one has to rely upon reports from human observers appointed to specific geographical areas, or on an expensive and sophisticated infrastructure. In the case of earthquakes, geographically dense sensor networks are expensive to deploy and maintain. Therefore, only some regions, or even countries, are able to acquire useful information about the effects of earthquakes in their own territory. An inexpensive and viable alternative to this problem is to detect extreme real-world events through people’s reactions in online social networks. In particular, Twitter has gained popularity within the scientific community for providing access to real-time “citizen sensor” activity. Nevertheless, the massive amount of messages in the Twitter stream, along with the noise it contains, poses a number of difficulties for Twitter-based event detection. We contribute to address these challenges by proposing an online method for detecting unusual bursts in discrete-time signals extracted from Twitter. This method only requires a one-off semi-supervised initialization and can be scaled to track multiple signals in a robust manner. We also show empirically how our proposed approach, which was envisioned for generic event detection, can be adapted for worldwide earthquake detection, where we compare the proposed model to the state of the art for earthquake tracking using social media. Experimental results validate our approach as a competitive alternative in terms of precision and recall to leading solutions, with the advantage of implementation simplicity and worldwide scalability.
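The core idea of flagging unusual bursts in a discrete-time signal can be sketched with a robust rolling z-score over per-minute counts, as below; the window, threshold, and synthetic Poisson signal are illustrative choices and do not reproduce the semi-supervised method proposed in the paper.

```python
import numpy as np

def detect_bursts(counts, window=60, threshold=4.0):
    """Flag time bins whose count deviates from the recent baseline by more than
    `threshold` robust z-scores (median/MAD over a sliding window)."""
    counts = np.asarray(counts, dtype=float)
    bursts = []
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        median = np.median(history)
        mad = np.median(np.abs(history - median)) or 1.0  # robust spread, avoid /0
        z = (counts[t] - median) / (1.4826 * mad)
        if z > threshold:
            bursts.append((t, z))
    return bursts

rng = np.random.default_rng(3)
signal = rng.poisson(5, 300)             # background chatter (tweets per minute)
signal[220:230] += rng.poisson(60, 10)   # a sudden earthquake-like burst
print(detect_bursts(signal))
```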
Faceted browsing has become a popular paradigm for user interfaces on the Web and has also been investigated in the context of RDF graphs. However, current faceted browsers for RDF graphs encounter performance issues when faced with two challenges: scale, where large datasets generate many results, and heterogeneity, where large numbers of properties and classes generate many facets. To address these challenges, we propose GraFa: a faceted browsing system for heterogeneous large-scale RDF graphs based on a materialisation strategy that performs an offline analysis of the input graph in order to identify a subset of the exponential number of possible facet combinations that are candidates for indexing. In experiments over Wikidata, we demonstrate that materialisation allows for displaying (exact) faceted views over millions of diverse results in under a second while keeping index sizes relatively small. We also present initial usability studies over GraFa.
A time series is a common and well-known way of describing temporal data. However, most of the state-of-the-art techniques for analysing time series have focused on generating a representation for a single level of resolution. To analyse a time series at several levels of resolution, one would need to compute different representations, one for each resolution level. We introduce a multi-resolution representation for time series based on local trends and mean values. Our technique takes the level of resolution as a parameter, but it can be computed automatically if we consider the maximum resolution of the time series. Our technique represents a time series using trend-value pairs on each segment belonging to a resolution level. To provide a useful representation for data mining tasks, we also propose dissimilarity measures and a symbolic representation based on the SAX technique to enable efficient similarity search using a multi-resolution indexing scheme. We evaluate our method on classification and discord discovery tasks over a variety of data domains, achieving better performance in terms of efficiency and effectiveness than some of the best-known classic techniques. Indeed, for some of the experiments, the time series mining algorithms using our multi-resolution representation were an order of magnitude faster, in terms of distance computations, than the state of the art.
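To make the representation concrete, the following sketch (ours, with illustrative names, not the paper's implementation) summarises each segment at a given resolution level by the slope of a fitted line (the trend) and the segment mean (the value):

```python
import numpy as np

def trend_value_representation(series, level):
    """Summarise a time series at one resolution level.

    The series is split into 2**level equal-length segments; each segment is
    reduced to a (trend, value) pair, where `value` is the segment mean and
    `trend` is the slope of a least-squares line fitted to the segment.
    Illustrative sketch of the trend/mean idea, not the paper's construction.
    """
    series = np.asarray(series, dtype=float)
    segments = np.array_split(series, 2 ** level)
    pairs = []
    for seg in segments:
        t = np.arange(len(seg))
        slope = np.polyfit(t, seg, 1)[0] if len(seg) > 1 else 0.0
        pairs.append((slope, seg.mean()))
    return pairs

def multi_resolution(series, max_level):
    """Representations for all levels 0..max_level."""
    return {lvl: trend_value_representation(series, lvl)
            for lvl in range(max_level + 1)}

if __name__ == "__main__":
    ts = np.sin(np.linspace(0, 6 * np.pi, 64)) + 0.1 * np.random.randn(64)
    rep = multi_resolution(ts, 3)
    print(rep[2])   # four (trend, mean) pairs at resolution level 2
```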
Visual embeddings from Convolutional Neural Networks (CNN) trained on the ImageNet dataset for the ILSVRC challenge have shown consistently good performance for transfer learning and are widely used in several tasks, including image recommendation. However, some important questions have not yet been answered in order to use these embeddings for a larger scope of recommendation domains: a) Are CNNs that perform better on ImageNet also better for transfer learning in content-based image recommendation?, b) Does fine-tuning help to improve performance?, and c) Which is the best way to perform the fine-tuning?
In this paper we compare several CNN models pre-trained with ImageNet to evaluate their transfer learning performance to an artwork image recommendation task. Our results indicate that models with better performance in the ImageNet challenge do not always imply better transfer learning for recommendation tasks (e.g. NASNet vs. ResNet). Our results also show that fine-tuning can be helpful even with a small dataset, but not every fine-tuning works. Our results can inform other researchers and practitioners on how to train their CNNs for better transfer learning towards image recommendation systems.
Link: https://arxiv.org/abs/1807.09870
The progression of action theories is an important problem in knowledge representation. Progression is second-order definable and known to be first-order definable and effectively computable for restricted classes of theories. Motivated by the fact that universal theories with constants (UTCs) are expressive and natural theories whose satisfiability is decidable, in this paper we provide a thorough study of the progression of situation calculus UTCs. First, we prove that progression of a (possibly infinite) UTC is always first-order definable and results in a UTC. Though first-order definable, we show that the progression of a UTC may be infeasible, that is, it may result in an infinite UTC that is not equivalent to any finite set of first-order sentences. We then show that deciding whether there is a feasible progression of a UTC is undecidable. Moreover, we show that deciding whether a sentence (in an expressive fragment of first-order logic) is in the progression of a UTC is coNEXPTIME-complete, and that there exists a family of UTCs for which the size of every feasible progression grows exponentially. Finally, we discuss resolution-based approaches to compute the progression of a UTC. This comprehensive analysis contributes to a better understanding of progression in action theories, both in terms of feasibility and difficulty.
Link: https://aaai.org/ocs/index.php/KR/KR18/paper/view/18074/17173
Blank nodes in RDF graphs can be used to represent values known to exist but whose identity remains unknown. A prominent example of such usage can be found in the Wikidata dataset where, e.g., the author of Beowulf is given as a blank node. However, while SPARQL considers blank nodes in a query as existentials, it treats blank nodes in RDF data more like constants. Running SPARQL queries over datasets with unknown values may thus lead to counter-intuitive results, which may make the standard SPARQL semantics unsuitable for datasets with existential blank nodes. We thus explore the feasibility of an alternative SPARQL semantics based on certain answers. In order to estimate the performance costs that would be associated with such a change in semantics for current implementations, we adapt and evaluate approximation techniques proposed in a relational database setting for a core fragment of SPARQL. To further understand the impact that such a change in semantics may have on query solutions, we analyse how this new semantics would affect the results of user queries over Wikidata.
Caching in the context of expressive query languages such as SPARQL is complicated by the difficulty of detecting equivalent queries: deciding if two conjunctive queries are equivalent is NP-complete, while adding further query features makes the problem undecidable. Despite this complexity, in this paper we propose an algorithm that performs syntactic canonicalisation of SPARQL queries such that the canonicalised query returns the same answers as the original. We can guarantee that the canonicalisations of two queries within a core fragment of SPARQL (monotone queries with select, project, join and union) are equal if and only if the two queries are equivalent; we also support other SPARQL features, but with a weaker soundness guarantee: that the (partially) canonicalised query is equivalent to the input query. Despite the fact that canonicalisation must be harder than the equivalence problem, we show the algorithm to be practical for real-world queries taken from SPARQL endpoint logs, and further show that it detects more equivalent queries than purely syntactic methods. We also present the results of experiments over synthetic queries designed to stress-test the canonicalisation method, highlighting difficult cases.
Link: https://doi.org/10.1007/978-3-030-00671-6_35
With the popularity of RDF as an independent data model came the need for specifying constraints on RDF graphs, and for mechanisms to detect violations of such constraints. One of the most promising schema languages for RDF is SHACL, a recent W3C recommendation. Unfortunately, the specification of SHACL leaves open the problem of validation against recursive constraints. This omission is important because SHACL by design favors constraints that reference other ones, which in practice may easily yield reference cycles.
In this paper, we propose a concise formal semantics for the so-called “core constraint components” of SHACL. This semantics handles arbitrary recursion, while being compliant with the current standard. Graph validation is based on the existence of an assignment of SHACL “shapes” to nodes in the graph under validation, stating which shapes are verified or violated, while verifying the targets of the validation process. We show in particular that the design of SHACL forces us to consider cases in which these assignments are partial, or, in other words, where the truth value of a constraint at some nodes of a graph may be left unknown.
Dealing with recursion also comes at a price, as validating an RDF graph against SHACL constraints is NP-hard in the size of the graph, and this lower bound still holds for constraints with stratified negation. Therefore we also propose a tractable approximation to the validation problem.
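To give a flavour of the assignment-based semantics, the toy sketch below validates a single, purely positive recursive shape (every :knows value must itself conform to the shape) by computing a greatest fixpoint over the candidate conformant nodes. It ignores negation, partial assignments, and most other SHACL features; all names are illustrative.

```python
# Toy graph: node -> set of nodes it :knows, plus a name-property flag.
graph_knows = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": {"alice"},      # a reference cycle
    "dave": {"eve"},
    "eve": set(),
}
has_name = {"alice", "bob", "carol", "dave"}   # eve lacks a name

def conforms_locally(node, conformant):
    """PersonShape (illustrative): the node has a name and every :knows
    value is (assumed to be) conformant to PersonShape itself."""
    return node in has_name and graph_knows[node] <= conformant

def validate_positive_recursive_shape(targets):
    """Greatest-fixpoint check: start by assuming every node conforms,
    repeatedly drop nodes that violate the constraint, then test the targets."""
    conformant = set(graph_knows)
    changed = True
    while changed:
        changed = False
        for node in list(conformant):
            if not conforms_locally(node, conformant):
                conformant.discard(node)
                changed = True
    return all(t in conformant for t in targets), conformant

ok, assignment = validate_positive_recursive_shape(["alice", "dave"])
print(ok)          # False: dave knows eve, who violates the shape
print(assignment)  # {'alice', 'bob', 'carol'}: the cycle is fine under the fixpoint
```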
Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top-level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the interval-intersection problem, but it has been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure on the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.
Link: https://doi.org/10.1007/978-3-030-00479-8_28
Cost register automata (CRA) and their subclass, copyless CRA, were recently proposed by Alur et al. as a new model for computing functions over strings. We study some structural properties, expressiveness, and closure properties of copyless CRA. We show that copyless CRA are strictly less expressive than weighted automata and are not closed under the reverse operation. In search of a better class, we impose restrictions on copyless CRA, which leads to a new robust computational model that is closed under reverse and other extensions.
We present the first solution to τ-majorities on tree paths. Given a tree of n nodes, each with a label from [1..σ], and a fixed threshold 0 < τ < 1, such a query gives two nodes u and v and asks for all the labels that appear more than τ·|P_uv| times in the path P_uv from u to v, where |P_uv| denotes the number of nodes in P_uv. Note that the answer to any query is of size up to 1/τ. On a w-bit RAM, we obtain a linear-space data structure with O((1/τ) log* n log log_w σ) query time. For any κ > 1, we can also build a structure that uses O(n log^[κ] n) space, where log^[κ] n denotes the function that applies the logarithm κ times to n, and answers queries in time O((1/τ) log log_w σ). The construction time of both structures is O(n log n). We also describe two succinct-space solutions with the same query time as the linear-space structure. One uses 2nH + 4n + o(n)(H+1) bits, where H ≤ lg σ is the entropy of the label distribution, and can be built in O(n log n) time. The other uses nH + O(n) + o(nH) bits and is built in O(n log n) time w.h.p.
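The query itself is easy to state with a brute-force baseline, sketched below in Python with illustrative names; the structures described above answer the same query much faster and within compressed space, without materialising the path.

```python
from collections import Counter

class Tree:
    """Rooted labelled tree with a brute-force path tau-majority query.
    Only a baseline to illustrate the problem statement."""
    def __init__(self, parent, label):
        self.parent = parent            # parent[i] = parent of node i, -1 for the root
        self.label = label              # label[i] in [1..sigma]
        self.depth = [0] * len(parent)
        for v in range(len(parent)):
            d, u = 0, v
            while parent[u] != -1:
                u = parent[u]
                d += 1
            self.depth[v] = d

    def path(self, u, v):
        """Nodes on the path from u to v (inclusive), via a naive LCA walk."""
        left, right = [], []
        while u != v:
            if self.depth[u] >= self.depth[v]:
                left.append(u)
                u = self.parent[u]
            else:
                right.append(v)
                v = self.parent[v]
        return left + [u] + right[::-1]

    def tau_majorities(self, u, v, tau):
        p = self.path(u, v)
        counts = Counter(self.label[x] for x in p)
        return [lab for lab, c in counts.items() if c > tau * len(p)]

#        0
#      /   \
#     1     2
#    / \     \
#   3   4     5
t = Tree(parent=[-1, 0, 0, 1, 1, 2], label=[1, 2, 1, 2, 2, 1])
print(t.tau_majorities(3, 5, tau=0.4))   # labels occurring in > 40% of the path
```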
An important aspect of exploratory search over graph data is to understand what paths connect a given pair of nodes. Since the resulting paths can be manifold, various works propose ranking paths likely to be of interest to a user; these methods often rely on enumerating all such paths (up to a fixed length or number) before ranking is applied. In this paper, we instead propose applying a shortest path search on weighted versions of the graph in order to directly compute the most relevant path(s) between two nodes without fixed-length bounds, further obviating the need to enumerate irrelevant paths. We investigate weightings based on node degree, PageRank and edge frequency, contrasting the paths produced by these schemes over the Wikidata graph and discussing performance issues. Finally we conduct a user study over Wikidata where evaluators assess the quality of the paths produced; though inter-rater consensus on which paths are of most interest is low, we achieve statistically significant results to suggest that users find the weighted shortest paths more interesting than the baseline shortest paths without weights.
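A minimal sketch of the idea, assuming a toy undirected graph and a degree-based weighting (one of the schemes contrasted in the paper): edges entering high-degree hub nodes become expensive, so Dijkstra's algorithm prefers paths that avoid hubs.

```python
import heapq
from collections import defaultdict

def degree_weighted_shortest_path(edges, source, target):
    """Dijkstra over an undirected graph where traversing an edge costs the
    degree of the node being entered, so paths through hubs are penalised.
    Illustrative sketch; the paper also studies PageRank- and
    frequency-based weightings."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist, prev = {source: 0}, {}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + len(adj[v])          # cost = degree of the entered node
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(queue, (nd, v))
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    return [source] + path[::-1]

edges = [("A", "Hub"), ("Hub", "B"), ("A", "X"), ("X", "Y"), ("Y", "B"),
         ("Hub", "C"), ("Hub", "D"), ("Hub", "E")]
print(degree_weighted_shortest_path(edges, "A", "B"))  # prefers A-X-Y-B over the hub
```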
Experiment reproducibility is the ability to run an experiment with the introduction of changes to it and to obtain results that are consistent with the original ones. To enable reproducibility, the scientific community encourages researchers to publish descriptions of these experiments. However, these recommendations do not include an automated way of creating such descriptions: normally, scientists have to annotate their experiments in a semi-automated way. In this paper we propose a system to automatically describe the computational environments used in in-silico experiments. We propose to use Operating System (OS) virtualization (containerization) to distribute software experiments as software images, together with an annotation system that allows these software images to be described. The images are a minimal version of an OS (container) that allows the deployment of multiple isolated software packages within it.
Conjunctive queries (CQs) fail to provide an answer when the pattern described by the query does not exactly match the data. CQs might thus be too restrictive as a querying mechanism when data is semistructured or incomplete. The semantic web therefore provides a formalism—known as (projected) well-designed pattern trees (pWDPTs)—that tackles this problem: pWDPTs allow us to formulate queries that match parts of the query over the data if available, but do not ignore answers of the remaining query otherwise. Here we abstract away the specifics of semantic web applications and study pWDPTs over arbitrary relational schemas. Since the language of pWDPTs subsumes CQs, their evaluation problem is intractable. We identify structural properties of pWDPTs that lead to (fixed-parameter) tractability of various variants of the evaluation problem. We also show that checking if a pWDPT is equivalent to one in our tractable class is in 2EXPTIME. As a corollary, we obtain fixed-parameter tractability of evaluation for pWDPTs with such good behavior. Our techniques also allow us to develop a theory of approximations for pWDPTs.
Link: https://doi.org/10.1145/3233983
We present three pumping lemmas for three classes of functions definable by fragments of weighted automata over the min-plus semiring and the semiring of natural numbers. As a corollary we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, polynomially-ambiguous weighted automata, and the full class of weighted automata is strict for the min-plus semiring.
Link: http://drops.dagstuhl.de/opus/volltexte/2018/8498/pdf/LIPIcs-STACS-2018-50.pdf
Domains for spatial and temporal data are often multigranular in nature, possessing a natural order structure defined by spatial inclusion and time-interval inclusion, respectively. This order structure induces lattice-like (partial) operations, such as join, which in turn lead to join rules, in which a single domain element (granule) is asserted to be equal to, or contained in, the join of a set of such granules. In general, the efficient representation of such join rules is a difficult problem. However, there is a very effective representation in the case that the rule is bigranular; i.e., all of the joined elements belong to the same granularity, and, in addition, complete information about the (non)disjointness of all granules involved is known. The details of that representation form the focus of the paper.
Link: https://doi.org/10.1007/978-3-319-98809-2_23
Using Chile as a case study for understanding tactical distribution under extensive controls on expenditure, this paper examines whether political motives affect the allocation of funds from the central government to localities. Collecting local-level data of two infrastructure funding programs and using the voting gap percentage between the coalition candidate and opposition competitors in a Sharp Regression Discontinuity methodology, we find causal evidence in favor of three hypotheses: (i) a coalition criterion influences the funding allocation to the local level; (ii) an electoral cycle exists in local funding; and (iii) the degree of coalition targeting varies based on a locality’s history of coalition alignment. In sum, the central government regards politically aligned mayors as valuable electoral assets, especially in municipalities historically aligned with the coalition.
Ontology-based data access (OBDA) is a successful application of knowledge representation and reasoning technologies in information management systems. One premier goal is to facilitate access to data that is heterogeneous and incomplete. This is achieved via an ontology that enriches the user query, typically a union of conjunctive queries, with domain knowledge. It turned out that the ontology and the user query can be seen as two components of one composite query, called ontology-mediated query (OMQ).
The problem of answering OMQs is thus central to OBDA. There is a consensus that the required level of scalability in OMQ answering can be achieved by using standard database management systems. To this end, a standard approach used nowadays is query rewriting: the ontology O and the database query q are combined into a new query qO, the so-called rewriting, which gives the same answer as the OMQ consisting of O and q over all input databases. It is of course essential that the rewriting qO is expressed in a language that can be handled by standard database systems. The typical language that is considered is the class of first-order (FO) queries.
In this work, we focus on two central OMQ languages based on guarded and frontier-guarded tuple-generating dependencies (TGDs), and we study the problem of whether an OMQ is FO-rewritable, i.e., whether it can be equivalently expressed as a first-order query. Recall that a guarded (resp., frontier-guarded) TGD is a sentence of the form ∀x̄∀ȳ (φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄)), where φ and ψ are conjunctions of relational atoms, and φ has an atom that contains all the variables of x̄ ∪ ȳ (resp., of x̄) [1, 8]. Our goal is to develop specially tailored techniques that allow us to understand the above non-trivial problem, and also to pinpoint its computational complexity. To this end, as we discuss below, we follow two different approaches. Our results can be summarized as follows:
- We first focus on the simpler OMQ language based on guarded TGDs and atomic queries, and, in Section 2, we provide a characterization of FO-rewritability that forms the basis for applying tree automata techniques.
- We then exploit, in Section 3, standard two-way alternating parity tree automata. In particular, we reduce our problem to the problem of checking the finiteness of the language of an automaton. The reduction relies on a refined version of the characterization of FO-rewritability established in Section 2. This provides a transparent solution to our problem based on standard tools, but it does not lead to an optimal result.
- Towards an optimal result, we use, in Section 4, a more sophisticated automata model, known as cost automata. In particular, we reduce our problem to the problem of checking the boundedness of a cost automaton. This allows us to show that FO-rewritability for OMQs based on guarded TGDs and atomic queries is in 2EXPTIME, and in EXPTIME for predicates of bounded arity. The complexity analysis relies on an intricate result on the boundedness problem for a certain class of cost automata [5, 9].
- Finally, in Section 5, by using the results of Section 4, we provide a complete picture for the complexity of our problem, i.e., deciding whether an OMQ based on (frontier-)guarded TGDs and arbitrary (unions of) conjunctive queries is FO-rewritable.
Link: http://ceur-ws.org/Vol-2100/paper8.pdf
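As a small illustration of the definitions above (ours, not taken from the paper), consider a guarded TGD over a toy schema together with an atomic query whose ontology-mediated version admits a first-order rewriting:

```latex
% A guarded TGD: the single body atom Manages(x,y) contains both universally
% quantified variables, so it acts as the guard.
\[
  \sigma \;=\; \forall x\,\forall y\,\bigl(\mathrm{Manages}(x,y)\rightarrow
               \exists z\,\mathrm{WorksIn}(y,z)\bigr)
\]
% Atomic query asking who works in some department:
\[
  q(y) \;=\; \exists z\,\mathrm{WorksIn}(y,z)
\]
% The OMQ consisting of {sigma} and q is FO-rewritable: its certain answers over
% any database coincide with the answers of the plain first-order query
\[
  q_{\sigma}(y) \;=\; \exists z\,\mathrm{WorksIn}(y,z)\;\vee\;\exists x\,\mathrm{Manages}(x,y).
\]
```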
We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2EXPTIME-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest.
Link: https://doi.org/10.24963/ijcai.2018/236
Automatic synthesis of software from specification is one of the classic problems in computer science. In the last decade, significant advances have been made in the synthesis of programs from specifications expressed in Linear Temporal Logic (LTL). LTL synthesis technology is central to a myriad of applications from the automated generation of controllers for Internet of Things devices, to the synthesis of control software for robotic applications. Unfortunately, the number of existing tools for LTL synthesis is limited, and using them requires specialized expertise. In this paper we present SynKit, a tool that offers LTL synthesis as a service. SynKit integrates a RESTful API and a web service with an editor, a solver, and a strategy visualizer.
Link: https://doi.org/10.24963/ijcai.2018/848
This paper provides an overview of a model for capturing properties of client-server-based query computation setups. This model can be used to formally analyze different combinations of client and server capabilities, and compare them in terms of various fine-grain complexity measures. While the motivations and the focus of the presented work are related to querying the Semantic Web, the main concepts of the model are general enough to be applied in other contexts as well.
In this paper, we address the problem of LTL realizability and synthesis. State-of-the-art techniques rely on so-called bounded synthesis methods, which reduce the problem to a safety game. Realizability is determined by solving synthesis in a dual game. We provide a unified view of duality, and introduce novel bounded realizability methods via reductions to reachability games. Further, we introduce algorithms, based on AI automated planning, to solve these safety and reachability games. This is the first complete approach to LTL realizability and synthesis via automated planning. Experiments illustrate that reductions to reachability games are an alternative to reductions to safety games, and show that planning can be a competitive approach to LTL realizability and synthesis.
Link: https://doi.org/10.24963/ijcai.2018/651
Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. If a CQ is hard to evaluate, it is thus useful to evaluate an approximation of it in such fragments. While underapproximations (i.e., those that return correct answers only) are well-understood, the dual notion of overapproximations that return complete (but not necessarily sound) answers, and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations, is open. We develop a connection with existential pebble game tools that allows the systematic study of such problems. In particular, we show that the evaluation and identification of overapproximations can be solved in polynomial time. We also make progress in the problem of existence of overapproximations, showing it to be decidable in 2EXPTIME over the class of acyclic CQs. Furthermore, we look at when overapproximations do not exist, suggesting that this can be alleviated by using a more liberal notion of overapproximation. We also show how to extend our tools to study symmetric difference approximations. We observe that such approximations properly extend under- and over-approximations, settle the complexity of its associated identification problem, and provide several results on existence and evaluation.
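As a small illustration (ours, not from the paper) of an acyclic overapproximation: the Boolean triangle query is cyclic, but dropping one atom yields an acyclic query that contains it:

```latex
% Cyclic Boolean CQ asking whether the graph contains a directed triangle:
\[
  q \;=\; \exists x\,\exists y\,\exists z\;\bigl(E(x,y)\wedge E(y,z)\wedge E(z,x)\bigr)
\]
% Acyclic CQ obtained by dropping the atom that closes the cycle:
\[
  q' \;=\; \exists x\,\exists y\,\exists z\;\bigl(E(x,y)\wedge E(y,z)\bigr)
\]
% Every database with a triangle also has a path of length two, so q is contained
% in q': q' is an acyclic (hence tractable) overapproximation of q. It is not
% sound: a triangle-free graph containing a single 2-path satisfies q' but not q.
```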
We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.
Link: https://doi.org/10.1145/3183713.3190654
LTL synthesis is the task of generating a strategy that satisfies a Linear Temporal Logic (LTL) specification interpreted over infinite traces. In this paper we examine the problem of LTLf synthesis, a variant of LTL synthesis where the specification of the behaviour of the strategy we generate is interpreted over finite traces — similar to the assumption we make in many planning problems, and important for the synthesis of business processes and other system interactions of finite duration. Existing approaches to LTLf synthesis transform LTLf into deterministic finite-state automata (DFA) and reduce the synthesis problem to a DFA game. Unfortunately, the DFA transformation is worst-case double-exponential in the size of the formula, presenting a computational bottleneck. In contrast, our approach exploits non-deterministic automata, and we reduce the synthesis problem to a non-deterministic planning problem. We leverage our approach not only for strategy generation but also to generate certificates of unrealizability — the first such method for LTLf. We employ a battery of techniques that exploit the structure of the LTLf specification to improve the efficiency of our transformation to automata. We combine these techniques with lazy determinization of automata and on-the-fly state abstraction. We illustrate the effectiveness of our approach on a set of established LTL synthesis benchmarks adapted to finite LTL.
Link: https://aaai.org/ocs/index.php/ICAPS/ICAPS18/paper/view/17790
Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. We study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand.
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant-delay enumeration of a spanner’s output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
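The variable-binding behaviour of such spanners can be approximated with ordinary named capture groups, as in the sketch below; this only illustrates what is extracted (spans stored in variables), not the constant-delay enumeration algorithm of the paper.

```python
import re

# A regex-formula-like extractor with two capture variables, `name` and `phone`.
# Each match binds the variables to spans (start, end offsets) of the document.
pattern = re.compile(r"(?P<name>[A-Z][a-z]+)\s*:\s*(?P<phone>\d{3}-\d{4})")

document = "Ana: 555-1234, Bob: 555-9876, note: n/a, Eva: 555-0000"

def extract_spans(doc):
    """Enumerate variable assignments (as character spans) one by one.
    Python's scanner yields them lazily, loosely mirroring the idea of
    enumerating a spanner's output with little precomputation."""
    for m in pattern.finditer(doc):
        yield {var: m.span(var) for var in ("name", "phone")}

for assignment in extract_spans(document):
    print(assignment,
          {v: document[s:e] for v, (s, e) in assignment.items()})
```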
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information. We extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment.
Since its invention, the Web has evolved into the largest multimedia repository that has ever existed. This evolution is a direct result of the explosion of user-generated content, explained by the wide adoption of social network platforms. The vast amount of multimedia content requires effective management and retrieval techniques. Nevertheless, Web multimedia retrieval is a complex task because users commonly express their information needs in semantic terms, but expect multimedia content in return. This dissociation between semantics and content of multimedia is known as the semantic gap. To solve this, researchers are looking beyond content-based or text-based approaches, integrating novel data sources. New data sources can consist of any type of data extracted from the context of multimedia documents, defined as the data that is not part of the raw content of a multimedia file. The Web is an extraordinary source of context data, which can be found in explicit or implicit relation to multimedia objects, such as surrounding text, tags, hyperlinks, and even in relevance-feedback. Recent advances in Web multimedia retrieval have shown that context data has great potential to bridge the semantic gap. In this article, we present the first comprehensive survey of context-based approaches for multimedia information retrieval on the Web. We introduce a data-driven taxonomy, which we then use in our literature review of the most emblematic and important approaches that use context-based data. In addition, we identify important challenges and opportunities, which had not been previously addressed in this area.
Link: https://doi.org/10.1007/s11042-017-4997-y
The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but also to understand what is happening in the world given the multiple points of view describing events. These challenges in journalism relate to Information Retrieval (IR) and Natural Language Processing (NLP) fields such as: verification of a source’s reliability; the integration of news with other sources of information; real-time processing of both news content and social streams; de-duplication of stories; and entity detection and disambiguation. Although IR and NLP have been applied to news for decades, the changing nature of the space requires fresh approaches and a closer collaboration with our colleagues from the journalism environment. Following the success of the previous version of the workshop (NewsIR’16), the goal of this workshop, held in conjunction with ECIR 2018, is to continue to stimulate such discussion between the communities and to share interesting approaches to solve real user problems. A total of 19 submissions were received and reviewed, of which 12 were accepted for presentation. In addition, the workshop had over 30 registered participants, who attended two keynote talks given by well-known experts in the field, Edgar Meij (from industry) and Peter Tolmie (from academia), as well as oral and poster presentations of the accepted papers. The workshop also included a breakout session to discuss ideas for a future data challenge in news IR and closed with a focused panel discussion to reflect on the day. In summary, several ideas were presented in the workshop on solving complex information needs in the news domain. In addition, the workshop concluded with suggestions of important challenges and shared tasks to work on as a community for News IR.
The Modified Mercalli Intensity Scale is a measure of the severity of an earthquake for a nonscientist. Since the Mercalli scale is based on perceived effects, it has a strong dependence on observers. Typically, these reports take time to be prepared and, as a consequence, Mercalli intensities are published hours after the occurrence of an earthquake. The National Seismological Center of Chile needs to provide a preliminary overview of the observed effects of an earthquake. This has motivated us to create a system for early tracking of people’s reactions in social networks to infer Mercalli intensities. By tracking people’s comments about the effects of an earthquake, a collection of Mercalli point estimates is retrieved at a county level of granularity. We introduce the concept of Reinforced Mercalli support, which combines Mercalli point estimates with social support, allowing socially unsupported estimates to be discarded. Experimental results show that our proposal is accurate, providing early Mercalli reports 30 minutes after an earthquake and detecting the maximum Mercalli intensity of an event with high accuracy in terms of mean absolute error (MAE).
In general, existing methods for automatically detecting emergency situations using Twitter rely on features based on domain-specific keywords found in messages. These keyword-based methods usually require training on domain-specific labeled data, in multiple languages, and for different types of events (e.g., earthquakes, floods, wildfires, etc.). In addition to being costly, these approaches may fail to detect previously unexpected situations, such as uncommon catastrophes or terrorist attacks. However, collective mentions of certain keywords are not the only type of self-organizing phenomenon that may arise in social media when a real-world extreme situation occurs. Just as nearby physical sensors become activated when stimulated, localized citizen sensors (i.e., users) will also react in a similar manner. To leverage this information, we propose to use self-organized activity related to geolocations to identify emergency situations. We propose to detect such events by tracking the frequencies and the probability distributions of the inter-arrival times of messages related to specific locations. Using an off-the-shelf classifier that is independent of domain-specific features, we study and describe emergency situations based solely on location-based features in messages. Our findings indicate that anomalies in location-related social media user activity indeed provide information for automatically detecting emergency situations independent of their domain.
Link: https://doi.org/10.1145/3201064.3201077
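A minimal sketch of the location-based signal idea, assuming illustrative window sizes and thresholds rather than the classifier used in the paper: track the inter-arrival times of messages mentioning one location and flag a burst when the recent mean falls far below the running baseline.

```python
from collections import deque

class LocationBurstDetector:
    """Flag unusual bursts of activity for one location from message
    timestamps (in seconds). A burst is reported when the mean inter-arrival
    time in a short recent window falls well below the running baseline.
    Purely illustrative; the paper uses an off-the-shelf classifier over
    frequency and inter-arrival features."""

    def __init__(self, window=20, ratio=0.2):
        self.window = window
        self.ratio = ratio
        self.last_ts = None
        self.recent = deque(maxlen=window)   # recent inter-arrival times
        self.baseline_sum = 0.0
        self.baseline_n = 0

    def observe(self, timestamp):
        if self.last_ts is None:
            self.last_ts = timestamp
            return False
        gap = timestamp - self.last_ts
        self.last_ts = timestamp
        self.recent.append(gap)
        self.baseline_sum += gap
        self.baseline_n += 1
        if self.baseline_n < 5 * self.window or len(self.recent) < self.window:
            return False                      # not enough history yet
        baseline = self.baseline_sum / self.baseline_n
        recent_mean = sum(self.recent) / len(self.recent)
        return recent_mean < self.ratio * baseline

# Usage sketch: feed message timestamps (in seconds) for one location.
# det = LocationBurstDetector()
# for ts in message_times:
#     if det.observe(ts):
#         print("possible emergency near this location at", ts)
```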
Some decades have passed since the concept of “named entity” was used for the first time. Since then, new lines of research have emerged in this environment, such as linking the (named) entity mentions in a text collection with their corresponding knowledge-base entries.
However, this has introduced problems with respect to reaching a consensus on the definition of the concept of "entity" in the literature. This paper aims to highlight the importance of formalizing the concept of "entity" and the benefits it would bring to the Entity Linking community, in particular relating to the construction of gold standards for evaluation purposes.
Validating RDF graphs against constraints has gained interest in recent years, due to the popularity of RDF and the growth of knowledge bases. SHACL, a constraint language for RDF, has recently become a W3C recommendation, with a specification detailing syntax, semantics and common use cases. Unfortunately, this (otherwise complete) specification does not cover validation against recursive constraints. This omission is important, because SHACL by design favors constraint references. We investigate the possibility of a formal semantics for SHACL which covers the recursive case, while being compliant with the current standard.
Most of the current graph database systems have been designed to support property graphs. Surprisingly, there is no standard specification of the database model behind such systems. This paper presents a formal definition of the property graph database model. Specifically, we define the property graph data structure, basic notions of integrity constraints (e.g. graph schema), and a graph query language.
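A minimal sketch of the data structure being formalised, with illustrative names: nodes and edges carry identifiers, labels, and property maps.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Node:
    id: int
    labels: List[str] = field(default_factory=list)
    properties: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Edge:
    id: int
    src: int                  # id of the source node
    dst: int                  # id of the target node
    label: str = ""
    properties: Dict[str, Any] = field(default_factory=dict)

@dataclass
class PropertyGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: Dict[int, Edge] = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, edge):
        self.edges[edge.id] = edge

g = PropertyGraph()
g.add_node(Node(1, ["Person"], {"name": "Ada"}))
g.add_node(Node(2, ["Person"], {"name": "Grace"}))
g.add_edge(Edge(10, 1, 2, "knows", {"since": 2017}))
```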
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a similarity join of a dataset with itself, which brings together each element with its nearest neighbor within the same dataset. Solving the problem using a simple brute-force algorithm requires O(n^2) distance calculations, since it compares every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximate solution for the self-similarity join and computes only O(n^(3/2)) distances. We show how the algorithm can be easily modified in order to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and such that 79% of the results are within the 10-NN, with no significant extra distance computations. We show how the algorithm can be executed in parallel and prove that, using Θ(√n) processors, the total execution takes linear time. We end by discussing ways in which the algorithm can be improved in the future.
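A minimal sketch of a generic divide-and-conquer approximation in this spirit (not the authors' exact heuristic): order the points by distance to a random pivot, split in half, recurse, and pair each element only with the nearest neighbour found inside its own block, so most of the O(n^2) comparisons are avoided at the cost of some incorrect 1-NNs near the split boundaries.

```python
import math
import random

def euclidean(a, b):
    return math.dist(a, b)

def brute_force_1nn(points):
    """Exact 1-NN self-join on a small block: O(k^2) distances."""
    result = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        if others:
            result[i] = min(others, key=lambda j: euclidean(p, points[j]))
    return result

def approx_self_join(indices, points, block=32):
    """Approximate 1-NN self-similarity join by divide and conquer."""
    if len(indices) <= block:
        local = brute_force_1nn([points[i] for i in indices])
        return {indices[i]: indices[j] for i, j in local.items()}
    pivot = points[random.choice(indices)]
    ordered = sorted(indices, key=lambda i: euclidean(points[i], pivot))
    mid = len(ordered) // 2
    out = {}
    out.update(approx_self_join(ordered[:mid], points, block))
    out.update(approx_self_join(ordered[mid:], points, block))
    return out      # elements near the split may get only an approximate 1-NN

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
nn = approx_self_join(list(range(len(pts))), pts)
print(nn[0])   # index of the (approximate) nearest neighbour of point 0
```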
AMW 2018, Alberto Mendelzon Workshop on Foundations of Data Management.
Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management Cali, Colombia, May 21-25, 2018.
Edited by
Dan Olteanu, University of Oxford, UK
Bárbara Poblete, University of Chile, Chile.
Graph-based data models [1] have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph [2] where users can add new properties and types that can be used to define further data. The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of knowledge graphs. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs.
The traditional approach in the Semantic Web has been what Pham and Boncz [3] call the schema first approach, which defines the schema that the data should follow. The most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach [3], which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting high-level summaries that characterise the graph, resulting in a data-driven schema. In this paper, we summarise recently published results [4] on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains, and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.
We consider the problem of determining whether organizations facing a new data-transformation task can avoid building a new transformation procedure from scratch by reusing their stored procedures. Because it can be difficult to obtain exact descriptions of what stored procedures do, our framework abstracts data-transforming tools as black-box procedures, in which a procedure description indicates the parts of the database that might be modified by the procedure and constraints on the states of the database that must hold before and after the application of this procedure.
In this paper we present our framework and study the problem of determining, given a database and a set of procedures, whether there is a sequence of procedures from this set such that their application to the database results in the satisfaction of a boolean query. This data readiness problem is undecidable in general, but we show decidability for a broad and realistic class of procedures.
GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits highly efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
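A toy illustration (ours, not from the paper) of why responses can blow up: over cyclic data where everyone is a friend of everyone else, a query that nests the same field k times unfolds into a response tree of roughly exponential size in k, yet the size itself can be counted cheaply by memoising on (node, remaining depth).

```python
from functools import lru_cache

# Toy cyclic data: every user is a friend of every other user.
users = ["u1", "u2", "u3", "u4"]
friends = {u: [v for v in users if v != u] for u in users}

# Hypothetical query nesting the `friends` field `depth` times, e.g. for depth=2:
#   { user(id:"u1") { friends { friends { name } } } }

def response_tree_size(user, depth):
    """Number of objects in the fully unfolded (tree-shaped) response.
    Grows exponentially with the nesting depth over cyclic data."""
    if depth == 0:
        return 1
    return 1 + sum(response_tree_size(f, depth - 1) for f in friends[user])

@lru_cache(maxsize=None)
def response_size_fast(user, depth):
    """Same count, but memoised on (user, depth): only polynomially many
    distinct calls, echoing the idea that the response size can be computed
    cheaply even when the response itself is huge."""
    if depth == 0:
        return 1
    return 1 + sum(response_size_fast(f, depth - 1) for f in friends[user])

for k in (2, 4, 8):
    print(k, response_size_fast("u1", k))   # grows roughly like 3**k objects
```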
Despite its importance to the Web, multimedia content is often neglected when building and designing knowledge-bases: though descriptive metadata and links are often provided for images, video, etc., the multimedia content itself is often treated as opaque and is rarely analysed. IMGpedia is an effort to bring together the images of Wikimedia Commons (including visual information), and relevant knowledge-bases such as Wikidata and DBpedia. The result is a knowledge-base that incorporates similarity relations between the images based on visual descriptors, as well as links to the resources of Wikidata and DBpedia that relate to the image. Using the IMGpedia SPARQL endpoint, it is then possible to perform visuo-semantic queries, combining the semantic facts extracted from the external resources and the similarity relations of the images. This paper presents a new web interface to browse and explore the dataset of IMGpedia in a more friendly manner, as well as new visuo-semantic queries that can be answered using 6 million recently added links from IMGpedia to Wikidata. We also discuss future directions we foresee for the IMGpedia project.
We research the problem of building knowledge maps of graph-like information. We live in the digital era and similarly to the Earth, the Web is simply too large and its interrelations too complex for anyone to grasp much of it through direct observation. Thus, the problem of applying cartographic principles also to digital landscapes is intriguing. We introduce a mathematical formalism that captures the general notion of map of a graph and enables its development and manipulation in a semi-automated way. We describe an implementation of our formalism on the Web of Linked Data graph and discuss algorithms that efficiently generate and combine (via an algebra) regions and maps. Finally, we discuss examples of knowledge maps built with a tool implementing our framework.
Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann. LDOW@WWW 2018, Lyon, France April 23rd, 2018. CEUR Workshop Proceedings 2073, CEUR-WS.org 2018.
Linear Temporal Logic (LTL) synthesis can be understood as the problem of building a controller that defines a winning strategy for a two-player game against the environment, where the objective is to satisfy a given LTL formula. It is an important problem with applications in software synthesis, including controller synthesis. In this paper we establish the correspondence between LTL synthesis and fully observable non-deterministic (FOND) planning. We study LTL interpreted over both finite and infinite traces. We also provide the first explicit compilation that translates an LTL synthesis problem to a FOND problem. Experiments with state-of-the-art LTL FOND and synthesis solvers show automated planning to be a viable and effective tool for highly structured LTL synthesis problems.
Info-boxes provide a summary of the most important meta-data relating to a particular entity described by a Wikipedia article. However, many articles have no info-box or have info-boxes with only minimal information; furthermore, there is a huge disparity between the level of detail available for info-boxes in English articles and those for other languages. Wikidata has been proposed as a central repository of facts to try to address such disparities, and has been used as a source of information to generate info-boxes. However, current processes still rely on human intervention either to create generic templates for entities of a given type or to create a specific info-box for a specific article in a specific language. As such, there are still many articles of Wikipedia without info-boxes but where relevant data are provided by Wikidata. In this paper, we investigate fully automatic methods to generate info-boxes for Wikipedia from the Wikidata knowledge graph. The primary challenge is to create ranking mechanisms that provide an intuitive prioritisation of the facts associated with an entity. We discuss this challenge, propose several straightforward metrics to prioritise information in info-boxes, and present an initial user evaluation to compare the quality of info-boxes generated by various metrics.
Navigational queries over RDF data are viewed as one of the main applications of graph query languages, and yet the standard model of graph databases—essentially labeled graphs—is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning.
Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our language with previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide an implementation of recursive TriAL on top of a relational query engine, and we show its usefulness by running a wide array of navigational queries over real-world RDF data, while at the same time testing how our implementation compares to existing RDF systems.
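A minimal sketch (ours) of the closure idea: a join that consumes two sets of triples and produces triples again, here by composing edges that share a middle component under a fixed output predicate; TriAL's actual join family is considerably more general.

```python
def triple_join(r1, r2, out_pred="connected"):
    """Join two sets of (s, p, o) triples on o1 = s2 and output triples again,
    so the result can be fed back into further joins (closure under the
    operation). This is one concrete join from the family used in TriAL; the
    algebra offers many more ways to equate and project components."""
    return {(s1, out_pred, o2)
            for (s1, p1, o1) in r1
            for (s2, p2, o2) in r2
            if o1 == s2}

triples = {
    ("alice", "supervises", "bob"),
    ("bob", "supervises", "carol"),
    ("carol", "memberOf", "lab1"),
}

step1 = triple_join(triples, triples)
print(sorted(step1))
# [('alice', 'connected', 'carol'), ('bob', 'connected', 'lab1')]

step2 = triple_join(step1, triples)
print(sorted(step2))
# [('alice', 'connected', 'lab1')]
```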
The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is very relevant, as it can give fundamental insights into protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets users design, search, and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is graphically designed as a graph whose nodes represent protein components and whose edges represent structural relationships. The resulting graph pattern is transformed into a SQL query and executed in a PostgreSQL database system where the PDB data are stored. The results of the search are presented using a textual representation, and the corresponding binding sites can be visualized using a JSmol interface.
Community Question Answering (cQA) sites have emerged as platforms designed specifically for the exchange of questions and answers among users. Although users tend to find good quality answers in cQA sites, they also engage in a significant volume of QA interactions in other platforms, such as microblog networking sites. This is explained in part because microblog platforms contain up-to-date information on current events, provide rapid information propagation, and enjoy social trust.
Despite the potential of microblog platforms, such as Twitter, for automatic QA retrieval, how to leverage them for this task is not clear. There are unique characteristics that differentiate Twitter from traditional cQA platforms (e.g., short message length, low quality and noisy information), which prevent directly applying prior findings from the area. In this work, we address this problem by studying: (1) the feasibility of Twitter as a QA platform and (2) the discriminating features that identify relevant answers to a particular query. In particular, we create a document model at the conversation-thread level, which enables us to aggregate microblog information, and set up a learning-to-rank framework, using factoid QA as a proxy task. Our experimental results show that microblog data can indeed be used to perform QA retrieval effectively. We identify domain-specific features and combinations of those features that better account for improving QA ranking, achieving an MRR of 0.7795 (improving 62% over our baseline method). In addition, we provide evidence that our method allows retrieving complex answers to non-factoid questions.
Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martinez, Barbara Poblete, Andreas Valochas: Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval co-located with 40th European Conference on Information Retrieval (ECIR 2018), Grenoble, France, March 26, 2018. CEUR Workshop Proceedings 2079, CEUR-WS.org 2018.
In the Encyclopedia of Big Data Technologies:
A query language is a high-level computer language for the retrieval and modification of data held in databases or files. Query languages usually consist of a collection of operators which can be applied to any valid instances of the data structure types of a data model, in any combination desired.
In the context of graph data management, a graph query language (GQL) defines the way to retrieve or extract data which have been modeled as a graph and whose structure is defined by a graph data model. Therefore, a GQL is designed to support specific graph operations, such as graph pattern matching and shortest path finding.
In the Encyclopedia of Big Data Technologies:
Navigational query languages for graph databases allow users to recursively traverse the edges of a graph while checking for the existence of a path that satisfies certain regular conditions. The basic building block of such languages is the class of regular path queries (RPQs), which are expressions that compute the pairs of nodes linked by a path whose label satisfies a regular expression. RPQs are often extended with features that make them more flexible for practical applications, e.g., with the ability to traverse edges in the backward direction (RPQs with inverses) or to express arbitrary patterns over the data (conjunctive RPQs).
Link: https://doi.org/10.1007/978-3-319-63962-8_214-1
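A minimal sketch of RPQ evaluation, assuming a toy edge-labelled graph and a hand-built automaton for the expression knows*·worksAt: a BFS over the product of graph nodes and automaton states returns the nodes reachable under the expression.

```python
from collections import deque

# Edge-labelled graph: (source, label, target).
edges = [
    ("ana", "knows", "bob"),
    ("bob", "knows", "cris"),
    ("cris", "worksAt", "acme"),
    ("bob", "worksAt", "initech"),
]
adj = {}
for s, l, t in edges:
    adj.setdefault(s, []).append((l, t))

# Hand-built automaton for the regular expression  knows* . worksAt
# state 0 --knows--> 0,  state 0 --worksAt--> 1,  state 1 is accepting.
delta = {(0, "knows"): 0, (0, "worksAt"): 1}
accepting = {1}

def rpq(start):
    """All nodes reachable from `start` by a path whose label sequence
    matches knows*.worksAt, via BFS on the product graph (node, state)."""
    seen = {(start, 0)}
    answers = set()
    queue = deque(seen)
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, target in adj.get(node, []):
            nxt = delta.get((state, label))
            if nxt is not None and (target, nxt) not in seen:
                seen.add((target, nxt))
                queue.append((target, nxt))
    return answers

print(rpq("ana"))   # {'acme', 'initech'}: via knows,knows,worksAt and knows,worksAt
```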
We consider the containment problem for regular queries with memory and regular queries with data tests: two recently proposed query languages for graph databases that, in addition to allowing the user to ask topological queries, also track how the data changes along paths connecting various points in the database. Our results show that the problem is undecidable in general. However, by allowing only positive data comparisons we find natural fragments with better static analysis properties: the containment problem is PSPACE-complete in the case of regular queries with data tests and EXPSPACE-complete in the case of regular queries with memory.
Highlights:
- Study of graph query languages that can deal with data values.
- The containment problem for these languages is undecidable in general.
- Containment is decidable if one focuses on languages that can only check for equalities.
- The proofs make use of automata models.
This book constitutes the refereed post-conference proceedings of the 22nd Iberoamerican Congress on Pattern Recognition, CIARP 2017, held in Valparaíso, Chile, in November 2017.
The 87 papers presented were carefully reviewed and selected from 156 submissions. The papers feature research results in the areas of pattern recognition, image processing, computer vision, multimedia and related fields.
Link: https://link.springer.com/book/10.1007%2F978-3-319-75193-1
RDF compression can be defined as the problem of encoding an RDF dataset using fewer bits than required by text-based traditional serialization formats like RDF/XML, N-Triples, or Turtle, among others. These savings immediately lead to more efficient storage (i.e., archival) and lower transmission costs (i.e., fewer bits over the wire). Although this problem can be easily solved through universal compression (e.g., gzip or bzip2), optimized RDF-specific compressors take advantage of the particular features of RDF datasets (such as semantic redundancies) in order to save more bits or to provide retrieval operations on the compressed information. RDF self-indexes are focused on this latter task.
RDF self-indexes are RDF compressors that provide indexing features in a space close to that of the compressed dataset and can be accessed with no prior (or partial) decompression. These properties enhance scalability (i.e., fewer resources are required to serve semantic data) and speed up access, as more information can be managed in higher levels of the memory hierarchy (typically, main memory or cache). In addition, efficient search algorithms have been proposed to resolve basic queries on top of self-indexed datasets. As a result, RDF self-indexes have been adopted as a core component of semantic search engines and lightweight Linked Data servers.
Finally, RDF stream compressors specifically focus on compressing a (continuous) stream of RDF data in order to improve exchange processes, typically in real-time. This constitutes a more recent trend that exploits different trade-offs between the space savings achieved by the compressor and the latency introduced in the compression/decompression processes.
This entry introduces basic notions of RDF compression, RDF self-indexing, and RDF stream compression and discusses how existing approaches deal with (and remove) redundant information in semantic datasets.
DOI https://doi.org/10.1007/978-3-319-63962-8_62-1
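A minimal sketch of the most basic RDF-specific saving, assuming illustrative IRIs: replace repeated terms by integer IDs from a dictionary, after which the ID-triples (plus the dictionary) occupy far less space than the raw text serialization; real compressors and self-indexes build role-aware dictionaries and compact triple structures on top of this idea.

```python
def dictionary_encode(triples):
    """Map every distinct RDF term to an integer ID and rewrite the triples
    as ID-triples. This is only the first, simplest step of RDF-specific
    compression; self-indexes additionally arrange the ID-triples so that
    triple patterns can be resolved without decompression."""
    dictionary, ids = {}, []
    for triple in triples:
        encoded = []
        for term in triple:
            if term not in dictionary:
                dictionary[term] = len(dictionary) + 1
            encoded.append(dictionary[term])
        ids.append(tuple(encoded))
    return dictionary, ids

triples = [
    ("<http://example.org/alice>", "<http://xmlns.com/foaf/0.1/knows>",
     "<http://example.org/bob>"),
    ("<http://example.org/bob>", "<http://xmlns.com/foaf/0.1/knows>",
     "<http://example.org/carol>"),
]
dictionary, id_triples = dictionary_encode(triples)
print(id_triples)   # [(1, 2, 3), (3, 2, 4)]
```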
3D objects are an important type of data with many applications in domains such as Engineering and Computer Aided Design, Science, Simulation, Visualization, Cultural Heritage, and Entertainment. Technological progress in acquisition, modeling, processing, and dissemination of 3D geometry leads to the accumulation of large repositories of 3D objects. Consequently, there is a strong need to research and develop technology to support the effective retrieval of 3D object data from 3D repositories.
Link: https://doi.org/10.1007/978-1-4899-7993-3_161-2