Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.
European politics at the turn of the 19th century saw a dramatic reduction in the number and diversity of polities as the territorial nation-state emerged as the dominant form of political organization. The transformation had a profound impact on the periphery. The study examines how embracing the principle of territoriality transformed relations between settler societies and indigenous peoples in South America. As this shift coincided with independence from Spain, Creole elites rapidly dismantled the remnants of imperial heteronomy, ending centuries of inter-cultural diplomacy. The study illustrates this shift in the case of the “Southern frontier,” where Spain had maintained a practice of treaty making with the Mapuche people since the mid-17th century. This long-standing practice broke down shortly after Chile gained independence in 1818. What followed was a policy of coercive assimilation through military conquest and forced displacement — a policy that settler societies implemented elsewhere in the 19th century. In contrast to explanations that emphasize the spread of capitalist agriculture and racist ideologies, this study argues that territoriality spelled the end of inter-cultural diplomacy along the “Southern frontier.”
This work presents and studies the efficiency problem of mapping GPU threads onto simplex domains. A non-linear map λ(ω) is formulated based on a block-space enumeration principle that reduces the number of thread-blocks by a factor of approximately 2× and 6× for 2-simplex and 3-simplex domains, respectively, when compared to the standard approach. Performance results show that λ(ω) is competitive, and even the fastest map when run on recent GPU architectures such as the Tesla V100, where it reaches up to 1.5× speedup in 2-simplex tests. In 3-simplex tests, it reaches up to 2.3× speedup for small workloads and up to 1.25× for larger ones. The results obtained make λ(ω) a useful GPU optimization technique with applications to parallel problems that define all-pairs, all-triplets or nearest-neighbor interactions in a 2-simplex or 3-simplex domain.
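As a concrete illustration of enumerating a 2-simplex (lower-triangular) domain in block space, the classic triangular-number inversion maps a linear block index to triangular coordinates, so only about half the blocks of the standard bounding-box approach are needed. This is a simplified sketch in the spirit of λ(ω), not the paper's exact GPU map:

```python
import math

def lambda_map(k: int) -> tuple[int, int]:
    # Invert the triangular number: block k lands on row i, column j,
    # with 0 <= j <= i, so only the n(n+1)/2 lower-triangular cells
    # get a block (vs. n^2 cells in a standard bounding-box map).
    i = (math.isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j

# Enumerate all pairs (i, j) with j <= i for n = 4 rows:
pairs = [lambda_map(k) for k in range(4 * 5 // 2)]
```

On a GPU the same arithmetic would run per thread-block, with the square root evaluated in hardware; handling the precision of that square root at large indices is one of the issues such maps must address.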
Graphs are being increasingly adopted as a flexible data model in scenarios (e.g., Google’s Knowledge Graph, Facebook’s Graph API, Wikidata, etc.) where multiple editors are involved in content creation, where the schema is ever changing, where data are incomplete, and where the connectivity of resources plays a key role: scenarios where relational models traditionally struggle. But with this flexibility comes a conceptual cost: it can be difficult to summarise and understand, at a high level, the content that a given graph contains. Hence profiling graphs becomes of increasing importance to extract order, a posteriori, from the chaotic processes by which such graphs are generated. This talk will motivate the use of graphs as a data model, abstract recent trends in graph data management, and then turn to the issue of profiling and summarising graphs: what are the goals of such profiling, by which principles can graphs be summarised, and by which techniques can this be achieved? The talk will emphasise the importance of profiling graphs while highlighting a variety of open research questions yet to be tackled.
Jan Van den Bussche, Marcelo Arenas: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018. ACM 2018.
This volume contains the proceedings of PODS 2018, which include a paper for the keynote address by Michael Benedikt (University of Oxford), abstracts of two invited tutorials by Rajeev Raman (University of Leicester) and Arvind Narayanan (Princeton University), and 29 contributions that were selected by the Program Committee for presentation at the symposium.
In addition, this volume also contains papers from our two «Gems of PODS» speakers, Hung Ngo (Relational AI) and Phokion G. Kolaitis (UC Santa Cruz and IBM Research – Almaden). The Gems of PODS is an event, started in 2016, where the goal is to promote understanding of past seminal PODS results to the general audience. The Gems of PODS papers were selected by the Gems of PODS committee consisting of Marcelo Arenas (chair) (Pontificia Universidad Católica de Chile), Tova Milo (Tel Aviv University) and Dan Olteanu (Oxford University).
This year, PODS continued with the two submission cycles introduced two years ago. The first cycle allowed papers to be revised and resubmitted. For the first cycle, 30 papers were submitted, 7 of which were directly selected for inclusion in the proceedings, and 5 were invited for resubmission after a revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 4 out of 5 revised papers were selected for the proceedings. For the second cycle, 53 papers were submitted, 18 of which were selected, resulting in 29 papers selected overall from a total of 83 submissions.
An important task for the Program Committee has been the selection of the PODS 2018 Best Paper Award. The committee selected the paper: «Entity Matching with Active Monotone Classification» by Yufei Tao. On behalf of the committee, we would like to extend our sincere congratulations to the author!
Since 2008, PODS has assigned the ACM PODS Alberto O. Mendelzon Test-of-Time Award to a paper, or a small number of papers, published in the PODS proceedings ten years prior that had the most impact over the intervening decade. This year’s committee, consisting of Maurizio Lenzerini, Wim Martens and Nicole Schweikardt, selected the following paper: «The Chase Revisited» by Alin Deutsch, Alan Nash and Jeff Remmel.
The recent availability of data about writing processes at keystroke granularity has enabled research on the evolution of document writing. A natural task is to develop systems that can actually show this data, that is, user interfaces that transform the data of the writing process (today a black box) into intelligible forms. Along these lines, we propose a data structure that captures a document’s fine-grained history and an organic visualization that serves as an interface to it. We evaluate a proof-of-concept implementation of the system through a pilot study using documents written by students at a public university. Our results are promising and reveal facets of the writing process such as the general strategies adopted, local edit density, and the hierarchical structure of the final text.
In this paper, we propose a novel data-driven schema for large-scale heterogeneous knowledge graphs inspired by Formal Concept Analysis (FCA). We first extract the sets of properties associated with individual entities; these property sets (aka. characteristic sets) are annotated with cardinalities and used to induce a lattice based on set-containment relations, forming a natural hierarchical structure describing the knowledge graph. We then propose an algebra over such schema lattices, which allows us to compute diffs between lattices (for example, to summarise the changes from one version of a knowledge graph to another), to add lattices (for example, to project future changes), and so forth. While we argue that this lattice structure (and associated algebra) may have various applications, we currently focus on the use-case of modelling and predicting the dynamic behaviour of knowledge graphs. Along those lines, we instantiate and evaluate our methods for analysing how versions of the Wikidata knowledge graph have changed over a period of 11 weeks. We propose algorithms for constructing the lattice-based schema from Wikidata, and evaluate their efficiency and scalability. We then evaluate the use of the resulting schema(ta) for predicting how the knowledge graph will evolve in future versions.
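A minimal sketch of the first step, extracting characteristic sets with their cardinalities, on hypothetical toy triples (entity and property names are illustrative only):

```python
from collections import Counter

# Hypothetical toy triples (entity, property, value):
triples = [
    ("q1", "name", "Ada"),  ("q1", "born", 1815),
    ("q2", "name", "Alan"), ("q2", "born", 1912), ("q2", "field", "CS"),
    ("q3", "name", "Kurt"), ("q3", "born", 1906),
]

def characteristic_sets(triples):
    """Group entities by the set of properties they use; the counts
    (cardinalities) annotate each set in the lattice."""
    props = {}
    for s, p, _ in triples:
        props.setdefault(s, set()).add(p)
    return Counter(frozenset(ps) for ps in props.values())

cs = characteristic_sets(triples)
# {name, born} is a subset of {name, born, field}: this set containment
# is what induces an edge between the two nodes of the schema lattice.
```

Here {name, born} would carry cardinality 2 (entities q1 and q3) and {name, born, field} cardinality 1 (entity q2); the lattice is then built over the subset relation between such sets.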
Although the amount of RDF data has been steadily increasing over the years, the majority of information on the Web still resides in other formats and is often not accessible to Semantic Web services. A lot of this data is available through APIs serving JSON documents. In this work we propose a way of extending SPARQL with the option to consume JSON APIs and integrate the obtained information into SPARQL query answers, thus obtaining a query language that can bring data from the “traditional” Web to the Semantic Web. Looking to evaluate these queries as efficiently as possible, we show that the main bottleneck is the number of API requests, and present an algorithm that produces “worst-case optimal” query plans that reduce the number of requests as much as possible. We also present a set of experiments that empirically confirm the optimality of our approach.
Graph data management concerns the research and development of powerful technologies for storing, processing and analyzing large volumes of graph data. This chapter presents an overview of the foundations and systems for graph data management. Specifically, we present a historical overview of the area, study graph database models, characterize essential graph-oriented queries, review graph query languages, and explore the features of current graph data management systems (i.e., graph databases and graph-processing frameworks).
Representing the movements of objects (trips) over a network in a compact way while retaining the capability of exploiting such data effectively is an important challenge of real applications. We present a new Compact Trip Representation (CTR) that handles the spatio-temporal data associated with users’ trips over transportation networks. Depending on the network and types of queries, nodes in the network can represent intersections, stops, or even street segments.
CTR represents separately sequences of nodes and the time instants when users traverse these nodes. The spatial component is handled with a data structure based on the well-known Compressed Suffix Array (CSA), which provides both a compact representation and interesting indexing capabilities. The temporal component is self-indexed with either a Hu–Tucker-shaped Wavelet-Tree or a Wavelet Matrix that solve range-interval queries efficiently. We show how CTR can solve relevant counting-based spatial, temporal, and spatio-temporal queries over large sets of trips. Experimental results show the space requirements (around 50–70% of the space needed by a compact non-indexed baseline) and query efficiency (most queries are solved in the range of 1–1000 µs) of CTR.
Highlights: -We provide a representation for trips over networks that answers counting-based queries. -We adapt a Compressed Suffix Array to deal with the spatial component of trips. -We use a Wavelet Matrix or a Hu–Tucker-shaped Wavelet Tree for the temporal component. -Experiments show space requirements as low as 50% of a plain representation. -Experiments show counting-based query times typically within 1–1000 µs.
The aim of this research is to measure, using smart card data, how much public transport users change their behavior over time. To quantify this change, we split the smart card records of a user into a set of separate time windows and then measure the variation between each pair of time windows. Three algorithms that calculate the variation in users’ mobility are assessed. Using data from a Canadian transport system, we show that measuring the stability of user behavior at an individual level provides new insights for public transport operators; e.g., it can be used to measure users’ adaptability to changes in the transport system.
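One simple way to quantify the variation between two time windows is the Jaccard distance between the sets of observed boardings. This is a stand-in illustration for the three algorithms assessed in the paper, with hypothetical stop identifiers:

```python
def window_variation(window_a, window_b):
    """Jaccard distance between the sets of (stop, hour-of-day)
    boardings seen in two time windows: 0.0 means identical
    behavior, 1.0 means completely disjoint."""
    a, b = set(window_a), set(window_b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

# Hypothetical boardings of one user in two consecutive weeks:
week1 = [("stop_12", 8), ("stop_40", 18)]
week2 = [("stop_12", 8), ("stop_7", 9)]
```

A user whose windows always score near 0.0 is highly stable; spikes toward 1.0 after a network change would indicate low adaptability of that user's routine.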
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal intervals when they are active. This work explores the use of the Compressed Suffix Array (CSA), a well-known compact and self-indexed data structure in the area of text indexing, to represent large temporal graphs. The new structure, called Temporal Graph CSA (TGCSA), is experimentally compared with the most competitive compact data structures in the state-of-the-art, namely, EdgeLog and CET. The experimental results show that TGCSA obtains a good space-time trade-off. It uses a reasonable space and is efficient for solving complex temporal queries. Furthermore, TGCSA has wider expressive capabilities than EdgeLog and CET, because it is able to represent temporal graphs where contacts on an edge can temporally overlap.
Highlights: -We consider the problem of representing temporal graphs in a compact way. -We can represent temporal graphs with contacts on a same edge that temporally overlap. -We design TGCSA and show how it solves typical temporal queries. -We create a novel representation of Ψ that improves the performance of TGCSA. -We obtain a reasonable space/time tradeoff even on complex temporal queries.
We present a demo of the GraFa faceted search and browsing interface over the Wikidata knowledge graph. We describe the key aspects of the interface, including the types of interactions that the system allows, the ranking schemes employed, and other features to aid usability. We also discuss future plans for improving the system. Online Demo: http://grafa.dcc.uchile.cl/
Docker is the most popular implementation of operating-system-level virtualization; its online registry service (Docker Hub) currently stores more than 4.5 million software images. Using this registry, it is possible to download and deploy Docker images as software containers. However, these images only describe the main software they provide, hiding the dependencies needed to run it. To allow users to track what they deploy onto their machines, we developed DockerPedia, a resource that publishes information about the packages within Docker images as Linked Data.
Our resource currently covers 28% of the most downloaded images from Docker Hub, providing information about each image’s software dependencies and their vulnerabilities. This makes it easy to reproduce the environment in which each image was deployed, as well as to check the security of an image, without the need to download it.
We demonstrate a system to canonicalise (aka. normalise) SPARQL queries for use-cases such as caching, query log analysis, query minimisation, signing queries, etc. Our canonicalisation method deterministically rewrites a given input query to an equivalent canonical form such that the results for two queries are syntactically (string) equal if and only if they give the same results on any database, modulo variable names. The method is sound and complete for a monotone fragment of SPARQL with selection (equalities), projection, join and union under both set and bag semantics. When other SPARQL features are considered (e.g., optional, filter, graph, etc.), the underlying equivalence problem becomes undecidable, so we currently support a best-effort canonicalisation for the other features of SPARQL 1.0. We demonstrate a prototype of our canonicalisation framework, provide example rewritings, and discuss limitations, use-cases and future work. Demo link: http://qcan.dcc.uchile.cl
One of the challenges of recent RDF-based applications is managing data quality, and several systems already provide RDF validation procedures (e.g., https://www.stardog.com/docs/, https://www.topquadrant.com/technology/shacl/). This created the need for a standardized declarative constraint language for RDF, and for mechanisms to detect violations of such constraints. An important step in this direction is SHACL, or Shapes Constraint Language (https://www.w3.org/TR/shacl/), which became a W3C recommendation in 2017. The SHACL specification, however, explicitly leaves the validation of recursive constraints undefined. In a previous article, we showed that extending the specification’s semantics to accommodate recursion leads to intractability (in the size of the graph) for the so-called “core constraint components” of SHACL. This result holds already for stratified constraints, which may come as a surprise, considering that stratification guarantees tractability in well-studied recursive languages such as Datalog. Our previous work identified a tractable fragment of SHACL’s core components. In this paper, we propose an alternative approach to gain tractability, retaining all SHACL operators but strengthening the stratification condition traditionally used in logic programming. More exactly, we introduce a syntactic condition on shape constraints called “strict stratification”, which guarantees that graph validation is in PTIME in combined (i.e., graph and constraints) complexity. We also describe a procedure to perform such validation. Due to space limitations, the current paper is not self-contained, but all definitions can be found in our previous article or its online extended version.
Entity Linking (EL) associates the entities mentioned in a given input text with their corresponding knowledge-base (KB) entries. A recent EL trend is towards multilingual approaches. However, one may ask: are multilingual EL approaches necessary with recent advancements in machine translation? Could we not simply focus on supporting one language in the EL system and translate the input text to that language? We present experiments along these lines comparing multilingual EL systems with their results over machine translated text.
Traditional video understanding tasks include human action recognition and actor/object semantic segmentation. However, the combined task of providing semantic segmentation for different actor classes simultaneously with their action class remains a challenging but necessary task for many applications. In this work, we propose a new end-to-end architecture for tackling this task in videos. Our model effectively leverages multiple input modalities, contextual information, and multitask learning in the video to directly output semantic segmentations in a single unified framework. We train and benchmark our model on the Actor-Action Dataset (A2D) for joint actor-action semantic segmentation, and demonstrate state-of-the-art performance for both segmentation and detection. We also perform experiments verifying our approach improves performance for zero-shot recognition, indicating generalizability of our jointly learned feature space.
Explaining automatic recommendations is an active area of research since it has shown an important effect on users’ acceptance of the items recommended. However, there is a lack of research in explaining content-based recommendations of images based on visual features. In this paper, we aim to fill this gap by testing three different interfaces (one baseline and two novel explanation interfaces) for artistic image recommendation. Our experiments with N=121 users confirm that explanations of recommendations in the image domain are useful and increase user satisfaction, perception of explainability, relevance, and diversity. Furthermore, our experiments show that the results also depend on the underlying recommendation algorithm used. We tested the interfaces with two algorithms: Deep Neural Networks (DNN), with high accuracy but with difficult-to-explain features, and the more explainable method based on Attractiveness Visual Features (AVF). The better the accuracy performance (in our case, the DNN method), the stronger the positive effect of the explainable interface. Notably, the explainable features of the AVF method increased the perception of explainability but did not increase the perception of trust, unlike DNN, which improved both dimensions. These results indicate that algorithms in conjunction with interfaces play a significant role in the perception of explainability and trust for image recommendation. We plan to further investigate the relationship between interface explainability and algorithmic performance in recommender systems.
Representing the trajectories of mobile objects has become a hot topic with the widespread use of smartphones and other GPS devices. However, few works have focused on representing trips over public transportation networks (buses, subway, and trains), where users’ trips can be seen as a sequence of stages performed within a vehicle shared with many other users. In this context, representing vehicle journeys reduces redundancy because all the passengers inside a vehicle share the same arrival time for each stop. In addition, each vehicle journey follows exactly the sequence of stops corresponding to its line, which makes it unnecessary to represent that sequence for each journey.
To address data management for transportation systems, we designed a conceptual model that gave us better insight into this data domain, allowed us to define relevant terms, and helped us detect sources of redundancy among those data. Then, we designed two compact representations focused on users’ trips (TTCTR) and on vehicle trips (AcumM), respectively. Each approach has its own strengths and is able to answer some queries efficiently.
We include experimental results over synthetic trips generated from accurate schedules obtained from a real network description (from the bus transportation system of Madrid) to show the space/time trade-off of both approaches. We considered a wide range of different queries about the use of the transportation network such as counting-based/aggregate queries regarding the load of any line of the network at different times.
The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with corresponding entities in a given knowledge base. While traditional EL approaches have largely focused on English texts, current trends are towards language-agnostic or otherwise multilingual approaches that can perform EL over texts in many languages. One of the obstacles to ongoing research on multilingual EL is a scarcity of annotated datasets with the same text in different languages. In this work we thus propose VoxEL: a manually-annotated gold standard for multilingual EL featuring the same text expressed in five European languages. We first motivate and describe the VoxEL dataset, using it to compare the behaviour of state-of-the-art multilingual EL systems for five different languages, contrasting these results with those obtained using machine translation to English. Overall, our results identify how five state-of-the-art multilingual EL systems compare for various languages, how the results for different languages compare, and further suggest that machine translation of input text to English is now a competitive alternative to dedicated multilingual EL configurations.
Timely detection and accurate description of extreme events, such as natural disasters and other crisis situations, are crucial for emergency management and mitigation. Extreme-event detection is challenging, since one has to rely upon reports from human observers appointed to specific geographical areas, or on an expensive and sophisticated infrastructure. In the case of earthquakes, geographically dense sensor networks are expensive to deploy and maintain. Therefore, only some regions, or even countries, are able to acquire useful information about the effects of earthquakes in their own territory. An inexpensive and viable alternative is to detect extreme real-world events through people’s reactions in online social networks. In particular, Twitter has gained popularity within the scientific community for providing access to real-time “citizen sensor” activity. Nevertheless, the massive amount of messages in the Twitter stream, along with the noise it contains, underpins a number of difficulties when it comes to Twitter-based event detection. We contribute to addressing these challenges by proposing an online method for detecting unusual bursts in discrete-time signals extracted from Twitter. This method only requires a one-off semisupervised initialization and can be scaled to track multiple signals in a robust manner. We also show empirically how our proposed approach, which was envisioned for generic event detection, can be adapted for worldwide earthquake detection, where we compare the proposed model to the state of the art for earthquake tracking using social media. Experimental results validate our approach as a competitive alternative, in terms of precision and recall, to leading solutions, with the advantage of implementation simplicity and worldwide scalability.
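For intuition, here is a generic rolling z-score burst detector over a discrete-time signal. This is a common baseline technique shown for illustration only, not the semisupervised method proposed in the paper:

```python
from collections import deque

def detect_bursts(signal, window=20, threshold=3.0):
    """Flag indices whose value exceeds the rolling mean of the
    previous `window` samples by more than `threshold` rolling
    standard deviations."""
    hist = deque(maxlen=window)
    bursts = []
    for i, x in enumerate(signal):
        if len(hist) == window:
            mean = sum(hist) / window
            std = (sum((v - mean) ** 2 for v in hist) / window) ** 0.5
            if std > 0 and (x - mean) / std > threshold:
                bursts.append(i)
        hist.append(x)
    return bursts

# A calm message-count signal with a single spike at index 20:
signal = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5] * 2 + [60] + [5] * 5
```

Because the detector keeps only a bounded window per signal, the same loop can be replicated cheaply across many keyword signals, which is the kind of scalability the paper's method targets.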
Faceted browsing has become a popular paradigm for user interfaces on the Web and has also been investigated in the context of RDF graphs. However, current faceted browsers for RDF graphs encounter performance issues when faced with two challenges: scale, where large datasets generate many results, and heterogeneity, where large numbers of properties and classes generate many facets. To address these challenges, we propose GraFa: a faceted browsing system for heterogeneous large-scale RDF graphs based on a materialisation strategy that performs an offline analysis of the input graph in order to identify a subset of the exponential number of possible facet combinations that are candidates for indexing. In experiments over Wikidata, we demonstrate that materialisation allows for displaying (exact) faceted views over millions of diverse results in under a second while keeping index sizes relatively small. We also present initial usability studies over GraFa.
Visual embeddings from Convolutional Neural Networks (CNNs) trained on the ImageNet dataset for the ILSVRC challenge have shown consistently good performance for transfer learning and are widely used in several tasks, including image recommendation. However, some important questions have yet to be answered before these embeddings can be used in a larger scope of recommendation domains: (a) Are CNNs that perform better on ImageNet also better for transfer learning in content-based image recommendation? (b) Does fine-tuning help to improve performance? (c) What is the best way to perform the fine-tuning?
In this paper we compare several CNN models pre-trained on ImageNet to evaluate their transfer-learning performance on an artwork image recommendation task. Our results indicate that better performance in the ImageNet challenge does not always imply better transfer learning for recommendation tasks (e.g., NASNet vs. ResNet). Our results also show that fine-tuning can be helpful even with a small dataset, but not every fine-tuning works. Our results can inform other researchers and practitioners on how to train their CNNs for better transfer learning towards image recommendation systems.
The progression of action theories is an important problem in knowledge representation. Progression is second-order definable, and is known to be first-order definable and effectively computable for restricted classes of theories. Motivated by the fact that universal theories with constants (UTCs) are expressive and natural theories whose satisfiability is decidable, in this paper we provide a thorough study of the progression of situation calculus UTCs. First, we prove that the progression of a (possibly infinite) UTC is always first-order definable and results in a UTC. Though first-order definable, we show that the progression of a UTC may be infeasible, that is, it may result in an infinite UTC that is not equivalent to any finite set of first-order sentences. We then show that deciding whether there is a feasible progression of a UTC is undecidable. Moreover, we show that deciding whether a sentence (in an expressive fragment of first-order logic) is in the progression of a UTC is coNEXPTIME-complete, and that there exists a family of UTCs for which the size of every feasible progression grows exponentially. Finally, we discuss resolution-based approaches to computing the progression of a UTC. This comprehensive analysis contributes to a better understanding of progression in action theories, both in terms of feasibility and difficulty.
Blank nodes in RDF graphs can be used to represent values known to exist but whose identity remains unknown. A prominent example of such usage can be found in the Wikidata dataset where, e.g., the author of Beowulf is given as a blank node. However, while SPARQL considers blank nodes in a query as existentials, it treats blank nodes in RDF data more like constants. Running SPARQL queries over datasets with unknown values may thus lead to counter-intuitive results, which may make the standard SPARQL semantics unsuitable for datasets with existential blank nodes. We thus explore the feasibility of an alternative SPARQL semantics based on certain answers. In order to estimate the performance costs that would be associated with such a change in semantics for current implementations, we adapt and evaluate approximation techniques proposed in a relational database setting for a core fragment of SPARQL. To further understand the impact that such a change in semantics may have on query solutions, we analyse how this new semantics would affect the results of user queries over Wikidata.
Caching in the context of expressive query languages such as SPARQL is complicated by the difficulty of detecting equivalent queries: deciding if two conjunctive queries are equivalent is NP-complete, and adding further query features makes the problem undecidable. Despite this complexity, in this paper we propose an algorithm that performs syntactic canonicalisation of SPARQL queries such that the answers for the canonicalised query will not change versus the original. We can guarantee that the canonicalisations of two queries within a core fragment of SPARQL (monotone queries with select, project, join and union) are equal if and only if the two queries are equivalent; we also support other SPARQL features, but with a weaker soundness guarantee: that the (partially) canonicalised query is equivalent to the input query. Despite the fact that canonicalisation must be harder than the equivalence problem, we show the algorithm to be practical for real-world queries taken from SPARQL endpoint logs, and further show that it detects more equivalent queries than purely syntactic methods. We also present the results of experiments over synthetic queries designed to stress-test the canonicalisation method, highlighting difficult cases.
With the popularity of RDF as an independent data model came the need for specifying constraints on RDF graphs, and for mechanisms to detect violations of such constraints. One of the most promising schema languages for RDF is SHACL, a recent W3C recommendation. Unfortunately, the specification of SHACL leaves open the problem of validation against recursive constraints. This omission is important because SHACL by design favors constraints that reference other ones, which in practice may easily yield reference cycles.
In this paper, we propose a concise formal semantics for the so-called “core constraint components” of SHACL. This semantics handles arbitrary recursion, while being compliant with the current standard. Graph validation is based on the existence of an assignment of SHACL “shapes” to nodes in the graph under validation, stating which shapes are verified or violated, while verifying the targets of the validation process. We show in particular that the design of SHACL forces us to consider cases in which these assignments are partial, or, in other words, where the truth value of a constraint at some nodes of a graph may be left unknown.
Dealing with recursion also comes at a price, as validating an RDF graph against SHACL constraints is NP-hard in the size of the graph, and this lower bound still holds for constraints with stratified negation. Therefore we also propose a tractable approximation to the validation problem.
Two-level indexes have been widely used to handle trajectories of moving objects that are constrained to a network. The top-level of these indexes handles the spatial dimension, whereas the bottom level handles the temporal dimension. The latter turns out to be an instance of the interval-intersection problem, but it has been tackled by non-specialized spatial indexes. In this work, we propose the use of a compact data structure on the bottom level of these indexes. Our experimental evaluation shows that our approach is both faster and smaller than existing solutions.
We present the first solution to τ-majorities on tree paths. Given a tree of n nodes, each with a label from [1..σ], and a fixed threshold 0 < τ < 1, such a query gives two nodes u and v and asks for all the labels that appear more than τ·|P_uv| times in the path P_uv from u to v, where |P_uv| denotes the number of nodes in P_uv. Note that the answer to any query is of size up to 1/τ. On a w-bit RAM, we obtain a linear-space data structure with O((1/τ) log* n log log_w σ) query time. For any κ > 1, we can also build a structure that uses O(n log^[κ] n) space, where log^[κ] n denotes the function that applies the logarithm κ times to n, and answers queries in time O((1/τ) log log_w σ). The construction time of both structures is O(n log n). We also describe two succinct-space solutions with the same query time as the linear-space structure. One uses 2nH + 4n + o(n)(H+1) bits, where H ≤ lg σ is the entropy of the label distribution, and can be built in O(n log n) time. The other uses nH + O(n) + o(nH) bits and is built in O(n log n) time w.h.p.
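For intuition, the query itself is easy to answer by brute force in time linear in the path length; the paper's contribution is the far better query time with small space. A naive baseline, assuming a rooted tree given by parent pointers (our own illustration), might look like:

```python
from collections import Counter

def path(parent, u, v):
    # Nodes on the path u..v in a rooted tree given parent pointers.
    anc_u, x = [], u
    while x is not None:
        anc_u.append(x)
        x = parent[x]
    anc_index = {n: i for i, n in enumerate(anc_u)}
    down, x = [], v
    while x not in anc_index:
        down.append(x)
        x = parent[x]
    # x is now the lowest common ancestor of u and v.
    return anc_u[:anc_index[x] + 1] + down[::-1]

def tau_majorities(parent, label, u, v, tau):
    # Brute force: count labels along the path; O(|P_uv|) per query.
    p = path(parent, u, v)
    counts = Counter(label[n] for n in p)
    return {lab for lab, c in counts.items() if c > tau * len(p)}

# Small tree rooted at 0.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
label  = {0: 'a', 1: 'a', 2: 'b', 3: 'a', 4: 'c', 5: 'b'}
# Path 3..5 is 3-1-0-2-5 with labels a,a,a,b,b; for tau = 1/2
# only 'a' appears more than 2.5 times.
assert tau_majorities(parent, label, 3, 5, 0.5) == {'a'}
```

Note that, as the abstract states, the answer set can never exceed 1/τ labels, here at most 2.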
An important aspect of exploratory search over graph data is to understand what paths connect a given pair of nodes. Since the resulting paths can be manifold, various works propose ranking paths likely to be of interest to a user; these methods often rely on enumerating all such paths (up to a fixed length or number) before ranking is applied. In this paper, we instead propose applying a shortest path search on weighted versions of the graph in order to directly compute the most relevant path(s) between two nodes without fixed-length bounds, further obviating the need to enumerate irrelevant paths. We investigate weightings based on node degree, PageRank and edge frequency, contrasting the paths produced by these schemes over the Wikidata graph and discussing performance issues. Finally we conduct a user study over Wikidata where evaluators assess the quality of the paths produced; though inter-rater consensus on which paths are of most interest is low, we achieve statistically significant results to suggest that users find the weighted shortest paths more interesting than the baseline shortest paths without weights.
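As a sketch of the general approach (the concrete weighting schemes are the subject of the paper; the log(1 + degree) penalty below is just one plausible choice of our own), one can run Dijkstra's algorithm with edge costs that penalise high-degree nodes, so that paths through specific intermediate nodes are preferred over paths through generic hubs:

```python
import heapq
from math import log

def dijkstra(adj, weight, s, t):
    # Standard Dijkstra returning the cheapest s-t path.
    dist, prev, pq = {s: 0.0}, {}, [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            break
        if d > dist.get(u, float('inf')):
            continue
        for v in adj[u]:
            nd = d + weight(u, v)
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, x = [t], t
    while x != s:
        x = prev[x]
        path.append(x)
    return path[::-1]

# Toy undirected graph: 'hub' is a generic high-degree node (think of a
# very common entity); 'c' and 'd' are specific intermediate nodes.
edges = [('a', 'hub'), ('hub', 'b'), ('a', 'c'), ('c', 'd'), ('d', 'b')]
edges += [('hub', 'x%d' % i) for i in range(8)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Hypothetical weighting: entering a node costs log(1 + its degree).
w = lambda u, v: log(1 + len(adj[v]))
assert dijkstra(adj, lambda u, v: 1.0, 'a', 'b') == ['a', 'hub', 'b']
assert dijkstra(adj, w, 'a', 'b') == ['a', 'c', 'd', 'b']
```

The unweighted shortest path routes through the hub; under the degree penalty the slightly longer but more specific path wins, which is the behaviour the weighting schemes aim for.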
Experiment reproducibility is the ability to run an experiment, possibly with changes, and obtain results that are consistent with the original ones. To enable reproducibility, the scientific community encourages researchers to publish descriptions of these experiments. However, these recommendations do not include an automated way of creating such descriptions: normally scientists have to annotate their experiments in a semi-automated way. In this paper we propose a system to automatically describe computational environments used in in-silico experiments. We propose to use Operating System (OS) virtualization (containerization) to distribute software experiments as software images, together with an annotation system that allows these images to be described. The images are a minimal version of an OS (container) that allows the deployment of multiple isolated software packages within it.
Conjunctive queries (CQs) fail to provide an answer when the pattern described by the query does not exactly match the data. CQs might thus be too restrictive as a querying mechanism when data is semistructured or incomplete. The semantic web therefore provides a formalism—known as (projected) well-designed pattern trees (pWDPTs)—that tackles this problem: pWDPTs allow us to formulate queries that match parts of the query over the data if available, but do not ignore answers of the remaining query otherwise. Here we abstract away the specifics of semantic web applications and study pWDPTs over arbitrary relational schemas. Since the language of pWDPTs subsumes CQs, their evaluation problem is intractable. We identify structural properties of pWDPTs that lead to (fixed-parameter) tractability of various variants of the evaluation problem. We also show that checking if a pWDPT is equivalent to one in our tractable class is in 2EXPTIME. As a corollary, we obtain fixed-parameter tractability of evaluation for pWDPTs with such good behavior. Our techniques also allow us to develop a theory of approximations for pWDPTs.
We present three pumping lemmas for three classes of functions definable by fragments of weighted automata over the min-plus semiring and the semiring of natural numbers. As a corollary, we show that the hierarchy of functions definable by unambiguous, finitely-ambiguous, and polynomially-ambiguous weighted automata, and the full class of weighted automata, is strict for the min-plus semiring.
Domains for spatial and temporal data are often multigranular in nature, possessing a natural order structure defined by spatial inclusion and time-interval inclusion, respectively. This order structure induces lattice-like (partial) operations, such as join, which in turn lead to join rules, in which a single domain element (granule) is asserted to be equal to, or contained in, the join of a set of such granules. In general, the efficient representation of such join rules is a difficult problem. However, there is a very effective representation in the case that the rule is bigranular; i.e., all of the joined elements belong to the same granularity, and, in addition, complete information about the (non)disjointness of all granules involved is known. The details of that representation form the focus of the paper.
Ontology-based data access (OBDA) is a successful application of knowledge representation and reasoning technologies in information management systems. One premier goal is to facilitate access to data that is heterogeneous and incomplete. This is achieved via an ontology that enriches the user query, typically a union of conjunctive queries, with domain knowledge. It turned out that the ontology and the user query can be seen as two components of one composite query, called ontology-mediated query (OMQ).
The problem of answering OMQs is thus central to OBDA. There is a consensus that the required level of scalability in OMQ answering can be achieved by using standard database management systems. To this end, a standard approach used nowadays is query rewriting: the ontology O and the database query q are combined into a new query qO, the so-called rewriting, which gives the same answer as the OMQ consisting of O and q over all input databases. It is of course essential that the rewriting qO is expressed in a language that can be handled by standard database systems. The typical language that is considered is the class of first-order (FO) queries.
In this work, we focus on two central OMQ languages based on guarded and frontier-guarded tuple-generating dependencies (TGDs), and we study the problem of whether an OMQ is FO-rewritable, i.e., whether it can be equivalently expressed as a first-order query. Recall that a guarded (resp., frontier-guarded) TGD is a sentence of the form ∀x̄,ȳ (φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄)), where φ and ψ are conjunctions of relational atoms, and φ has an atom that contains all the variables of x̄ ∪ ȳ (resp., x̄) [1, 8]. Our goal is to develop specially tailored techniques that allow us to understand this non-trivial problem, and also to pinpoint its computational complexity. To this end, as we discuss below, we follow two different approaches. Our results can be summarized as follows:
-We first focus on the simpler OMQ language based on guarded TGDs and atomic queries, and, in Section 2, we provide a characterization of FO-rewritability that forms the basis for applying tree automata techniques.
-We then exploit, in Section 3, standard two-way alternating parity tree automata. In particular, we reduce our problem to the problem of checking the finiteness of the language of an automaton. The reduction relies on a refined version of the characterization of FO-rewritability established in Section 2. This provides a transparent solution to our problem based on standard tools, but it does not lead to an optimal result.
-Towards an optimal result, we use, in Section 4, a more sophisticated automata model, known as cost automata. In particular, we reduce our problem to the problem of checking the boundedness of a cost automaton. This allows us to show that FO-rewritability for OMQs based on guarded TGDs and atomic queries is in 2EXPTIME, and in EXPTIME for predicates of bounded arity. The complexity analysis relies on an intricate result on the boundedness problem for a certain class of cost automata [5, 9].
-Finally, in Section 5, by using the results of Section 4, we provide a complete picture for the complexity of our problem, i.e., deciding whether an OMQ based on (frontier-)guarded TGDs and arbitrary (unions of) conjunctive queries is FO-rewritable.
We focus on ontology-mediated queries (OMQs) based on (frontier-)guarded existential rules and (unions of) conjunctive queries, and we investigate the problem of FO-rewritability, i.e., whether an OMQ can be rewritten as a first-order query. We adopt two different approaches. The first approach employs standard two-way alternating parity tree automata. Although it does not lead to a tight complexity bound, it provides a transparent solution based on widely known tools. The second approach relies on a sophisticated automata model, known as cost automata. This allows us to show that our problem is 2EXPTIME-complete. In both approaches, we provide semantic characterizations of FO-rewritability that are of independent interest.
Automatic synthesis of software from specification is one of the classic problems in computer science. In the last decade, significant advances have been made in the synthesis of programs from specifications expressed in Linear Temporal Logic (LTL). LTL synthesis technology is central to a myriad of applications from the automated generation of controllers for Internet of Things devices, to the synthesis of control software for robotic applications. Unfortunately, the number of existing tools for LTL synthesis is limited, and using them requires specialized expertise. In this paper we present SynKit, a tool that offers LTL synthesis as a service. SynKit integrates a RESTful API and a web service with an editor, a solver, and a strategy visualizer.
This paper provides an overview of a model for capturing properties of client-server-based query computation setups. This model can be used to formally analyze different combinations of client and server capabilities, and compare them in terms of various fine-grain complexity measures. While the motivations and the focus of the presented work are related to querying the Semantic Web, the main concepts of the model are general enough to be applied in other contexts as well.
In this paper, we address the problem of LTL realizability and synthesis. State of the art techniques rely on so-called bounded synthesis methods, which reduce the problem to a safety game. Realizability is determined by solving synthesis in a dual game. We provide a unified view of duality, and introduce novel bounded realizability methods via reductions to reachability games. Further, we introduce algorithms, based on AI automated planning, to solve these safety and reachability games. This is the first complete approach to LTL realizability and synthesis via automated planning. Experiments illustrate that reductions to reachability games are an alternative to reductions to safety games, and show that planning can be a competitive approach to LTL realizability and synthesis.
Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. If a CQ is hard to evaluate, it is thus useful to evaluate an approximation of it in such fragments. While underapproximations (i.e., those that return correct answers only) are well-understood, the dual notion of overapproximations that return complete (but not necessarily sound) answers, and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations, is open. We develop a connection with existential pebble game tools that allows the systematic study of such problems. In particular, we show that the evaluation and identification of overapproximations can be solved in polynomial time. We also make progress in the problem of existence of overapproximations, showing it to be decidable in 2EXPTIME over the class of acyclic CQs. Furthermore, we look at when overapproximations do not exist, suggesting that this can be alleviated by using a more liberal notion of overapproximation. We also show how to extend our tools to study symmetric difference approximations. We observe that such approximations properly extend under- and over-approximations, settle the complexity of its associated identification problem, and provide several results on existence and evaluation.
We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.
LTL synthesis is the task of generating a strategy that satisfies a Linear Temporal Logic (LTL) specification interpreted over infinite traces. In this paper we examine the problem of LTLf synthesis, a variant of LTL synthesis where the specification of the behaviour of the strategy we generate is interpreted over finite traces — similar to the assumption we make in many planning problems, and important for the synthesis of business processes and other system interactions of finite duration. Existing approaches to LTLf synthesis transform LTLf into deterministic finite-state automata (DFA) and reduce the synthesis problem to a DFA game. Unfortunately, the DFA transformation is worst-case double-exponential in the size of the formula, presenting a computational bottleneck. In contrast, our approach exploits non-deterministic automata, and we reduce the synthesis problem to a non-deterministic planning problem. We leverage our approach not only for strategy generation but also to generate certificates of unrealizability — the first such method for LTLf. We employ a battery of techniques that exploit the structure of the LTLf specification to improve the efficiency of our transformation to automata. We combine these techniques with lazy determinization of automata and on-the-fly state abstraction. We illustrate the effectiveness of our approach on a set of established LTL synthesis benchmarks adapted to finite LTL.
Many efforts have been dedicated to identifying restrictions on ontologies expressed as tuple-generating dependencies (tgds), a.k.a. existential rules, that lead to the decidability of answering ontology-mediated queries (OMQs). This has given rise to three families of formalisms: guarded, non-recursive, and sticky sets of tgds. We study the containment problem for OMQs expressed in such formalisms, which is a key ingredient for solving static analysis tasks associated with them. Our main contribution is the development of specially tailored techniques for OMQ containment under the classes of tgds stated above. This enables us to obtain sharp complexity bounds for the problems at hand.
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages in order to locate the data that a user wants to extract from a text document, and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have good evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Towards this goal, we present a practical evaluation algorithm that allows constant delay enumeration of a spanner’s output after a precomputation phase that is linear in the document. While the algorithm assumes that the spanner is specified in a syntactic variant of variable set automata, we also study how it can be applied when the spanner is specified by general variable set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner, providing a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, all language proposals introduced so far are designed to output relations, thus making them incapable of handling incomplete information. To remedy the situation, we propose to extend information extraction languages with the ability to use mappings, thus allowing us to work with documents which have missing or optional parts. Using this approach, we simplify the semantics of regex formulas and extraction rules, two previously defined methods for extracting information. We extend them with the ability to handle incomplete data, and study how they compare in terms of expressive power. We also study computational properties of these languages, focusing on the query enumeration problem, as well as satisfiability and containment.
Since its invention, the Web has evolved into the largest multimedia repository that has ever existed. This evolution is a direct result of the explosion of user-generated content, explained by the wide adoption of social network platforms. The vast amount of multimedia content requires effective management and retrieval techniques. Nevertheless, Web multimedia retrieval is a complex task because users commonly express their information needs in semantic terms, but expect multimedia content in return. This dissociation between semantics and content of multimedia is known as the semantic gap. To solve this, researchers are looking beyond content-based or text-based approaches, integrating novel data sources. New data sources can consist of any type of data extracted from the context of multimedia documents, defined as the data that is not part of the raw content of a multimedia file. The Web is an extraordinary source of context data, which can be found in explicit or implicit relation to multimedia objects, such as surrounding text, tags, hyperlinks, and even in relevance-feedback. Recent advances in Web multimedia retrieval have shown that context data has great potential to bridge the semantic gap. In this article, we present the first comprehensive survey of context-based approaches for multimedia information retrieval on the Web. We introduce a data-driven taxonomy, which we then use in our literature review of the most emblematic and important approaches that use context-based data. In addition, we identify important challenges and opportunities, which had not been previously addressed in this area.
The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but also to understand what is happening in the world given the multiple points of view describing events. These challenges in journalism relate to Information Retrieval (IR) and Natural Language Processing (NLP) fields such as: verification of a source’s reliability; the integration of news with other sources of information; real-time processing of both news content and social streams; de-duplication of stories; and entity detection and disambiguation. Although IR and NLP have been applied to news for decades, the changing nature of the space requires fresh approaches and a closer collaboration with our colleagues from the journalism environment. Following the success of the previous edition of the workshop (NewsIR’16), the goal of this workshop, held in conjunction with ECIR 2018, is to continue to stimulate such discussion between the communities and to share interesting approaches to solve real user problems. A total of 19 submissions were received and reviewed, of which 12 were accepted for presentation. In addition, the workshop had over 30 registered participants, who attended two keynote talks given by well-known experts in the field, Edgar Meij (from industry) and Peter Tolmie (from academia), as well as oral and poster presentations of the accepted papers. The workshop also included a breakout session to discuss ideas for a future data challenge in news IR, and closed with a focused panel discussion to reflect on the day. In summary, several ideas were presented in the workshop on solving complex information needs in the news domain.
In addition, the workshop concluded with suggestions of important challenges and shared tasks to work on as a community for News IR.
The Modified Mercalli Intensity Scale is a measure of the severity of an earthquake for a non-scientist. Since the Mercalli scale is based on perceived effects, it has a strong dependence on observers. Typically, these reports take time to prepare and, as a consequence, Mercalli intensities are published hours after the occurrence of an earthquake. The National Seismological Center of Chile needs to provide a preliminary overview of the observed effects of an earthquake. This has motivated us to create a system for early tracking of people’s reactions in social networks in order to infer Mercalli intensities. By tracking people’s comments about the effects of an earthquake, a collection of Mercalli point estimates is retrieved at a county level of granularity. We introduce the concept of Reinforced Mercalli support, which combines Mercalli point estimates with social support, allowing us to discard socially unsupported estimates. Experimental results show that our proposal is accurate, providing early Mercalli reports 30 minutes after an earthquake and detecting the maximum Mercalli intensity of an event with high accuracy in terms of mean absolute error (MAE).
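A toy reading of the Reinforced Mercalli idea (our own illustration, not the authors' implementation): keep a county-level intensity estimate only when enough independent messages support it, so that isolated outliers are discarded:

```python
from collections import Counter

def reinforced_mercalli(estimates, min_support=3):
    # estimates: one (county, intensity) point estimate per geolocated
    # message. An intensity is kept for a county only if enough
    # independent messages support it (its "social support"); the county
    # then reports its maximum supported intensity.
    support = Counter(estimates)
    report = {}
    for (county, intensity), n in support.items():
        if n >= min_support:
            report[county] = max(report.get(county, 0), intensity)
    return report

msgs = [('Santiago', 6)] * 5 + [('Santiago', 9)] + [('Valparaiso', 5)] * 3
# The lone intensity-9 estimate for Santiago lacks social support
# and is discarded; each county reports its supported maximum.
assert reinforced_mercalli(msgs) == {'Santiago': 6, 'Valparaiso': 5}
```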
In general, existing methods for automatically detecting emergency situations using Twitter rely on features based on domain-specific keywords found in messages. Such keyword-based methods usually require training on domain-specific labeled data, in multiple languages, and for different types of events (e.g., earthquakes, floods, wildfires, etc.). In addition to being costly, these approaches may fail to detect previously unexpected situations, such as uncommon catastrophes or terrorist attacks. However, collective mentions of certain keywords are not the only type of self-organizing phenomenon that may arise in social media when a real-world extreme situation occurs. Just as nearby physical sensors become activated when stimulated, localized citizen sensors (i.e., users) will also react in a similar manner. To leverage this information, we propose to use self-organized activity related to geolocations to identify emergency situations. We propose to detect such events by tracking the frequencies and the probability distributions of the interarrival times of messages related to specific locations. Using an off-the-shelf classifier that is independent of domain-specific features, we study and describe emergency situations based solely on location-based features in messages. Our findings indicate that anomalies in location-related social media user activity indeed provide information for automatically detecting emergency situations, independent of their domain.
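A minimal sketch of the location-based signal, assuming timestamps in seconds and a simple threshold rule in place of the trained classifier used in the paper:

```python
def mean_interarrival(timestamps):
    # Average gap between consecutive messages for one location.
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)

def is_anomalous(baseline_ts, window_ts, factor=5.0):
    # Flag a location when messages mentioning it suddenly arrive much
    # faster than its historical rate: a domain-independent signal that
    # needs no event-specific keywords. (Illustrative threshold rule;
    # the paper feeds such location-based features to a classifier.)
    return mean_interarrival(window_ts) * factor < mean_interarrival(baseline_ts)

quiet = [0, 600, 1250, 1900, 2400]       # one message every ~10 minutes
burst = [3000, 3004, 3007, 3011, 3016]   # seconds apart after an event
assert not is_anomalous(quiet, quiet)
assert is_anomalous(quiet, burst)
```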
Some decades have passed since the concept of “named entity” was used for the first time. Since then, new lines of research have emerged in this area, such as linking the (named) entity mentions in a text collection with their corresponding knowledge-base entries.
However, there is still no consensus in the literature on the definition of the concept of “entity”. This paper aims to highlight the importance of formalizing the concept of “entity” and the benefits this would bring to the Entity Linking community, in particular relating to the construction of gold standards for evaluation purposes.
Validating RDF graphs against constraints has gained interest in recent years, due to the popularity of RDF and the growth of knowledge bases. SHACL, a constraint language for RDF, has recently become a W3C recommendation, with a specification detailing syntax, semantics and common use cases. Unfortunately, this (otherwise complete) specification does not cover validation against recursive constraints. This omission is important, because SHACL by design favors constraint references. We investigate the possibility of a formal semantics for SHACL which covers the recursive case, while being compliant with the current standard.
Most of the current graph database systems have been designed to support property graphs. Surprisingly, there is no standard specification of the database model behind such systems. This paper presents a formal definition of the property graph database model. Specifically, we define the property graph data structure, basic notions of integrity constraints (e.g. graph schema), and a graph query language.
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a similarity join of a dataset with itself, which brings together each element with its nearest neighbor within the same dataset. Solving the problem with a simple brute-force algorithm requires O(n²) distance calculations, since it compares every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximate solution for the self-similarity join and computes only O(n^{3/2}) distances. We show how the algorithm can be easily modified to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and so that 79% of the results are within the 10-NN, with no significant extra distance computations. We show how the algorithm can be executed in parallel and prove that, using Θ(√n) processors, the total execution takes linear time. We end by discussing ways in which the algorithm can be improved in the future.
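One naive way to land within the O(n^{3/2}) distance budget (a deliberately simple stand-in of our own for the paper's divide-and-conquer heuristic) is to split the data into blocks of about √n elements and solve each block by brute force; most true nearest neighbours survive, at the cost of missing pairs that straddle block boundaries:

```python
import math
import random

def brute_1nn_join(points, dist):
    # Exact 1-NN self-join: O(n^2) distance computations.
    return {p: min((q for q in points if q != p), key=lambda q: dist(p, q))
            for p in points}

def blocked_1nn_join(points, dist):
    # Pair each element with its nearest neighbour inside its own
    # sqrt(n)-sized block only: about n*sqrt(n) distances in total.
    pts = sorted(points)            # cheap locality heuristic (1-D data here)
    b = max(2, math.isqrt(len(pts)))
    out = {}
    for i in range(0, len(pts), b):
        block = pts[i:i + b]
        if len(block) >= 2:         # skip a trailing singleton block
            out.update(brute_1nn_join(block, dist))
    return out

random.seed(0)
data = [random.uniform(0, 100) for _ in range(64)]
d = lambda p, q: abs(p - q)
exact = brute_1nn_join(data, d)
approx = blocked_1nn_join(data, d)
hits = sum(exact[p] == approx.get(p) for p in data)
assert hits > len(data) // 2    # most true 1-NNs survive the approximation
```

With 64 points and blocks of 8, the blocked join computes roughly 64·8 distances against 64·63 for brute force, illustrating the quadratic-versus-n^{3/2} trade-off the abstract describes.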
AMW 2018, Alberto Mendelzon Workshop on Foundations of Data Management.
Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management Cali, Colombia, May 21-25, 2018.
Dan Olteanu, University of Oxford, UK
Bárbara Poblete, University of Chile, Chile.
Graph-based data models have become increasingly common in data management scenarios that require flexibility beyond what is offered by traditional relational databases. Such flexibility is particularly important in Web scenarios, where potentially many users may be involved (either directly or indirectly) in the creation, management, and curation of data. An example of such a scenario is the Wikidata knowledge graph, where users can add new properties and types that can be used to define further data. The flip-side of flexibility is higher levels of heterogeneity. Conceptually understanding the current state of a knowledge graph – in terms of what data it contains, what it is missing, how it can be effectively queried, what has changed recently, etc. – is thus a major challenge: it is unclear how to distil an adequate, high-level description that captures an actionable overview of knowledge graphs. We thus need well-founded methodologies to make sense of knowledge graphs, where an obvious approach is to define some notion of schema for such graphs.
The traditional approach in the Semantic Web has been what Pham and Boncz call the schema first approach, which defines the schema that the data should follow. The most established language for specifying such schemas is RDFS. An alternative to the schema first approach is the schema last approach, which foregoes an upfront schema and rather lets the data evolve naturally; thereafter, the goal is to understand what the legacy graph data contain by extracting high-level summaries that characterise the graph, resulting in a data-driven schema. In this paper, we summarise recently published results on a novel approach to compute a data-driven schema from knowledge graphs. We believe that such schemas are useful for understanding what a knowledge graph contains, and how it can be queried, among several other use-cases. Nevertheless, in this work we focus on the use-case of predicting how the knowledge graph will evolve in future versions, which could be used for measuring the time-to-live of cached SPARQL results, identifying missing properties for entities, etc.
We consider the problem of determining whether organizations facing a new data-transformation task can avoid building a new transformation procedure from scratch by reusing their stored procedures. Because it can be difficult to obtain exact descriptions of what stored procedures do, our framework abstracts data-transforming tools as black-box procedures, in which a procedure description indicates the parts of the database that might be modified by the procedure and constraints on the states of the database that must hold before and after the application of this procedure.
In this paper we present our framework and study the problem of determining, given a database and a set of procedures, whether there is a sequence of procedures from this set such that their application to the database results in the satisfaction of a boolean query. This data readiness problem is undecidable in general, but we show decidability for a broad and realistic class of procedures.
GraphQL is a recently proposed, and increasingly adopted, conceptual framework for providing a new type of data access interface on the Web. The framework includes a new graph query language whose semantics has been specified informally only. This has prevented the formal study of the main properties of the language. We embark on the formalization and study of GraphQL. To this end, we first formalize the semantics of GraphQL queries based on a labeled-graph data model. Thereafter, we analyze the language and show that it admits highly efficient evaluation methods. In particular, we prove that the complexity of the GraphQL evaluation problem is NL-complete. Moreover, we show that the enumeration problem can be solved with constant delay. This implies that a server can answer a GraphQL query and send the response byte-by-byte while spending just a constant amount of time between every byte sent. Despite these positive results, we prove that the size of a GraphQL response might be prohibitively large for an internet scenario. We present experiments showing that current practical implementations suffer from this issue. We provide a solution to cope with this problem by showing that the total size of a GraphQL response can be computed in polynomial time. Our results on polynomial-time size computation plus the constant-delay enumeration can help developers to provide more robust GraphQL interfaces on the Web.
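The potential response blow-up is easy to reproduce with a toy evaluator over an edge-labelled graph (a simplification of the paper's formal model, not its actual definitions): a nested query of size linear in d, over a graph where every node has k out-edges, yields a response with k^d leaves:

```python
def evaluate(graph, node, query):
    # Minimal GraphQL-style evaluator over an edge-labelled graph.
    # `query` is a dict {field: subquery-or-None}; a field follows all
    # edges with that label from the current node.
    result = {}
    for field, sub in query.items():
        targets = graph.get((node, field), [])
        result[field] = ([evaluate(graph, t, sub) for t in targets]
                         if sub else list(targets))
    return result

def leaves(r):
    # Number of scalar leaves in a response.
    if isinstance(r, dict):
        return sum(leaves(v) for v in r.values())
    if isinstance(r, list):
        return sum(leaves(v) for v in r)
    return 1

nodes = ['a', 'b', 'c']
graph = {(u, 'knows'): nodes for u in nodes}   # everyone knows everyone

query = None
for _ in range(5):               # nest {knows: {knows: ...}} 5 levels deep
    query = {'knows': query}
assert leaves(evaluate(graph, 'a', query)) == 3 ** 5   # 243 leaves
```

A query of a few dozen characters already materialises hundreds of leaves here, which is why estimating the response size before evaluation, as the paper proposes, matters in practice.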
Despite its importance to the Web, multimedia content is often neglected when building and designing knowledge-bases: though descriptive metadata and links are often provided for images, video, etc., the multimedia content itself is often treated as opaque and is rarely analysed. IMGpedia is an effort to bring together the images of Wikimedia Commons (including their visual information) and relevant knowledge-bases such as Wikidata and DBpedia. The result is a knowledge-base that incorporates similarity relations between the images based on visual descriptors, as well as links to the resources of Wikidata and DBpedia that relate to each image. Using the IMGpedia SPARQL endpoint, it is then possible to perform visuo-semantic queries that combine the semantic facts extracted from the external resources with the similarity relations between the images. This paper presents a new web interface for browsing and exploring the IMGpedia dataset in a more user-friendly manner, as well as new visuo-semantic queries that can be answered using 6 million recently added links from IMGpedia to Wikidata. We also discuss future directions we foresee for the IMGpedia project.
We study the problem of building knowledge maps of graph-like information. In the digital era, and much like the Earth, the Web is simply too large and its interrelations too complex for anyone to grasp much of it through direct observation; the problem of applying cartographic principles to digital landscapes is thus intriguing. We introduce a mathematical formalism that captures the general notion of a map of a graph and enables maps to be developed and manipulated in a semi-automated way. We describe an implementation of our formalism on the Web of Linked Data graph and discuss algorithms that efficiently generate and combine (via an algebra) regions and maps. Finally, we present examples of knowledge maps built with a tool implementing our framework.
Tim Berners-Lee, Sarven Capadisli, Stefan Dietze, Aidan Hogan, Krzysztof Janowicz, Jens Lehmann: LDOW@WWW 2018, Lyon, France, April 23, 2018. CEUR Workshop Proceedings 2073, CEUR-WS.org 2018.
Linear Temporal Logic (LTL) synthesis can be understood as the problem of building a controller that defines a winning strategy for a two-player game against the environment, in which the objective is to satisfy a given LTL formula. It is an important problem with applications in software synthesis, including controller synthesis. In this paper we establish the correspondence between LTL synthesis and fully observable non-deterministic (FOND) planning. We study LTL interpreted over both finite and infinite traces, and we provide the first explicit compilation that translates an LTL synthesis problem into a FOND problem. Experiments with state-of-the-art FOND planners and LTL synthesis solvers show automated planning to be a viable and effective tool for highly structured LTL synthesis problems.
Info-boxes provide a summary of the most important meta-data relating to a particular entity described by a Wikipedia article. However, many articles have no info-box or have info-boxes with only minimal information, and there is a huge disparity between the level of detail available in info-boxes for English articles and those for other languages. Wikidata has been proposed as a central repository of facts that could address such disparities, and it has been used as a source of information for generating info-boxes. However, current processes still rely on human intervention, either to create generic templates for entities of a given type or to create a specific info-box for a specific article in a specific language. As such, there are still many Wikipedia articles without info-boxes for which relevant data are nevertheless provided by Wikidata. In this paper, we investigate fully automatic methods to generate info-boxes for Wikipedia from the Wikidata knowledge graph. The primary challenge is to create ranking mechanisms that provide an intuitive prioritisation of the facts associated with an entity. We discuss this challenge, propose several straightforward metrics to prioritise the information in info-boxes, and present an initial user evaluation comparing the quality of info-boxes generated by the various metrics.
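As an illustration of what such a metric might look like (the data and the metric here are hypothetical, not necessarily those used in the paper), one straightforward option is to rank an entity's properties by how frequently each property appears across entities of the same type, so that "typical" facts float to the top of the info-box:

```python
from collections import Counter

# Hypothetical sketch of one simple ranking metric: order an entity's
# properties by how often each property is used across all entities of
# the same type. The actual metrics proposed in the paper may differ.

entities = {  # toy stand-in for entities of one Wikidata type
    "Q1": {"population": 1000, "mayor": "x", "area": 20},
    "Q2": {"population": 3000, "area": 40},
    "Q3": {"population": 5000},
}

def rank_properties(entity_id):
    # Frequency of each property across all entities of this type.
    freq = Counter(p for props in entities.values() for p in props)
    # Sort this entity's properties from most to least common.
    return sorted(entities[entity_id], key=lambda p: -freq[p])

print(rank_properties("Q1"))  # ['population', 'area', 'mayor']
```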
Navigational queries over RDF data are viewed as one of the main applications of graph query languages, and yet the standard model of graph databases—essentially labeled graphs—is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning.
Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our language with previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide an implementation of recursive TriAL on top of a relational query engine, and we show its usefulness by running a wide array of navigational queries over real-world RDF data, while at the same time testing how our implementation compares to existing RDF systems.
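The closure idea behind the Triple Algebra can be sketched as follows (notation heavily simplified; see the paper for the actual operators): a join takes two sets of triples, an equality condition on component positions, and an output specification choosing three of the six available components, so the result is again a set of triples rather than a set of 6-tuples:

```python
# Illustrative sketch of TriAL-style closure: positions 0-2 index the
# components of the left triple, positions 3-5 those of the right one.
# The join keeps pairs satisfying the equality condition and projects
# three chosen components, so triples in, triples out.

def triple_join(R, S, cond, out):
    result = set()
    for r in R:
        for s in S:
            both = r + s  # six components
            if all(both[i] == both[j] for i, j in cond):
                result.add(tuple(both[k] for k in out))
    return result

# Example: compose edges by joining the object of the left triple with
# the subject of the right one, outputting
# (left subject, right predicate, right object).
R = {("a", "knows", "b")}
S = {("b", "worksAt", "c")}
print(triple_join(R, S, cond=[(2, 3)], out=(0, 4, 5)))
# {('a', 'worksAt', 'c')}
```

Iterating such a join yields the kind of reachability over triples that, as the abstract notes, has no natural meaning in the plain labeled-graph model.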
The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
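As a hypothetical example of the kind of metric involved (the paper's actual metrics may differ), one can measure how many triples share a (subject, predicate) pair and differ only in their object, a simple form of structural redundancy that compact RDF representations and indexes can exploit:

```python
# Hypothetical redundancy metric (not necessarily the paper's): the
# average number of triples per distinct (subject, predicate) pair.
# Values above 1 mean many triples repeat subject and predicate and
# differ only in the object.

def sp_redundancy(triples):
    sp_pairs = {(s, p) for s, p, o in triples}
    return len(triples) / len(sp_pairs)

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
]
print(sp_redundancy(triples))  # 1.5
```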
In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is highly relevant, as it can provide fundamental insights into protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets users design, search for, and analyze protein-ligand structural patterns in the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is designed graphically as a graph whose nodes represent the components of a protein and whose edges represent structural relationships between them. The resulting graph pattern is translated into a SQL query and executed in a PostgreSQL database system where the PDB data are stored. The results of the search are presented in a textual representation, and the corresponding binding sites can be visualized using a JSmol interface.
Community Question Answering (cQA) sites have emerged as platforms designed specifically for the exchange of questions and answers among users. Although users tend to find good-quality answers on cQA sites, they also engage in a significant volume of QA interactions on other platforms, such as microblog networking sites. This is explained in part by the fact that microblog platforms contain up-to-date information on current events, provide rapid information propagation, and benefit from social trust.
Despite the potential of microblog platforms, such as Twitter, for automatic QA retrieval, it is not clear how to leverage them for this task. Unique characteristics differentiate Twitter from traditional cQA platforms (e.g., short message length, and low-quality and noisy information), which prevent the direct application of prior findings in the area. In this work, we address this problem by studying: (1) the feasibility of Twitter as a QA platform and (2) the discriminating features that identify relevant answers to a particular query. In particular, we create a document model at the conversation-thread level, which enables us to aggregate microblog information, and we set up a learning-to-rank framework, using factoid QA as a proxy task. Our experimental results show that microblog data can indeed be used to perform QA retrieval effectively. We identify domain-specific features, and combinations of those features, that better account for improving QA ranking, achieving an MRR of 0.7795 (a 62% improvement over our baseline method). In addition, we provide evidence that our method allows complex answers to be retrieved for non-factoid questions.
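The MRR figure reported here is the standard mean reciprocal rank; for clarity, a minimal implementation:

```python
# Mean reciprocal rank: for each query, take the reciprocal of the rank
# of the first relevant result (0 if none is retrieved), and average
# over all queries.

def mean_reciprocal_rank(results):
    """`results` is a list of ranked result lists of booleans,
    one per query, where True marks a relevant answer."""
    total = 0.0
    for ranked in results:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Three queries: first relevant hit at rank 1, at rank 2, and never.
runs = [[True, False], [False, True, False], [False, False]]
print(mean_reciprocal_rank(runs))  # 0.5
```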
Dyaa Albakour, David Corney, Julio Gonzalo, Miguel Martinez, Barbara Poblete, Andreas Valochas: Proceedings of the Second International Workshop on Recent Trends in News Information Retrieval co-located with the 40th European Conference on Information Retrieval (ECIR 2018), Grenoble, France, March 26, 2018. CEUR Workshop Proceedings 2079, CEUR-WS.org 2018.
We consider the containment problem for regular queries with memory and regular queries with data tests: two recently proposed query languages for graph databases that, in addition to allowing the user to ask topological queries, also track how the data changes along paths connecting various points in the database. Our results show that the problem is undecidable in general. However, by allowing only positive data comparisons we find natural fragments with better static analysis properties: the containment problem is PSpace-complete in the case of regular queries with data tests and ExpSpace-complete in the case of regular queries with memory.
Highlights:
- Study of graph query languages that can deal with data values.
- The containment problem for these languages is undecidable in general.
- Containment is decidable if one focuses on languages that can only check for equalities.
- The proofs make use of automata models.
This book constitutes the refereed post-conference proceedings of the 22nd Iberoamerican Congress on Pattern Recognition, CIARP 2017, held in Valparaíso, Chile, in November 2017.
The 87 papers presented were carefully reviewed and selected from 156 submissions. The papers feature research results in the areas of pattern recognition, image processing, computer vision, multimedia and related fields.
3D objects are an important type of data with many applications in domains such as Engineering and Computer Aided Design, Science, Simulation, Visualization, Cultural Heritage, and Entertainment. Technological progress in acquisition, modeling, processing, and dissemination of 3D geometry leads to the accumulation of large repositories of 3D objects. Consequently, there is a strong need to research and develop technology to support the effective retrieval of 3D object data from 3D repositories.