Probabilistic data cleaning

The goal of this project is to study and develop probabilistic data cleaning techniques. Data cleaning refers to the process of detecting and repairing errors, duplicates and anomalies in data. In response to the large amounts of "dirty" data in today's digital society, the data quality problem is attracting considerable interest from various disciplines in computer science. For instance, since most data resides in databases, efficient database techniques have been developed to improve the quality of data. These database techniques are mostly non-probabilistic in the sense that data is either clean or dirty, two objects are either the same or different, and repairs of the data are "one-shot". That is, a single cleaned repair of the data is returned to the user, without any information on (a) why this repair is returned; (b) how reliable this repair is; and (c) whether other possible repairs of comparable quality exist. Clearly, such information is of great importance for assessing the quality of these techniques. What is needed is a probabilistic approach to the data quality problem that provides guarantees about the decisions made during the cleaning process. The cornerstone of this project is the observation that many problems studied in probabilistic logic have a direct counterpart in research on data quality in databases, and vice versa. In this project we leverage these relationships to provide a solid foundation for data quality in a probabilistic setting. (FWO project with University of Leuven)
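The contrast between a one-shot repair and a probabilistic one can be illustrated with a minimal sketch. Everything here is hypothetical (the `Repair` class, the scoring of candidates, the toy record): it only shows the idea that, instead of a single cleaned tuple, the system returns a distribution over candidate repairs from which reliability and alternatives can be read off.

```python
# Hypothetical sketch: a one-shot cleaner returns a single repaired tuple;
# a probabilistic cleaner returns candidate repairs with confidences.
from dataclasses import dataclass

@dataclass
class Repair:
    value: dict        # the repaired tuple
    probability: float # confidence that this repair is the correct one

def probabilistic_repair(candidates_with_scores):
    """Normalize candidate scores into a distribution over repairs."""
    total = sum(score for _, score in candidates_with_scores)
    return [Repair(value=c, probability=s / total)
            for c, s in candidates_with_scores]

# A dirty record with an inconsistent city/zip pair, and two candidate fixes
# (the scores 0.6 and 0.4 are made up for illustration).
dirty = {"name": "Ann", "city": "Brussel", "zip": "3000"}
candidates = [
    ({"name": "Ann", "city": "Leuven",   "zip": "3000"}, 0.6),
    ({"name": "Ann", "city": "Brussels", "zip": "1000"}, 0.4),
]
repairs = probabilistic_repair(candidates)
best = max(repairs, key=lambda r: r.probability)
```

A one-shot system would return only `best.value`; the probabilistic output additionally exposes how reliable that choice is and which alternative repairs exist.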

Querying distributed dynamic data collections

The ability to seamlessly query data residing in several distinct and distributed data sources has been a major driving force in database research and led to the development of the first distributed database systems as early as the 1980s. Nowadays, our information society is characterized by data that is inherently dispersed but not necessarily residing in actually interconnected databases. Consider, for instance, the many, mostly independently operated, data collections in the life sciences, where connections among data sources can either be hard-coded as explicit links or be implicit in the semantics of the data. While each such collection can be queried through various portals, there is no centralized portal that provides an integrated interface to all of the data. An important reason for the lack of such centralized systems is the continuing growth in the number of data sources on the Web, stemming from the ease with which scientists can provide data online. The combined data therefore spans a network of linked databases in which data sources can dynamically be added and removed, and whose contents can be changed, copied, transformed or even revoked without notice. In this project we will study and develop techniques for querying such dynamic distributed data collections. Our approach is based on three pillars: (1) the study of navigational query languages for linked data; (2) the study of distributed computing methods for distributed query evaluation; and (3) the use of provenance as a mechanism for monitoring alterations to data sources. (FWO project with Hasselt University)
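The first pillar, navigational querying of linked data, can be sketched in a few lines. This is a hypothetical toy: the graph, the node identifiers (`gene:BRCA1`, etc.) and the `eval_path` helper are invented for illustration. It shows the core navigational idea of following a sequence of typed links from a set of starting records, regardless of which source each record lives in.

```python
# Hypothetical sketch: evaluating a simple navigational query, given as a
# sequence of edge labels, over a toy graph of linked records.

def eval_path(graph, start_nodes, labels):
    """Follow a sequence of edge labels from the start nodes and
    return the set of nodes reachable via exactly that label path."""
    frontier = set(start_nodes)
    for label in labels:
        frontier = {dst
                    for src in frontier
                    for (lbl, dst) in graph.get(src, [])
                    if lbl == label}
    return frontier

# Toy graph: nodes stand for records in (possibly different) sources,
# edges are typed links between them.
graph = {
    "gene:BRCA1":      [("encodes", "protein:P38398")],
    "protein:P38398":  [("interacts", "protein:Q86YC2")],
}
result = eval_path(graph, {"gene:BRCA1"}, ["encodes", "interacts"])
# result holds the proteins reached by first following an "encodes"
# link and then an "interacts" link.
```

Real navigational languages (e.g. regular path queries) generalize this to full regular expressions over labels; the sketch only shows the concatenation case.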

Computational models for big data algorithms

A central theme in computer science is the design of efficient algorithms. However, recent experiments show that many standard algorithms degrade significantly in the presence of big data. This is particularly true when evaluating classes of queries in the context of databases. Unfortunately, existing theoretical tools for analyzing algorithms cannot tell whether or not an algorithm will be feasible on big data. Indeed, algorithms that are considered tractable in the classical sense are no longer tractable when big data is concerned. This calls for revisiting classical complexity-theoretic notions. The development of a formal foundation and an accompanying computational complexity theory for studying tractability in the context of big data is the main goal of this project.
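The gap between classical and big-data tractability can be made concrete with a small, hypothetical back-of-the-envelope example: a quadratic-time algorithm is polynomial, hence "tractable" classically, yet its operation count explodes at big-data scale, while a linear-time variant of the same task stays feasible. The two functions below only count operations; the duplicate-detection task is chosen for illustration.

```python
# Hypothetical illustration: counting the basic operations of a
# classically tractable O(n^2) algorithm versus an O(n) alternative.

def quadratic_duplicate_check(n):
    """O(n^2): compare every pair of items; returns the comparison count."""
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):
            comparisons += 1
    return comparisons

def linear_duplicate_check(items):
    """O(n): one hash-table probe per item; returns the probe count."""
    seen, probes = set(), 0
    for x in items:
        probes += 1
        seen.add(x)
    return probes

# At n = 10^9 records, the quadratic variant would need roughly
# n*(n-1)/2, i.e. about 5 * 10^17 comparisons, which is infeasible,
# whereas the linear variant needs only 10^9 probes.
```

This is exactly the kind of distinction that classical polynomial-time tractability fails to capture and that a big-data complexity theory must account for.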