
Research

I am currently a professor of Data Engineering in the Applications of Parallel and Distributed Systems department at the University of Stuttgart. The overarching goal of my research is to automatically prepare and engineer massive heterogeneous data for use in various real-life applications.

Research at the University of Stuttgart

My department at the University of Stuttgart focuses on four main research areas: data transparency, complex data processing, data wrangling, and data exploration and analysis. We conduct both foundational research and system- and application-oriented research.

Research prior to joining the University of Stuttgart

In the following, I summarize various projects I was or have been involved in prior to joining the University of Stuttgart.

Data Provenance

Why-Not provenance. Our work on Why-Not provenance focuses on theory and algorithms to explain why some tuples are not part of a query result, even though developers or users expected them to be. We have developed various algorithms producing instance-based explanations (Artemis [VLDB09, VLDB10]), query-based explanations (NedExplain [EDBT14], Ted [TAPP14, CIKM15]), and hybrid explanations (Conseil [CIKM13, JDIQ15]). The algorithms developed here are part of the Nautilus system (see below).
Work related to Why-Not provenance has been partly funded by the Eliteprogramm für Postdoktorandinnen und Postdoktoranden der Baden-Württemberg Stiftung and by an AAP Grant of Université Paris Sud.
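
To give the flavor of a query-based explanation, here is a minimal sketch in Python with an invented pipeline format and example data (it is not the NedExplain algorithm): the query is modeled as a sequence of named selection operators, and the explanation names the first operator that rejects the expected tuple.

    # Minimal sketch of a query-based why-not explanation. A query is
    # modeled as a pipeline of named selection predicates; the explanation
    # is the first "picky" operator that rejects the expected tuple.
    # Illustrative only -- not the NedExplain algorithm.

    def explain_why_not(pipeline, expected_tuple):
        """Return the name of the first operator filtering out the tuple,
        or None if the tuple survives the whole pipeline."""
        for name, predicate in pipeline:
            if not predicate(expected_tuple):
                return name
        return None

    # Hypothetical query: SELECT * FROM people WHERE age >= 18 AND city = 'Paris'
    pipeline = [
        ("sigma_age",  lambda t: t["age"] >= 18),
        ("sigma_city", lambda t: t["city"] == "Paris"),
    ]
    # Why is the 17-year-old Parisian missing from the result?
    print(explain_why_not(pipeline, {"age": 17, "city": "Paris"}))  # sigma_age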

Data Integration

SPIMBench. The Semantic Publishing Instance Matching Benchmark (SPIMBench) is an instance matching (IM) benchmark for the assessment of IM techniques for RDF data with an associated schema. Essentially, SPIMBench proposes and implements: (i) a set of test cases based on transformations that distinguish different types of matching entities, (ii) a scalable data generator, (iii) a gold standard documenting the matches that IM systems should find, and (iv) evaluation metrics. SPIMBench extends state-of-the-art IM benchmarks for RDF data in three main aspects: it allows for systematic scalability testing, supports a wider range of test cases, and provides an enriched gold standard. Recent publications include [ISWC15, WWW15] and an ISWC 2014 tutorial.
SPIMBench is a collaboration with ICS-FORTH, Crete (Greece), and the University of Leipzig (Germany).
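
As a rough illustration of how such a benchmark can be organized, the sketch below (with invented identifiers and a made-up value transformation, not SPIMBench's actual implementation) derives a matching target entity from a source entity and records the pair in the gold standard.

    # Minimal sketch of an instance-matching test case in the spirit of
    # SPIMBench: a value transformation turns a source entity into a
    # matching target entity, and the pair is recorded in the gold
    # standard that IM systems must recover. Purely illustrative.

    def abbreviate(value):
        """Value transformation: abbreviate all but the last word."""
        words = value.split()
        return " ".join([w[0] + "." for w in words[:-1]] + words[-1:])

    def generate_test_case(source_id, source_props, transform):
        target_id = source_id + "-copy"
        target_props = {prop: transform(v) for prop, v in source_props.items()}
        gold_entry = (source_id, target_id)  # the match systems should find
        return target_id, target_props, gold_entry

    tid, tprops, gold = generate_test_case(
        "ex:person1", {"foaf:name": "Ada King Lovelace"}, abbreviate)
    print(tprops)  # {'foaf:name': 'A. K. Lovelace'}
    print(gold)    # ('ex:person1', 'ex:person1-copy')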

Datalyse. The goal of Datalyse is to provide models, algorithms, and tools for Big Data analytics applications. The project is built around three use cases: Open Data analytics, retail data analysis (including an analysis of social network data referring to retail products), and the analysis of monitoring data (for security, energy consumption, etc.) collected in a data center. Datalyse will provide conceptual perspectives and a library of software primitives based on these concepts for storing, indexing, analyzing, and refining various classes of “Big Data”. In this context, I am particularly interested in entity resolution and data fusion.
INRIA OAK participates in the national “Datalyse” project, financed within the “Investissement d’Avenir” call “Cloud & Big Data 2012”.
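
As a toy illustration of the data fusion step (with made-up records and a simple majority-vote strategy, not the project's actual code), consider merging records that entity resolution has already grouped as referring to the same real-world entity:

    # Minimal sketch of data fusion: after entity resolution has grouped
    # records describing the same entity, conflicting attribute values
    # are resolved here by a simple majority vote. Purely illustrative.

    from collections import Counter

    def fuse(records):
        """Merge duplicate records attribute by attribute via majority vote."""
        fused = {}
        keys = {k for r in records for k in r}
        for k in keys:
            values = [r[k] for r in records if r.get(k) is not None]
            if values:
                fused[k] = Counter(values).most_common(1)[0][0]
        return fused

    duplicates = [
        {"name": "J. Doe",   "city": "Lyon"},
        {"name": "John Doe", "city": "Lyon"},
        {"name": "John Doe", "city": None},
    ]
    print(fuse(duplicates))  # {'name': 'John Doe', 'city': 'Lyon'}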

Merging Autonomous Content with HumMer. The Merging Autonomous Content project focused on foundations, algorithms, and tools for database-style data integration, i.e., defining integration operators and optimizing integration tasks composed of these operators. I was particularly involved in the entity resolution operator. The HumMer system served as a testbed for implementing and testing new ideas concerning the information integration process. In this context, we also developed XStruct, XQueryGen, and DirtyXML.
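
To illustrate the core of such an operator, here is a minimal entity resolution sketch (invented records, an off-the-shelf difflib similarity, and a fixed threshold; HumMer's actual operator is considerably more elaborate):

    # Minimal sketch of an entity resolution step: normalize records,
    # compare all pairs with a string similarity, and link pairs whose
    # similarity exceeds a threshold. Illustrative only.

    from difflib import SequenceMatcher
    from itertools import combinations

    def similarity(a, b):
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def find_duplicates(records, key="name", threshold=0.85):
        """Return index pairs of records deemed to refer to the same entity."""
        return [
            (i, j)
            for (i, r1), (j, r2) in combinations(enumerate(records), 2)
            if similarity(r1[key], r2[key]) >= threshold
        ]

    people = [{"name": "Jane Smith"}, {"name": "jane  smith"}, {"name": "Bob Ray"}]
    print(find_duplicates(people))  # [(0, 1)]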

Complex Data Processing

Transformation Lifecycle Management with Nautilus. Using declarative languages such as SQL to specify queries, developers often face the problem that they cannot properly inspect or debug their query or transformation code: all they see is the tip of the iceberg once the result data is computed. If it does not match their expectations, they usually perform one or more tedious and mostly manual analyze-fix-test cycles until the expected result appears. The goal of Nautilus is to support developers in this process with a suite of algorithms and tools. In addition to the work on Why-Not data provenance described above, we have so far made advances on query analysis and debugging [CIKM12] as well as provenance-supported query fixing [VLDB15]. An overview of the system is provided in [QDB11].
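
The following sketch illustrates the kind of step-by-step inspection Nautilus automates, using an in-memory SQLite database and an invented example query (not the Nautilus implementation): evaluating a query's intermediate steps separately shows where expected rows disappear.

    # Minimal sketch of query debugging by inspecting intermediate
    # results: run each sub-step of a query separately and report how
    # many rows survive. Illustrative only.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders(id INTEGER, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'Alice', 40.0), (2, 'Bob', 120.0);
    """)

    steps = [
        ("base table",   "SELECT * FROM orders"),
        ("after filter", "SELECT * FROM orders WHERE total > 100"),
        ("projection",   "SELECT customer FROM orders WHERE total > 100"),
    ]
    for label, sql in steps:
        rows = conn.execute(sql).fetchall()
        print(f"{label}: {len(rows)} row(s) -> {rows}")
    # The per-step counts show the filter is where Alice's order is lost.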

OakSaD. The OakSaD project was a collaboration between INRIA Oak and the database group at UC San Diego. Within OakSaD, we worked on topics including (1) scalable management of complex data in the cloud, (2) provenance analysis in complex data processing workflows, and (3) data management tools and techniques for annotated documents.

XClean. In cooperation with INRIA Futurs, France, we developed XClean, a system for XML data cleaning. XClean allows a declarative specification of an XML cleaning process; this specification is then compiled into an XQuery, which can be executed on any XQuery processor [CIDR07, CAISE07].
This project was partly funded by a DAAD doctoral scholarship (Doktorandenstipendium).
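
The sketch below illustrates the compilation idea with an invented, much-simplified spec format (it does not reproduce XClean's operator language): a "normalize whitespace in these fields" specification is compiled into an XQuery FLWOR expression that any XQuery processor can run.

    # Minimal sketch of XClean's compilation idea: a tiny declarative
    # cleaning spec is translated into an XQuery string. The spec format
    # and generated query are invented for illustration.

    def compile_cleaning_spec(doc, item_path, fields):
        """Return an XQuery that rebuilds each item with
        whitespace-normalized versions of the given fields."""
        elem = item_path.split("/")[-1]
        body = ",\n  ".join(
            f"element {f} {{ normalize-space($i/{f}) }}" for f in fields
        )
        return (
            f'for $i in doc("{doc}"){item_path}\n'
            f"return element {elem} {{\n  {body}\n}}"
        )

    print(compile_cleaning_spec("customers.xml", "//customer", ["name", "city"]))
    # for $i in doc("customers.xml")//customer
    # return element customer {
    #   element name { normalize-space($i/name) },
    #   element city { normalize-space($i/city) }
    # }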

Publications

To get an overview of my publications, refer to the dedicated publications page, my DBLP entry, or my Google Scholar profile.