
HIL Data Provenance Research Collaboration with Industry

End-to-end provenance support for transparent data processing in complex data integration pipelines
Project type: Research Collaboration with Industry
Funded by: IBM
Start: 08/2016
Project lead: Prof. Dr. rer. nat. Melanie Herschel
Cooperation partner: IBM Research - Almaden

Modern data analytics requires access to large amounts of data that are heterogeneous, unstructured, and spread across multiple documents, databases, or repositories. Before specialized machine learning algorithms, visualization applications, or other advanced analytics can consume it, the raw data must pass through a complex pipeline of data operations. These range from extracting and cleansing the relevant information (e.g., from text) to linking, fusing, and aggregating related pieces of information into unified entities and relationships that describe a domain. Researchers at IBM Almaden, together with the Watson group, are building a platform for developing and running the complex data engineering pipelines needed to create rich entity-centric content from raw data.

In collaboration with IBM researchers, the project develops a fine-grained provenance model for data integration as part of complex data engineering pipelines. Provenance allows users to ask questions about the origins of any piece of data processed during data integration. Examples include “Where did the value CA come from?” and “Why am I missing sales numbers for Asia subsidiaries?”
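To make the first kind of question concrete, the following is a minimal Python sketch of value-level "where" provenance. It is not the provenance model developed in this project; the names Traced, extract, and fuse, as well as the record identifiers "crm:17" and "web:3", are hypothetical and chosen only for illustration. Each value carries the set of source locations it was copied from, so the origin of "CA" can be read off directly after a fusion step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Traced:
    """A value paired with the source locations it was copied from (hypothetical type)."""
    value: object
    sources: frozenset = field(default_factory=frozenset)


def extract(record_id, record, attr):
    """Extraction step: read one attribute and tag it with its origin."""
    return Traced(record[attr], frozenset({(record_id, attr)}))


def fuse(*candidates):
    """Fusion step: keep the first non-empty candidate together with its origin.

    If every candidate is empty, return None and list all inspected
    locations, which gives a starting point for a "why is it missing?" question.
    """
    for candidate in candidates:
        if candidate.value not in (None, ""):
            return candidate
    inspected = frozenset().union(*(c.sources for c in candidates))
    return Traced(None, inspected)


# Two hypothetical source records describing the same entity.
crm = {"state": "CA"}
web = {"state": ""}

state = fuse(extract("web:3", web, "state"), extract("crm:17", crm, "state"))
print(state.value, sorted(state.sources))
```

Here the answer to “Where did the value CA come from?” is the single location recorded in state.sources. A fine-grained model for real pipelines would also need to cover linking and aggregation steps, which this sketch deliberately leaves out.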

Supporting such questions and answering them precisely, yet in an easy-to-understand way, can significantly reduce the development effort for data engineering pipelines in Watson-like systems. Further benefits are improved data quality and greater trust in the produced data.