Powered by OpenAIRE graph

InfClean

Effective Inference of Cleaning Programs from Data Annotations
Funder: French National Research Agency (ANR)
Project code: ANR-18-CE23-0019
Funder Contribution: 213,321 EUR
Description

This proposal addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as “data cleaning”. In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. These specifications are then encoded in cleaning programs that are executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today’s data, is conducted with a best-effort approach that provides no formal guarantee on the final quality of the data. The goal of InfClean is to rethink the data cleaning field, starting from its assumptions, with an inclusive formal framework that radically reduces the human effort in cleaning data. This will be achieved in three steps: (1) by laying the theoretical foundations of synthesizing specifications directly with the domain experts; (2) by designing and implementing new automated techniques that use external information to identify and repair data errors; (3) by modeling the interactive cleaning process with a principled optimization framework that guarantees quality requirements. The project will lay a solid foundation for data cleaning, enabling a formal framework for specification synthesis, algorithms for increased automation, and a principled optimizer with quality performance guarantees for user interaction. It will also broadly enable accelerated information discovery, as well as the economic benefits of early, well-informed, trustworthy decisions. To provide the right context for evaluating these new techniques and to highlight the impact of the project in different fields, InfClean plans to address its objectives using real case studies from different domains, including health and biodiversity data.

