UK Data Service

2 Projects
  • Funder: UK Research and Innovation
    Project Code: ES/Z502984/1
    Funder Contribution: 158,520 GBP

    The proposal outlines a project geared towards revolutionizing data accessibility and security through innovative data synthesis techniques. We first highlight one bottleneck in the data discovery process: the scarcity of good teaching datasets, particularly for data that sit in virtual research environments where access restrictions impede their creation. This creates a discoverability challenge for new users, who are unable to explore data before going through an approval process, which raises barriers to entry. While synthetic data is a potential solution, concerns about risk and utility exist. Data services often grapple with assessing the disclosure risk associated with synthetic data, as it falls outside the scope of conventional output disclosure control rules. Moreover, there is uncertainty about its utility, especially when specific analyses might yield results that diverge from those on real data, diminishing the effectiveness of the training process. The project has three objectives: (1) investigate tailored teaching datasets for restricted data access, (2) develop a systematic approach to assess disclosure risk in analytical outputs from restricted data sources, and (3) assess the feasibility of producing linked synthetic data from different sources (using the same methodology). The project runs from April 2024 to March 2025 and falls primarily under Theme 2: Data discovery using machine learning or other AI technologies, but also has the potential to add value under the other two themes (with objective 3 speaking to the federated services agenda and objective 2 providing a tool for augmenting the skills of output checkers). A preliminary study conducted at Manchester University, in collaboration with Administrative Data Research UK, demonstrates the feasibility of generating synthetic datasets with both high utility and low risk.
The methodology uses cleared analytical outputs from data services as the basis for generating synthetic data with a genetic algorithm. The goal is to provide trainees with data that not only closely resembles real-world data but also yields analytical output very similar to that of the real data, enhancing the training experience. Beyond this replication of analytical properties, the approach also offers a route to formalise the assessment of the disclosure risk associated with analytical outputs from safe settings. By embodying statistical outputs in synthetic data, it enables a systematic evaluation of disclosure risk, addressing the informality and potential inconsistencies present in current output checking procedures. Furthermore, the project aims to bolster the federated services agenda by exploring the creation of synthetic linked data using analytical outputs from data held by multiple services. This approach expands the possibilities of data synthesis without the need for actual linkage and elaborate governance of infrastructure, such as trusted third parties. Deliverables include open-source code, example synthetic datasets, and academic papers aimed at knowledge dissemination and skill development. The project emphasizes collaboration among data providers, services, and stakeholders to address challenges in data accessibility and security. In essence, the project aims to redefine data accessibility by providing tailored teaching datasets and systematic disclosure risk assessment methods. It will also foster a collaborative ecosystem for transformative advancements in data synthesis and access management, and contribute to the broader research data landscape.
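
As a rough illustration of fitting synthetic data to analytical outputs with a genetic algorithm, the sketch below evolves a small two-column dataset until its summary statistics approach a set of target values. It is not the project's code: the choice of target statistics (two means and a correlation), the error function, the crossover and mutation operators, and every name and parameter (`evolve`, `error`, `n_rows`, `pop_size` and so on) are assumptions made for this example.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column has no variance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def summary(rows):
    """Statistics standing in for cleared outputs (an assumption of this sketch)."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return (statistics.fmean(xs), statistics.fmean(ys), pearson(xs, ys))


def error(rows, target):
    """Squared distance between a candidate's statistics and the targets."""
    return sum((a - b) ** 2 for a, b in zip(summary(rows), target))


def crossover(a, b, rng):
    """Build a child by taking each row from one of the two parents at random."""
    return [rng.choice(pair) for pair in zip(a, b)]


def mutate(rows, rng, rate=0.1, scale=0.5):
    """Perturb a fraction of rows with Gaussian noise."""
    return [
        (x + rng.gauss(0, scale), y + rng.gauss(0, scale)) if rng.random() < rate else (x, y)
        for x, y in rows
    ]


def evolve(target, n_rows=50, pop_size=40, generations=200, seed=0):
    """Return the best candidate dataset and the best error seen per generation."""
    rng = random.Random(seed)
    population = [
        [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_rows)]
        for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=lambda rows: error(rows, target))
        history.append(error(population[0], target))
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=lambda rows: error(rows, target))
    history.append(error(best, target))
    return best, history
```

Calling `evolve((2.0, -1.0, 0.6))` returns the best candidate dataset together with the best error at each generation. Because the better half of the population is carried over unchanged, that error never increases from one generation to the next. A real application would need a far richer set of target outputs and a separate assessment of disclosure risk, neither of which this sketch attempts.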

  • Funder: UK Research and Innovation
    Project Code: ES/Z502947/1
    Funder Contribution: 335,479 GBP

    Advances in artificial intelligence (AI) are revolutionising how we search for information. Large language models (LLMs), such as OpenAI's 'ChatGPT' or Google's 'Bard', are good at understanding what we say and the meaning behind our words. Conversing with these tools helps us to pin down more accurately the information we want to find. Existing search tools focus on 'keywords', which may not always give good answers. LLMs help people who might not know the exact words to use, because they capture the context and relationships behind our language. They can adapt to different ways of asking questions, and can explain why they returned a given result. We believe that these maturing technologies can help researchers search for data. By training existing LLMs to learn what UKRI-supported research data exist, we can make the most of their existing abilities to understand human language to create a powerful data search tool. Their potential as a data search tool is unknown, and we are not aware of any existing tools of this kind for UK research datasets. Our proposal will develop, pilot and evaluate the effectiveness of LLMs to this end. The main output of this work will be a fully deployable 'chat box' search tool that researchers will be able to use to discover research datasets. To achieve this, we will collate the metadata of data catalogues across a range of UKRI research investments including the Consumer Data Research Centre, NERC Environmental Data Service, Administrative Data Research UK and UK Data Service. By combining data catalogues across these unconnected services, we will provide a new single 'port of call' for searching research data. We will design our project so that it can easily adapt to integrate new datasets. These data will then be used to develop a new AI-derived search tool based on LLMs.
We want to understand how these technologies can be used effectively by researchers and whether they will give more useful search results. Our mixed methods approach will test and evaluate the acceptability, suitability, and performance of our new search tool in comparison to existing UKRI search tools. This will include focus groups to qualitatively examine the acceptability of LLMs for data discovery, a quantitative comparison of how our new tool performs against existing keyword search tools, and tests that task participants with searching for data. We will report the strengths and limitations of LLMs to examine how useful they are. We will make recommendations for how they can be deployed, refined and sustained to support the changing ways in which researchers search for data. Our project will bring added value to existing UKRI data discovery resources by creating a new tool that takes account of the context and meaning of search queries, providing a broader and more accurate list of datasets based on what is searched for. We hope that this will help researchers to find exactly the data they need for their research.
