UK's grand plan to fuel AI with public data faces uphill battle

Agents will look for info elsewhere unless official sources sharpen up

The Register

The UK's hopes of fueling cutting-edge AI development and applications with a National Data Library (NDL) could be dashed unless it makes datasets easier to use.

With misleading titles and non-existent metadata, the data currently available cannot support any meaningful analysis, a study from the Open Data Institute (ODI) found.

In the Autumn Budget of 2024, the government confirmed plans for the NDL, promising researchers and businesses "powerful insights that will drive growth and transform people's quality of life through better public services and cutting‑edge innovation, including AI." In January, it published an update, saying the plan was backed by a £100 million investment as part of £1.9 billion being provided to the Department for Science, Innovation and Technology (DSIT) through 2028/29.

DSIT said it had completed an extensive discovery phase to map out "the biggest opportunities and priorities" and "test approaches to systemic reform" across the public sector.

However, the ODI has published an "NDL-Lite" prototype, with access to more than 100,000 public datasets. It found some of the datasets – particularly on data.gov.uk – are badly labelled, out of date, or effectively invisible to AI tools. When authoritative data is hard to access, AI systems turn to other sources, such as news reports or commercial data, which do not always give accurate information, the ODI warned.

The prototype gathered 38 GB of data from six public sector sources, processing and standardizing more than 100,000 files into a single resource. While the study showed the NDL could be built at relatively low cost, it also highlighted the work needed to make the data AI-ready.

The study found that even broad terms such as "crime" were difficult to analyze or track properly. Some datasets with that label were local authority statistical releases that could not be combined because of a lack of shared standards. National datasets were also outdated or inaccessible. One major Home Office crime dataset has not been updated since 2018. Although there is an updated version, it cannot be accessed via the API provided by the Office for National Statistics (ONS).
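To make those failure modes concrete, here is a minimal Python sketch of the kind of check that could flag a dataset as hard for an AI tool to use. It is not the ODI's method: the `DatasetMeta` record, its field names, the `readiness_issues` helper, and the two-year staleness threshold are all assumptions made up for illustration, and none of them comes from data.gov.uk, the ONS API, or the NDL-Lite prototype.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


# Hypothetical metadata record. The field names are illustrative only and
# are not taken from data.gov.uk or the ODI prototype.
@dataclass
class DatasetMeta:
    title: str
    description: Optional[str]
    last_updated: Optional[date]


def readiness_issues(meta: DatasetMeta, today: date,
                     max_age_days: int = 730) -> list[str]:
    """Return the reasons a dataset may be hard for an AI tool to use.

    max_age_days is an assumed threshold (roughly two years), not a standard.
    """
    issues = []
    # A blank or absent description gives an agent nothing to match against.
    if not (meta.description or "").strip():
        issues.append("missing description")
    # No update date at all is a separate problem from an old one.
    if meta.last_updated is None:
        issues.append("missing last-updated date")
    elif (today - meta.last_updated).days > max_age_days:
        issues.append("stale")
    return issues
```

Run against a made-up record with no description and a last update in 2018, the helper would report both a missing description and a stale dataset. A real pipeline would need agreed metadata standards before a check like this could be applied consistently across sources, which is the gap the study points to.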

Professor Elena Simperl, director of research at the ODI, told The Register that the findings highlight a growing gap between the volume of public data available and its practical usability.

"For crime statistics, the AI agents then went and tried to find crime statistics from somewhere else. If you don't update your data, if your metadata is not good quality and has lots of missing values, we could see from our experiments with the AI agent we built that they would just circumvent the available data. It would go elsewhere on social media and other places to try to find that information in a report somewhere, because it's much easier for them," she said.

"The government's National Data Library has huge potential, but much of the data it would rely on is not yet usable by modern AI systems. If that doesn't change, there is a risk that AI tools will increasingly rely on sources that are easier to access, rather than those that are most reliable."

A government spokesperson told us it wants to "maximise the benefits of public sector data" in a bid to make services "more efficient and grow the economy."

"Reflecting these findings, we're already overhauling the UK's digital public infrastructure through our Roadmap for Modern Digital Government.

"That includes building new infrastructure like the National Data Library in a way that ensures public sector data is shared and used more easily, upgrades to outdated systems and putting new guidance in place for the safe and ethical use of public data."

The National Data Library is the latest project designed to help researchers and data scientists find all the publicly held data they need. An earlier effort, the Secure Research Service (SRS), launched in 2004 and offers curated, research-ready datasets to accredited researchers.

In 2020, the government planned to replace this system with the Integrated Data Service (IDS) from the ONS. However, some of its budget of £240.8 million was used – with approval from His Majesty's Treasury – to fund more general tech and data costs as the ONS struggled to move off legacy IT systems. Funding for the IDS was effectively cut in March. Existing services will continue to be available, but largely within the ONS, which means the project misses one of its major objectives.

The NDL is the new plan for national data sharing to support research, machine learning, and AI. The ODI's study shows the work needed to stop it becoming another missed opportunity. ®