Building Smarter Data Pipelines: How RAIDO Is Transforming AI-Ready Data Preparation

How the RAIDO project is automating data curation, annotation and federated mining to power trustworthy AI

High-quality training data is the backbone of any reliable AI system. Yet data preparation is often the most demanding phase of the machine learning lifecycle, both in the effort it requires and in the likelihood of mistakes. According to a recent survey, data preparation is essential for effective ML yet typically remains a manual, time-consuming process that takes up most of the project lifecycle (Mladenovic et al., 2026). RAIDO addresses this problem with a complete workflow for data enrichment, curation, distillation, and federated mining.

What the Pipeline Does

The data curation pipeline covers two major data modalities, images and time series, and offers a rich set of automated pre-processing steps. For images, this includes invalid pixel detection, outlier removal, noise filtering, data enrichment and class-balancing techniques such as oversampling and dataset distillation. For time series, the pipeline handles cleanup of missing values, imputation, outlier and noise detection, feature engineering, dimensionality reduction, and distillation.
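To make two of the time-series steps concrete, here is a minimal sketch of imputation followed by outlier detection. It is not RAIDO's implementation: the function names, the sample series, the choice of linear interpolation and the median-absolute-deviation rule (with the commonly used 3.5 threshold) are all assumptions made for illustration.

```python
from statistics import median

def interpolate_missing(values):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

def flag_outliers(values, threshold=3.5):
    """Flag points whose MAD-based modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]

# Hypothetical sensor readings with two gaps and one spike.
series = [10.0, None, 12.0, 11.0, 95.0, 10.0, None, 11.0]
filled = interpolate_missing(series)
flags = flag_outliers(filled)
```

In this example the two gaps are filled from their neighbours and only the 95.0 reading is flagged. A median-based rule is used here because a single large spike barely shifts the median, whereas it would inflate a mean and standard deviation.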

All these functions are orchestrated through Apache Airflow, giving users a flexible and reproducible workflow engine. The tools have already been handed off for platform-level integration into the broader RAIDO system.
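The idea behind this kind of orchestration is that each step declares what it depends on, and the engine derives a repeatable execution order from those declarations. The sketch below illustrates that idea with Python's standard library only. It does not use the Airflow API, and the step names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the set of steps that must finish first.
dependencies = {
    "impute_missing": {"load_series"},
    "detect_outliers": {"impute_missing"},
    "engineer_features": {"detect_outliers"},
    "reduce_dimensions": {"engineer_features"},
    "distill": {"reduce_dimensions"},
}

def run_pipeline(steps, deps):
    """Run each step once, only after all of its upstream steps have completed."""
    executed = []
    for name in TopologicalSorter(deps).static_order():
        steps[name]()  # call the step's function
        executed.append(name)
    return executed

# Placeholder callables stand in for the real processing functions.
steps = {name: (lambda: None) for name in ["load_series", *dependencies]}
order = run_pipeline(steps, dependencies)
```

Because the order is computed from the declared dependencies rather than hard-coded, adding or removing a step only means editing the dependency map.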

AutoAnnotate: Slashing Labelling Time by Up to 98.8%

One of RAIDO’s standout achievements is the AutoAnnotate toolset, a pair of open-source, modular auto-annotation tools for both images and time series. Built on a foundation-model architecture that can easily adapt to new models and clustering methods, AutoAnnotate has demonstrated annotation time reductions of up to 98.8% while maintaining accuracies as high as 99.1%.
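One common way that clustering reduces labelling effort is to ask a human for a label only once per cluster and copy it to the other members. The sketch below shows that general idea and is not the AutoAnnotate API: the propagate_labels function, the toy embeddings (which would come from a foundation model in practice), the cluster assignments and the label names are all assumptions.

```python
from math import dist

def propagate_labels(embeddings, cluster_ids, ask_annotator):
    """Label one representative per cluster and copy that label to its members."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)

    labels = [None] * len(embeddings)
    queried = []
    for members in clusters.values():
        dims = len(embeddings[members[0]])
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members)
            for d in range(dims)
        ]
        # The sample closest to the centroid is the only one shown to the annotator.
        representative = min(members, key=lambda i: dist(embeddings[i], centroid))
        label = ask_annotator(representative)
        queried.append(representative)
        for i in members:
            labels[i] = label
    return labels, queried

# Toy 2-D embeddings and precomputed cluster assignments (both hypothetical).
embeddings = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
cluster_ids = [0, 0, 0, 1, 1, 1]
# A lookup table stands in for the human annotator.
manual_labels = {2: "idle", 5: "fault"}
labels, queried = propagate_labels(embeddings, cluster_ids, manual_labels.__getitem__)
```

Here six samples are labelled with two manual decisions. The quality of the result depends entirely on how well the clusters align with the true classes, which is why both the embedding model and the clustering method matter.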

Both tools are publicly available on PyPI (AutoAnnotate-Timeseries, AutoAnnotate-Vision), in line with RAIDO’s commitment to open science. A research paper on AutoAnnotate was accepted and presented at the IEEE Conference on Artificial Intelligence 2026 (Workshop W7-GPAIS), further validating the approach in the scientific community.

Federated Data Mining Across Borders

Figure: RAIDO’s Federated Data Mining Architecture

The work also delivered a Federated Data Mining (FDM) system. The FDM tools produce datasets with comparable predictive value but significantly reduced volume, supporting RAIDO’s green AI goals by cutting down on the data, compute and energy needed for model training. A dedicated research paper on the time-series FDM tools was submitted to the RAIDO-GTAI workshop at IEEE-CH 2026.

Looking Ahead: From Pipeline to Real-World Impact

These tools are not staying in the lab. The data curation pipeline, AutoAnnotate and FDM system are already being applied across RAIDO’s real-world pilots in energy grid management, precision agriculture and robotics, helping domain experts train better models with less data and lower energy costs. As the project enters its final year, the focus shifts to refining these integrations and demonstrating measurable impact in each pilot domain. The broader ambition is clear: making high-quality, trustworthy AI accessible not just to data science teams, but to the industries and communities that stand to benefit most.

References

Mladenovic, S., Lindauer, M., & Doerr, C. (2026). “Automated Data Preparation for Machine Learning: A Survey.” OpenReview. https://openreview.net/forum?id=Euti6LHIOs
