
Advancing trustworthy data through ETL validation and data quality assessment

Published on: June 9, 2026

On 21 May, the fourth INDICATE training session brought together participants from across the consortium to explore one of the most critical aspects of data integration: ensuring data quality throughout the ETL (Extract, Transform, Load) process.

Led by Celia Alvarez Romero and María González-Lopez from the Computational Health Informatics Group at Virgen del Rocío University Hospital, the training focused on practical approaches to data validation, quality assessment, and troubleshooting within the INDICATE data infrastructure.

Participants revisited the process of transforming intensive care data from different hospital systems into common standards that enable information to be shared and analysed consistently across institutions. The discussion highlighted that achieving this requires more than technical tools and data standards. It depends on close collaboration between clinicians, data experts, terminology specialists, and technical teams to ensure that clinical information is accurately represented and can be reused for research and innovation.

A major focus was on data quality assessment. Rather than discussing validation only from a theoretical perspective, participants followed the fictional journey of “Hospital A”, a European hospital preparing to join INDICATE as a data provider. After successfully completing its ETL process and creating a local OMOP instance, Hospital A began evaluating the quality of its transformed data.

Using a series of realistic scenarios, attendees explored common challenges such as missing data, incorrect units, mapping inconsistencies, and implausible values. Together, they discussed the most likely causes of these issues and identified practical solutions.
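To make these scenarios concrete, the kinds of checks discussed can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the training or by INDICATE: the `Measurement` record is a simplified stand-in for a row of the OMOP `measurement` table, and the concept names, unit strings, and plausibility ranges in `RULES` are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Measurement:
    """Simplified, hypothetical stand-in for one OMOP measurement row."""
    person_id: int
    concept: str            # stands in for measurement_concept_id
    value: Optional[float]  # stands in for value_as_number
    unit: Optional[str]     # stands in for unit_concept_id


# Hypothetical rules: expected unit and plausible range per concept.
# Real thresholds would be agreed with clinicians.
RULES = {
    "heart_rate": ("beats/min", 0.0, 300.0),
    "body_temperature": ("Cel", 25.0, 45.0),
}


def check(m: Measurement) -> List[str]:
    """Return the list of quality issues found for one record."""
    rule = RULES.get(m.concept)
    if rule is None:
        # No rule means the source code was probably not mapped as expected.
        return ["unmapped concept"]

    expected_unit, low, high = rule
    issues = []
    if m.value is None:
        issues.append("missing value")
    if m.unit != expected_unit:
        issues.append("unexpected unit")
    elif m.value is not None and not (low <= m.value <= high):
        # The range is only checked when the unit is the expected one.
        issues.append("implausible value")
    return issues


if __name__ == "__main__":
    records = [
        Measurement(1, "heart_rate", 72.0, "beats/min"),
        Measurement(2, "heart_rate", None, "beats/min"),
        Measurement(3, "body_temperature", 98.6, "[degF]"),
        Measurement(4, "heart_rate", 720.0, "beats/min"),
    ]
    for r in records:
        print(r.person_id, check(r) or "ok")
```

The range check is skipped when the unit is not the expected one, because a value recorded in the wrong unit would otherwise be reported twice. Deciding whether a flagged record points to a source problem, a mapping error, or a genuine clinical outlier still requires the kind of discussion between clinicians and ETL developers described here.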

Participants also reviewed the results of a recent survey among INDICATE data providers across Europe. The survey found that many hospitals already have strong foundations in place for sharing intensive care data and reusing it for research. At the same time, it highlighted several common challenges, including limited staff capacity for data integration and ETL work, difficulties in achieving interoperability across different hospital systems, and varying levels of maturity in data quality management and validation processes.

One of the central takeaways was that data quality is not a one-time task but an ongoing, collaborative effort. Effective validation requires continuous communication between clinicians, data stewards, ETL developers, and technical experts to ensure that data remains accurate, complete, and fit for federated research and AI applications.

The INDICATE Training programme on Data Model & Data Enablement consists of five sessions. The fifth and final session will take place on June 17, 14:00–16:00 CEST.