A hybrid approach to scalable real-world data curation by machine learning and human experts

Abstract

Machine learning has the potential to increase the scale of real-world data curated from electronic health records, but maintaining a high standard of data quality is important to avoid biasing downstream analyses. To increase scale without compromising quality, we propose a hybrid data curation methodology that employs both manual abstraction by clinical experts and automated extraction by machine learning models. The methodology uses a confidence score associated with each model output to decide when to employ manual abstraction. We describe a process for selecting confidence thresholds based on simulations validated against a reference-standard labeled dataset. To establish the fitness of our methodology for retrospective research, we apply it to a multi-variable cohort selection task on a large real-world oncology database. This is joint work with Michael Waskom, Katherine Tan, Aaron Cohen, Brett Wittmerhaus, and Will Shapiro (Flatiron Health).
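The routing and threshold-selection ideas described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Prediction`, `route`, `simulate`, and `select_threshold` are hypothetical, the task is assumed to be binary, and the simulation makes the simplifying assumption that manual abstraction always reproduces the reference-standard label.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Prediction:
    label: int         # model's predicted class (assumed binary: 0 or 1)
    confidence: float  # model's confidence in that label, in [0, 1]


def route(pred: Prediction, threshold: float) -> str:
    """Accept the model output when confidence meets the threshold, else defer to a human."""
    return "model" if pred.confidence >= threshold else "human"


def simulate(
    preds: Sequence[Prediction], reference: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Estimate (hybrid accuracy, automation rate) on a reference-labeled dataset.

    Assumption: a record routed to manual abstraction receives the reference label.
    """
    automated = 0
    correct = 0
    for pred, truth in zip(preds, reference):
        if route(pred, threshold) == "model":
            automated += 1
            correct += int(pred.label == truth)
        else:
            correct += 1  # assumed-perfect human abstraction
    n = len(reference)
    return correct / n, automated / n


def select_threshold(
    preds: Sequence[Prediction],
    reference: Sequence[int],
    candidates: List[float],
    min_accuracy: float,
) -> Optional[float]:
    """Return the lowest candidate threshold whose simulated accuracy meets the target.

    The lowest qualifying threshold automates the most records, because the
    automation rate can only fall as the threshold rises.
    """
    for threshold in sorted(candidates):
        accuracy, _ = simulate(preds, reference, threshold)
        if accuracy >= min_accuracy:
            return threshold
    return None  # no candidate meets the quality target
```

In a multi-variable setting such as cohort selection, one plausible extension is to select a threshold per variable and then check the quality of the combined cohort, though the abstract does not specify how the thresholds interact.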

Event: INFORMS 2023