A hybrid approach to scalable real-world data curation by machine learning and human experts

Abstract

Machine learning has the potential to increase the scale of real-world data curated from electronic health records, but maintaining a high standard of data quality is important to avoid biasing downstream analyses. To increase scale without compromising quality, we propose a hybrid data curation methodology that employs both manual abstraction by clinical experts and automated extraction by machine learning models. Our methodology makes the determination about when to employ manual abstraction using a confidence score associated with each model output. We describe a process for selecting confidence thresholds based on simulations validated against a reference-standard labeled dataset. To establish the fitness of our methodology for retrospective research, we apply it to a multi-variable cohort selection task on a large real-world oncology database. We find that only small amounts of manual abstraction are required for hybrid curation to achieve expert-level error rates. In fact, the hybrid methodology can even reduce error rates relative to manual abstraction in some cases. We further demonstrate that demographic characteristics of a research cohort defined using hybrid variables are comparable to one curated with conventional methods. Our methodology is general and makes few assumptions about the clinical variable or machine learning model. A key requirement is the availability of reference standard labels for calibrating the tradeoff between abstraction effort and data quality. Incorporating machine learning into real-world data curation using hybrid methodology holds the promise to scale practicable cohort sizes while maintaining data fitness for research purposes.

Publication
Under Review