Improving Missing Data Analysis in Distributed Research Networks (Massachusetts)

Project Details - Ongoing


Electronic health record (EHR) databases collect data that reflect routine clinical care. These databases are increasingly used in comparative effectiveness research, patient-centered outcomes research, quality improvement assessment, and public health surveillance to generate actionable evidence that improves patient care. In this work it is often necessary to analyze multiple databases that cover large and diverse populations in order to improve the statistical power of a study or the generalizability of its findings. Distributed research network (DRN) architecture is commonly used to access data across multiple databases because it allows data partners to maintain physical control of their data. While DRNs allow analysis of data across multiple EHRs, missing data are a challenge. Data may be missing for two reasons: 1) they were not collected (for example, the provider did not ask about smoking status); or 2) they were misclassified (for instance, an individual with asthma was incorrectly recorded in the EHR as not having the condition). Optimal use of DRNs requires strategies for correcting both types of missing data.

This project will refine existing methods and develop new methods to address missing data issues in EHR databases. First, the team will assess, within DRNs, existing conventional statistical methods that have proven effective at correcting missing data in a single database. Second, they will link EHRs with administrative claims data and apply machine learning techniques to correct misclassification of select variables thought to be more accurately and completely captured in administrative claims data. Third, they will combine the first two methods into a hybrid approach.
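
The two kinds of missing data described above, and the idea of linking EHR records to claims, can be illustrated with a small Python sketch. It is only a toy illustration and not the project's method: the record layout, the field names (`smoker`, `asthma`), the patient IDs, and the rule of treating claims as the reference source are all assumptions made for this example, and a simple rule-based comparison stands in for the machine learning techniques the project proposes.

```python
# Toy records; every ID, field name, and value here is invented for illustration.
ehr = {
    "p1": {"smoker": None, "asthma": False},   # smoking status never recorded
    "p2": {"smoker": True, "asthma": False},   # asthma possibly misclassified
    "p3": {"smoker": False, "asthma": True},
}
claims = {
    "p1": {"asthma": False},
    "p2": {"asthma": True},   # claims show an asthma diagnosis
    "p3": {"asthma": True},
}


def not_collected(records, field):
    """Return patient IDs whose value for `field` was never recorded (None)."""
    return sorted(pid for pid, rec in records.items() if rec.get(field) is None)


def possibly_misclassified(ehr_records, claims_records, field):
    """Return patient IDs where the EHR says False but linked claims say True."""
    flagged = []
    for pid, rec in ehr_records.items():
        linked = claims_records.get(pid)
        if linked is None:
            continue  # no linked claims record, so nothing to compare against
        if rec.get(field) is False and linked.get(field) is True:
            flagged.append(pid)
    return sorted(flagged)


missing_smoking = not_collected(ehr, "smoker")
suspect_asthma = possibly_misclassified(ehr, claims, "asthma")
print(missing_smoking)
print(suspect_asthma)
```

The first helper finds values that are visibly absent, while the second finds values that are present but disagree with the linked source. The second kind cannot be detected from the EHR alone, which is why the project turns to linkage with claims data.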

The specific aims of this project are as follows:

  • Apply and assess missing data methods developed in single-database settings to handle obvious and well-recognized missing data in DRNs. 
  • Apply and assess machine learning and predictive modeling techniques to address less-obvious and under-recognized missing data for select variables in DRNs. 
  • Apply and assess a comprehensive analytic approach that combines conventional missing data methods and machine learning techniques to address missing data in DRNs. 

The analytic methods developed in this project are expected to improve the quality of data used for research on comparative effectiveness, patient safety, and patient-centered outcomes conducted within DRNs.
