EHR Data

In healthcare research, real-world data from patients is invaluable for gaining insights into disease patterns, treatment effectiveness, and outcomes. Two major sources of real-world data are claims data and electronic health records (EHRs). Both data types have distinct advantages and limitations that impact their utility for different research applications. This article examines the key differences between claims data and EHR data and how researchers can leverage both data sources to answer critical healthcare questions.

What is Claims Data?

Health insurance claims data contains information submitted by healthcare providers to payers to receive reimbursement for services rendered to patients. Claims data includes patient demographics (age, gender, location), insurance details, diagnosis codes, procedure codes, prescription details, and cost and reimbursement information. Claims data provides a longitudinal view of a patient’s interactions across healthcare systems, as it captures data each time a claim is filed over months or years of care (Pivovarov et al., 2019).

Large claims clearinghouses aggregate data from millions of patients across different payers, providing massive real-world datasets. For example, IBM MarketScan databases contain claims information on over 240 million US patients collected from employers, health plans and government health programs (IBM Watson Health, 2022). Other major claims aggregators include Optum Clinformatics, Premier Healthcare Database and PharMetrics Plus.

Key Details Captured in Claims Data

  • Patient demographics – age, gender, location, insurance details
  • Diagnoses – ICD diagnosis codes
  • Procedures – CPT and HCPCS procedure codes
  • Medications – NDC codes, dose, number of prescription refills
  • Costs – total and itemized costs, amount paid by payer and patient responsibility
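As a rough illustration of how the elements listed above might sit together in a single record, here is a minimal sketch in Python. The class name, field names and the sample values are hypothetical, not the schema of any real claims database, and patient responsibility is simplified to total cost minus the payer's payment.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClaimLine:
    """One service line on a claim. All field names are hypothetical."""
    patient_id: str
    service_date: date
    age: int
    gender: str
    location: str                      # e.g. a coarse region or 3-digit ZIP
    plan_type: str                     # insurance details
    diagnosis_codes: Tuple[str, ...]   # ICD diagnosis codes
    procedure_code: Optional[str]      # CPT or HCPCS code, if any
    ndc_code: Optional[str]            # drug code for pharmacy claims
    days_supply: Optional[int]         # pharmacy claims only
    total_cost: Decimal
    payer_paid: Decimal

    @property
    def patient_responsibility(self) -> Decimal:
        # Simplifying assumption: whatever the payer did not pay falls to the patient.
        return self.total_cost - self.payer_paid


# A made-up office visit with a type 2 diabetes diagnosis code.
claim = ClaimLine(
    patient_id="P001",
    service_date=date(2023, 3, 14),
    age=57,
    gender="F",
    location="372",
    plan_type="commercial",
    diagnosis_codes=("E11.9",),
    procedure_code="99213",
    ndc_code=None,
    days_supply=None,
    total_cost=Decimal("150.00"),
    payer_paid=Decimal("120.00"),
)
print(claim.patient_responsibility)
```

Note that nothing in this record says why the visit happened or what the clinician observed, which is the gap the limitations discussed later in the article describe.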

Claims data is extremely valuable for comparative effectiveness research, pharmacoepidemiology, health economics and outcomes research (Berger et al., 2017). The large sample sizes and longitudinal view make claims databases ideal for studying disease incidence, treatment patterns, medication adherence, healthcare costs and utilization across different patient demographics and therapeutic areas.

Limitations of Claims Data

While claims data offers unparalleled scale and longitudinal perspective, researchers must be aware of its limitations:

  • A diagnosis code does not necessarily indicate a confirmed diagnosis; for example, a provider might submit a claim with a code for a condition that is still being considered during a diagnostic workup.
  • Diagnosis and procedure codes may be inaccurate or incomplete if providers submit improper codes. Important clinical details are missing.
  • Prescription fill records show that a medication was dispensed, but not whether the patient took it as directed.
  • Available data elements are restricted to what is required for reimbursement. No additional clinical context is provided.
  • Inability to link family members or track individuals who change payers over time.
  • Variable data quality and completeness across different claims sources.
  • Biased sampling based on specific payer population characteristics. May not represent the general population.

Despite these limitations, claims data remains highly useful for epidemiologic studies, health economics research, population health analyses and other applications where large sample sizes are critical. Researchers should account for the nuances of claims data during study design and analysis.

What are Electronic Health Records?

Electronic health records (EHRs) are digital records of patient health information generated throughout clinical care. EHRs are maintained by healthcare organizations and contain various data elements documenting patient encounters, including (Hersh et al., 2013):

  • Demographics – age, gender, race, ethnicity, language
  • Medical history – conditions, diagnoses, allergies, immunizations, procedures
  • Medications – prescriptions, dosing instructions
  • Vital signs – blood pressure, heart rate, weight, height
  • Lab test results
  • Radiology images
  • Clinical notes – physician progress notes, discharge summaries

A key advantage of EHR data is its rich clinical context. While claims data only captures billing codes, EHRs include detailed narratives, quantitative measures, images and comprehensive documentation of each patient visit. This supports a better understanding of disease presentation and progression, treatment rationale and response, and patient complexity.

EHR databases aggregate records across large healthcare networks to compile real-world data on millions of patients. For instance, Vanderbilt University Medical Center’s Synthetic Derivative database contains de-identified medical records for over 3.7 million subjects, and its BioVU® database contains over 310,000 DNA samples linked to de-identified medical records for genomics research (Roden et al., 2008).

Benefits of EHR Data

EHR data enables researchers to (Cowie et al., 2017):

  • Obtain granular clinical details beyond billing codes
  • Review physician notes and narratives for patient context
  • Link lab results, pathology reports, radiology images for additional phenotyping
  • Study unstructured data through natural language processing
  • Identify patient cohorts based on complex inclusion/exclusion criteria
  • Examine longitudinal disease patterns and treatment journeys
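To make the idea of cohort identification with inclusion and exclusion criteria more concrete, the sketch below filters a handful of made-up patient records. The record layout, the `meets_criteria` helper and the 6.5% HbA1c cutoff are illustrative assumptions, not a validated phenotype definition or clinical guidance.

```python
def meets_criteria(patient: dict) -> bool:
    """Apply illustrative inclusion/exclusion rules to one hypothetical EHR record."""
    # Inclusion 1: any type 2 diabetes diagnosis code (ICD-10-CM E11.x).
    has_diagnosis = any(code.startswith("E11") for code in patient["diagnoses"])
    # Inclusion 2: at least one HbA1c result at or above the illustrative cutoff.
    has_elevated_lab = any(
        lab["test"] == "hba1c" and lab["value"] >= 6.5 for lab in patient["labs"]
    )
    # Exclusion: patients under 18 years old.
    is_adult = patient["age"] >= 18
    return has_diagnosis and has_elevated_lab and is_adult


# Made-up records; a real EHR extract would be far richer and messier.
patients = [
    {"id": "P001", "age": 57, "diagnoses": ["E11.9"],
     "labs": [{"test": "hba1c", "value": 7.2}]},
    {"id": "P002", "age": 44, "diagnoses": ["E11.9"],
     "labs": [{"test": "hba1c", "value": 5.6}]},   # coded, but lab not elevated
    {"id": "P003", "age": 16, "diagnoses": ["E11.65"],
     "labs": [{"test": "hba1c", "value": 8.1}]},   # excluded: under 18
    {"id": "P004", "age": 63, "diagnoses": ["I10"], "labs": []},
]

cohort_ids = [p["id"] for p in patients if meets_criteria(p)]
print(cohort_ids)
```

Patient P002 is the interesting case: a diagnosis code alone would have placed them in the cohort, while the lab value available in the EHR does not support it.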

EHR data yields insights unattainable through claims data alone. The rich clinical details enable researchers to understand nuances in patient populations, disease manifestation and therapy response.

Challenges with EHR Data

While valued for its clinical context, EHR data also has some inherent limitations:

  • Incomplete or missing records if providers fail to properly document encounters
  • Incomplete records if a patient receives care at multiple, unlinked healthcare networks
  • Inconsistent use of structured fields vs free text notes across systems
  • Lack of national standards in data formats, terminologies and definitions
  • Datasets biased toward the patient population of a specific health system
  • Difficulty normalizing data across disparate EHR systems
  • Requires data science skills to analyze unstructured notes and documents
  • Requires clinical background to appropriately interpret unstructured notes and documents
  • More resource intensive for data extraction and processing compared to claims data

EHR data analysis requires specialized skills and infrastructure, especially to interpret unstructured data. Despite these limitations, EHRs remain an invaluable data source, whether used on their own or as a complement to other sources such as claims, for comprehensive real-world evidence generation.

Integrating Claims and EHR Datasets

Given the complementary strengths of claims data and EHRs, there is significant value in integrating these datasets to conduct robust real-world studies. This can be accomplished by (Maro et al., 2019):

  • Linking claims and EHR data at the patient level via unique identifiers
  • Building cohorts based on diagnosis codes from claims data, then reviewing clinical data for each patient in the EHR
  • Using natural language processing on EHR notes to extract additional details not available in claims
  • Applying claims analysis algorithms on EHR data to identify lapses in care, adverse events, etc.
  • Incorporating prescription fills from claims with medication orders in EHRs to assess adherence
  • Using cost data from claims combined with clinical data for health economic studies
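As one possible illustration of the patient-level linkage and adherence items above, the following sketch joins hypothetical EHR medication orders to claims fills on a shared patient identifier and computes a simple proportion of days covered (PDC). The data layout, function names and 90-day window are assumptions made for the example, and this simplified PDC counts overlapping supply only once rather than shifting early refills forward as some published definitions do.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, start, end):
    """Fraction of days in [start, end] with medication on hand.

    fills: list of (fill_date, days_supply) tuples taken from claims.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)   # a set counts overlapping days once
    total_days = (end - start).days + 1
    return len(covered) / total_days


def adherence_by_patient(ehr_orders, claims_fills, window_days=90):
    """Link EHR orders to claims fills by patient ID and compute PDC per patient."""
    results = {}
    for patient_id, order_date in ehr_orders.items():
        window_end = order_date + timedelta(days=window_days - 1)
        fills = claims_fills.get(patient_id, [])   # no fills: never dispensed
        results[patient_id] = proportion_of_days_covered(fills, order_date, window_end)
    return results


# Hypothetical linked data: order dates from the EHR, fills from claims.
ehr_orders = {"P001": date(2023, 1, 1), "P002": date(2023, 1, 10)}
claims_fills = {
    "P001": [(date(2023, 1, 1), 30), (date(2023, 2, 5), 30)],
}

results = adherence_by_patient(ehr_orders, claims_fills)
for patient_id, pdc in results.items():
    print(patient_id, round(pdc, 2))
```

Here P002 has an order in the EHR but no fill in claims, a pattern that neither source would reveal on its own.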

Major research networks like PCORnet have developed infrastructure to integrate claims and EHR data to support large-scale patient-centered outcomes research. When thoughtfully combined, these complementary data sources enable multifaceted real-world studies not possible using either source alone.

Claims data and EHRs both provide invaluable real-world evidence on patient populations, but have distinct strengths and limitations. Claims data allows longitudinal analysis of diagnosis, procedure and prescription patterns at scale, but lacks clinical granularity. EHRs provide rich clinical context like physician notes, lab results and images, but lack continuity across health systems and data standardization. By integrating these sources, researchers can conduct robust real-world studies leveraging the advantages of both datasets. Careful consideration of the nuances of each data type allows generation of comprehensive real-world evidence to inform healthcare decisions and improve patient outcomes.

At NashBio, we use EHR data for most of our analytic activities because of its depth and additional clinical context, which helps us build the highest fidelity study populations for our clients.


Berger, M. L., Sox, H., Willke, R. J., Brixner, D. L., Eichler, H.-G., Goettsch, W., … Schneeweiss, S. (2017). Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety, 26(9), 1033–1039.

Cowie, M. R., Blomster, J. I., Curtis, L. H., Duclaux, S., Ford, I., Fritz, F., … Zalewski, A. (2017). Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106(1), 1–9.

Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R. O., Bernstam, E. V., Lehmann, H. P., Hripcsak, G., Hartzog, T. H., Cimino, J. J., & Saltz, J. H. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30–S37.

IBM Watson Health. (2022). IBM MarketScan Databases.

Maro, J. C., Platt, R., Holmes, J. H., Stang, P. E., Steiner, J. F., & Douglas, M. P. (2019). Design of a study to evaluate the comparative effectiveness of analytical methods to identify patients with irritable bowel syndrome using administrative claims data linked to electronic medical records. Pharmacoepidemiology and Drug Safety, 28(2), 149–157.

Pivovarov, R., Albers, D. J., Hripcsak, G., Sepulveda, J. L., & Elhadad, N. (2019). Temporal trends of hemoglobin A1c testing. Journal of the American Medical Informatics Association, 26(1), 41–48.

Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clinical Pharmacology and Therapeutics, 84(3), 362–369.