


An Introduction to Polygenic Risk Scores: Aggregating Small Genetic Effects to Stratify Disease Risk

By | Polygenic Risk Scores

Key Takeaways:

  • Polygenic risk scores aggregate the effects of thousands of genetic variants to estimate an individual’s inherited risk for complex diseases.
  • Polygenic risk is based on genome-wide association studies that identify common variants associated with modest increases in disease risk.
  • Polygenic scores provide risk stratification beyond family history, but most disease risk is not yet explained by known variants.
  • Clinical validity and utility of polygenic scores will improve as more disease-associated variants are discovered through large genomic studies.
  • Polygenic risk models may one day guide targeted screening and preventive interventions, but face challenges related to clinical interpretation and implementation.

Introduction to Polygenic Risk Scores

The vast majority of common, chronic diseases do not follow simple Mendelian inheritance patterns, but rather are complex genetic conditions arising from the combined small effects of thousands of genetic variations interacting with lifestyle and environmental factors. Polygenic risk scores aggregate information across an individual’s genome to estimate their inherited susceptibility for developing complex diseases like heart disease, cancer, diabetes and neuropsychiatric disorders.

Polygenic risk scores are constructed using data from genome-wide association studies (GWAS) that scan markers across the genomes of thousands to millions of individuals to identify genetic variants associated with specific disease outcomes. While most disease-associated variants have very small individual effects, the combined effect of thousands of these common, single nucleotide polymorphisms (SNPs) can stratify disease risk in a polygenic model.

Polygenic Scores vs. Single Gene Mutations

In monogenic diseases like cystic fibrosis and Huntington’s disease, a single genetic variant is necessary and sufficient to cause disease. Genetic testing for causal mutations in specific disease-linked genes provides a clear-cut diagnostic assessment. In contrast, no single gene variant accounts for more than a tiny fraction of risk for complex common diseases. Polygenic risk models aggregate the effects of disease-associated variants across the genome, each imparting a very modest increase or decrease in risk. An individual’s polygenic risk score reflects the cumulative impact of thousands of small risk effects spread across their genome.

While polygenic scores are probabilistic and estimate only inherited genetic susceptibility, monogenic mutations convey deterministic information about disease occurrence. However, for many individuals with elevated polygenic risk scores, modifiable lifestyle and environmental factors may outweigh their inherited predisposition, allowing prevention through early intervention.

GWAS and Polygenic Scores

Human genome-wide association studies utilize DNA microarray ‘chips’ containing hundreds of thousands to millions of SNPs across the genome. Comparing SNP frequencies between thousands of disease cases and controls reveals variations associated with disease diagnosis. Each SNP represents a common genetic variant, typically one present in at least 1-5% of the population. Individually, SNP effects on disease risk are very modest, usually less than a 20% increase in relative risk.
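The case-control comparison for a single SNP can be sketched in a few lines. This is a minimal illustration only: the function name `allelic_odds_ratio` and the allele counts are hypothetical, and a real GWAS would also adjust for ancestry and other covariates and apply a genome-wide significance threshold.

```python
import math

def allelic_odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds ratio and 95% confidence interval from a 2x2 table of allele counts."""
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) using Woolf's method
    se_log_or = math.sqrt(
        1 / case_risk + 1 / case_other + 1 / control_risk + 1 / control_other
    )
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - 1.96 * se_log_or), math.exp(log_or + 1.96 * se_log_or))
    return odds_ratio, ci

# Hypothetical counts: 2,000 cases and 2,000 controls, two alleles per person
or_, (lo, hi) = allelic_odds_ratio(
    case_risk=1320, case_other=2680, control_risk=1200, control_other=2800
)
print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With these made-up counts the risk allele is only slightly more common in cases than in controls, which is the kind of modest per-SNP effect described above.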

However, by aggregating the effects of disease-associated SNPs, polygenic risk models can categorize individuals along a spectrum of low to high inherited risk. Polygenic scores typically explain 7-12% of disease variance, though up to 25% for some cancers. The more powerful the original GWAS in terms of sample size, the better the polygenic score will be at predicting an individual’s predisposition.

Constructing Polygenic Scores

Various methods exist for constructing polygenic scores after identifying disease-associated SNPs through GWAS. Most commonly, a SNP effect size is multiplied by the number of risk alleles (0, 1 or 2) for that SNP in a given individual. These products are summed across all chosen SNPs to derive an overall polygenic risk score. SNPs strongly associated with disease receive more weight than weakly associated markers.
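The weighted sum described above can be written as a short function. This is a minimal sketch, not a production scoring tool: the SNP identifiers and effect sizes are hypothetical, the effect sizes are assumed to be log odds ratios, and missing genotypes are treated as zero risk alleles for simplicity.

```python
def polygenic_score(effect_sizes, dosages):
    """Sum of (effect size x risk-allele count) over the chosen SNPs."""
    score = 0.0
    for snp, beta in effect_sizes.items():
        count = dosages.get(snp, 0)  # simplification: missing genotype counts as 0
        if count not in (0, 1, 2):
            raise ValueError(f"risk-allele count for {snp} must be 0, 1 or 2")
        score += beta * count
    return score

# Hypothetical effect sizes (assumed to be log odds ratios from a GWAS)
effect_sizes = {"rs_a": 0.18, "rs_b": 0.05, "rs_c": -0.07}
# Hypothetical individual: number of risk alleles carried at each SNP
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

prs = polygenic_score(effect_sizes, person)
print(round(prs, 2))
```

The strongly associated SNP (`rs_a`) contributes most of the score, which reflects how larger effect sizes receive more weight.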

Rigorous validation in independent sample sets evaluates the predictive performance of polygenic scores. Optimal SNP inclusion thresholds are selected to maximize predictive ability. Polygenic models lose power with too few or too many SNPs included. Ideal thresholds retain SNPs explaining at least 0.01% of disease variance based on GWAS significance levels.
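One way to picture threshold selection is to build a score at several GWAS p-value cutoffs and keep the cutoff whose score best tracks the outcome in an independent validation sample. The sketch below assumes squared Pearson correlation as the performance measure; the function names, SNP identifiers, cutoffs and data are all hypothetical, and real analyses use far larger samples and measures such as AUC.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def best_threshold(gwas, genotypes, phenotypes, thresholds):
    """Return the p-value cutoff with the highest squared correlation, plus all results."""
    results = {}
    for cutoff in thresholds:
        # Keep only SNPs whose GWAS p-value passes this cutoff
        kept = {snp: beta for snp, (beta, p) in gwas.items() if p <= cutoff}
        scores = [
            sum(beta * person.get(snp, 0) for snp, beta in kept.items())
            for person in genotypes
        ]
        results[cutoff] = pearson(scores, phenotypes) ** 2
    best = max(results, key=results.get)
    return best, results

# Hypothetical GWAS summary statistics: SNP -> (effect size, p-value)
gwas = {"rs_a": (0.30, 1e-9), "rs_b": (0.12, 1e-4), "rs_c": (-0.25, 0.40)}
# Hypothetical validation sample: risk-allele counts and disease status
genotypes = [
    {"rs_a": 0, "rs_b": 0, "rs_c": 2},
    {"rs_a": 0, "rs_b": 1, "rs_c": 0},
    {"rs_a": 1, "rs_b": 0, "rs_c": 1},
    {"rs_a": 1, "rs_b": 2, "rs_c": 2},
    {"rs_a": 2, "rs_b": 1, "rs_c": 0},
    {"rs_a": 2, "rs_b": 2, "rs_c": 1},
]
phenotypes = [0, 0, 0, 1, 1, 1]
thresholds = [5e-8, 1e-3, 1.0]

best, results = best_threshold(gwas, genotypes, phenotypes, thresholds)
print(best, results)
```

The validation sample must be independent of the GWAS that produced the effect sizes; otherwise the chosen cutoff will look better than it really is.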

Applications and Limitations

Polygenic risk models are currently most advanced for coronary artery disease, breast and prostate cancer, type 2 diabetes and inflammatory bowel disease. Potential clinical applications include:

  • Risk stratification to guide evidence-based screening recommendations beyond family history.
  • Targeted prevention and lifestyle modification for individuals at elevated genetic risk.
  • Informing reproductive decision-making and genetic counseling based on polygenic risk.
  • Improving disease prediction, subtyping and prognosis when combined with clinical risk factors.

However, limitations and ethical concerns exist around polygenic score implementation:

  • Most heritability remains unexplained. Adding more SNPs only incrementally improves prediction.
  • Polygenic testing may prompt unnecessary interventions if clinical validity and utility are not adequately demonstrated.
  • Possible psychological harm and discrimination arising from probabilistic genetic risk information.
  • Unequal health benefits if not equitably implemented across populations.

While polygenic scores currently identify individuals with modestly increased or decreased disease risks, their predictive utility is anticipated to grow substantially with million-person biobank efforts and whole-genome sequencing. Harnessing the full spectrum of genomic variation contributing to polygenic inheritance will enable more personalized risk assessment and clinical decision-making for complex chronic diseases.



Clinical Trials vs. Real-World Data: Understanding the Differences and Complementary Roles

By | Clinical Trials

Key Takeaways:

  • Clinical trials are controlled experiments designed to evaluate safety and efficacy of new drugs or devices. Real-world data comes from more diverse, less controlled sources like electronic health records and medical claims.
  • Clinical trials have strict inclusion/exclusion criteria and measure predefined outcomes. Real-world data reflects broader populations with various comorbidities and outcomes.
  • Clinical trials are required for regulatory approval but have limitations like small sample sizes. Real-world evidence can complement trials with larger volumes of data over longer time periods.
  • Real-world data comes from routine clinical practice rather than protocol-driven trials. It provides supplementary information on effectiveness and safety.
  • Limitations of real-world data include lack of randomization, potential biases and confounders. Analytic methods help account for these limitations.
  • Real-world evidence has growing applications in medical product development, post-market surveillance, regulatory decisions and clinical guideline development.

Clinical Trials vs. Real-World Data

Clinical trials are prospective studies that systematically evaluate the safety and efficacy of investigational drugs, devices or treatment strategies in accordance with predefined protocols and statistical analysis plans. They are considered the gold standard for assessing the benefits and risks of medical interventions prior to regulatory approval. In clinical trials, participants are assigned to receive an investigational product or comparator/placebo according to a randomized scheme. These studies are designed to minimize bias and carefully control variables that may affect outcomes. Participants are closely monitored per protocol, and data are collected at prespecified time points. The resulting evidence from randomized controlled trials serves as the primary basis for regulatory decisions regarding drug and device approvals.

In contrast, real-world data (RWD) refers to data derived from various non-experimental or observational sources that reflect routine clinical practice. Sources of RWD include electronic health records (EHRs), medical claims, registry data and patient-generated data from mobile devices, surveys or wearables. Real-world evidence (RWE) is the clinical evidence generated from aggregation and analysis of RWD. While clinical trials evaluate medical products under ideal, controlled conditions in limited samples of patients, RWD offers information about usage, effectiveness and safety in broader patient populations in real-world settings.

Some key differences between clinical trials and real-world data:

  • Sample Populations – Clinical trials have strict inclusion and exclusion criteria, resulting in homogeneous samples that often underrepresent minorities, elderly, pediatric and complex patient groups. RWD reflects more diverse real-world populations with various comorbidities and concomitant medications.

  • Settings – Clinical trials are conducted at specialized research sites under tightly controlled conditions. RWD comes from routine care settings like hospitals, clinics and pharmacies across diverse geographies and populations.
  • Interventions – Clinical trials administer interventions per protocol. RWD reflects variabilities in real-world treatment patterns and patient adherence.
  • Outcomes – Clinical trials measure prespecified outcomes over limited timeframes. RWD captures broader outcomes like patient-reported outcomes, quality of life, hospitalizations and costs over longer periods in real-world practice.
  • Data Collection – Clinical trials collect data per protocol at predefined assessment points. RWD is collected during routine care and reflected in patient records and claims.
  • Sample Size – Clinical trials often have small sample sizes with a few hundred to several thousand patients. RWD encompasses data from tens or hundreds of thousands of patients.
  • Randomization – Clinical trials use randomization to minimize bias when assigning interventions. RWD studies are observational without the benefits of randomization.

While randomized controlled trials provide high quality evidence for drug/device approvals and clinical recommendations, RWD offers complementary information on effectiveness, safety, prescribing patterns and health outcomes:

  • RWD can provide broader demographic representation for subpopulations underrepresented in trials.
  • RWD can inform on long-term safety, durability of treatment effects and comparative effectiveness between therapies.
  • RWD can provide larger sample sizes to study rare events or outcomes.
  • RWD can reflect real-world utilization rates, switching patterns and adherence to therapies.
  • RWD offers granular data for personalized medicine, risk identification, prediction modeling and tailored interventions.
  • RWD is more timely, cost-effective and scalable than conducting large trials.

However, RWD has inherent limitations compared to clinical trials:

  • Lack of randomization increases potential for bias and confounding.
  • Incomplete data or misclassification errors are common with medical records.
  • Inability to firmly conclude causality due to observational nature.
  • Possible selection biases and variations in care delivery across settings.
  • Inconsistencies in definitions, coding, documentation practices over time and sites.

Analytical methods help account for these limitations when generating real-world evidence from RWD:

  • Advanced analytics like machine learning can identify trends and associations within large RWD.
  • Predictive modeling and simulations can estimate treatment effects.
  • Confounder adjustment, stratification, patient matching and propensity scoring help reduce bias.
  • Expert review of data and methodology helps ensure reliability.
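Stratification, one of the adjustment methods listed above, can be illustrated with a small worked example. The sketch below uses entirely hypothetical records in which sicker patients are more likely to be treated, so the crude comparison makes the treatment look harmful while the stratum-weighted comparison does not. The function and field names are assumptions for illustration, and real analyses usually adjust for many confounders at once, for example with propensity scores.

```python
from collections import defaultdict

def risk(rows):
    """Proportion of rows in which the event occurred."""
    return sum(r["event"] for r in rows) / len(rows)

def crude_risk_difference(rows):
    """Treated minus untreated risk, ignoring confounders."""
    treated = [r for r in rows if r["treated"]]
    untreated = [r for r in rows if not r["treated"]]
    return risk(treated) - risk(untreated)

def stratified_risk_difference(rows, key):
    """Risk difference within each stratum, averaged with stratum-size weights."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[key]].append(r)
    weighted_sum, total = 0.0, 0
    for group in strata.values():
        treated = [r for r in group if r["treated"]]
        untreated = [r for r in group if not r["treated"]]
        if not treated or not untreated:
            continue  # no within-stratum comparison is possible
        weighted_sum += len(group) * (risk(treated) - risk(untreated))
        total += len(group)
    if total == 0:
        raise ValueError("no stratum contains both treated and untreated patients")
    return weighted_sum / total

def make_rows(severity, treated, n, events):
    """Build n hypothetical patient records, the first `events` of which had the event."""
    return [
        {"severity": severity, "treated": treated, "event": int(i < events)}
        for i in range(n)
    ]

# Hypothetical cohort: severe patients are treated more often (confounding by indication)
rows = (
    make_rows("high", True, 80, 32)     # 40% event rate
    + make_rows("high", False, 20, 10)  # 50% event rate
    + make_rows("low", True, 20, 2)     # 10% event rate
    + make_rows("low", False, 80, 16)   # 20% event rate
)

crude = crude_risk_difference(rows)
adjusted = stratified_risk_difference(rows, "severity")
print(f"crude: {crude:+.2f}, stratified: {adjusted:+.2f}")
```

In this made-up cohort the crude difference is positive even though treatment lowers risk within both severity groups, which is the kind of bias these methods are meant to reduce.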

Applications of RWE are expanding and gaining acceptance from key stakeholders:

  • Supplement clinical trial data for regulatory, coverage and payment decisions around medical products.
  • Post-market surveillance of drug and device safety and utilization in real-world practice.
  • Life cycle evidence generation for new indications, formulations, combination products.
  • Provide inputs into clinical guidelines by professional societies.
  • Risk identification/stratification, predictive modeling and personalized medicine.
  • Value-based contracting between manufacturers and payers.
  • Risk management and safety programs for hospitals and health systems.

In summary, clinical trials provide foundational evidence to introduce new medical products, while RWE offers complementary insights on effectiveness, safety, prescribing patterns and health outcomes at a larger scale across diverse real-world populations. Advanced analytics help derive meaningful RWE from RWD, with growing applications across the healthcare and life sciences ecosystem. Together, these sources of evidence offer a multifaceted understanding to guide optimal use of medical products and improve patient care.



The Role of Polygenic Risk Scores in Clinical Genomics

By | Clinical Genomics


We were promised the end to genetic diseases. All we needed to do was unlock the human genome. Unfortunately, life has a way of being more complicated than we expect. It turned out that many genetic disorders are the result of the interplay between multiple genetic factors. This set off the need for improved analytical tools to analyze human genetics that could interrogate the associations of many genetic backgrounds and link them to various diseases. One such technique, the Polygenic Risk Score (PRS), emerged as a powerful tool to quantify the cumulative effects of multiple genetic variants on an individual’s predisposition to a specific disease.

The Evolution of Polygenic Risk Scores

The genesis of PRS can be traced back to the early 2000s when researchers sought to comprehend the collective impact of multiple genetic variants on disease susceptibility. Initially viewed through a biological lens, the focus was on enhancing the prediction of diseases by analyzing subtle genomic variations. Studies concentrated on prevalent yet complex diseases such as diabetes, cardiovascular diseases, and cancer, laying the groundwork for a comprehensive understanding of their genetic architecture.

A turning point came when Dr. Sekar Kathiresan's group showed that a PRS could identify individuals whose risk was equivalent to that conferred by a monogenic mutation (Khera et al., 2018). Instead of comparing the distribution of scores between people with and without a disease, the group compared disease risk between the people with the highest and the lowest scores, which made the effect much more obvious. This revealed a large difference in risk between the two extremes of the population.
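The comparison between the two ends of the score distribution can be sketched as follows. This is an illustration with made-up data, not the analysis from the cited study; the function name, the 25% group size and the sample are all hypothetical.

```python
def extreme_group_risk_ratio(people, fraction=0.2):
    """Disease risk in the top and bottom score groups, and their ratio.

    people: list of (score, has_disease) pairs, with has_disease as 0 or 1.
    """
    ranked = sorted(people, key=lambda p: p[0])
    k = max(1, int(len(ranked) * fraction))
    bottom, top = ranked[:k], ranked[-k:]
    bottom_risk = sum(disease for _, disease in bottom) / k
    top_risk = sum(disease for _, disease in top) / k
    if bottom_risk == 0:
        raise ValueError("no cases in the bottom group; ratio is undefined")
    return top_risk, bottom_risk, top_risk / bottom_risk

# Hypothetical sample of 20 people with scores 1..20 and disease status
disease_flags = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
people = list(zip(range(1, 21), disease_flags))

top_risk, bottom_risk, ratio = extreme_group_risk_ratio(people, fraction=0.25)
print(f"top: {top_risk:.0%}, bottom: {bottom_risk:.0%}, ratio: {ratio:.1f}")
```

Comparing the extremes emphasizes the contrast, but it says nothing about the majority of people whose scores fall in the middle of the distribution.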

In the initial stages, PRSs consisted of only the most statistically significant variants from genome-wide association studies. Geneticists often added up the quantity of risk variants without giving them a weight for how much of an impact they had on whether someone would get a disease. Refining these scores led scientists to challenge arbitrary risk cutoffs and advocate for the inclusion of all variants to maximize statistical power (based on the assumption that, on average, variants that have no effect are evenly distributed to appear positively or negatively correlated to the trait). However, proximity of variants on a chromosome presented another challenge. If variants were closer together on a chromosome, they would be less likely to be separated during recombination (Linkage Disequilibrium). This would result in them carrying the signal of something that had a true effect, potentially leading to an overcounting of that signal.

To deal with this, geneticists used tools to remove signals within a specified block unless their correlation with the strongest signal fell below a threshold. One of the first packages, PRSice (Choi & O’Reilly, 2019), used an approach called Pruning and Thresholding. Scientists would choose a block size, say, 200,000 base pairs. A program would go through and slide that block along the genome. If there was more than a single signal in that block, the program would remove (or “prune”) all but the strongest signal unless the variant had a smaller correlation with the strongest signal than the “threshold”. The result was that in a region with many different variants that affected the risk of a disease, but which were still a bit correlated, signal could be lost.
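The pruning step can be sketched as a greedy procedure, often called clumping. The code below is a simplified illustration under stated assumptions, not PRSice's actual implementation: all variants are assumed to lie on one chromosome, pairwise correlations (r²) are supplied in a lookup table, and the variant names, positions, p-values and r² threshold are hypothetical.

```python
def prune_variants(variants, r2, window=200_000, r2_threshold=0.1):
    """Greedy pruning: keep the strongest signal, drop nearby correlated variants.

    variants: list of dicts with 'id', 'pos' (base pairs) and 'p' (GWAS p-value).
    r2: dict mapping frozenset({id1, id2}) to squared correlation; missing pairs count as 0.
    """
    kept, removed = [], set()
    # Visit variants from strongest to weakest association
    for lead in sorted(variants, key=lambda v: v["p"]):
        if lead["id"] in removed:
            continue
        kept.append(lead["id"])
        for other in variants:
            if other["id"] == lead["id"] or other["id"] in removed or other["id"] in kept:
                continue
            close = abs(other["pos"] - lead["pos"]) <= window
            correlated = r2.get(frozenset((lead["id"], other["id"])), 0.0) >= r2_threshold
            if close and correlated:
                removed.add(other["id"])  # likely the same underlying signal
    return kept

# Hypothetical variants on a single chromosome
variants = [
    {"id": "A", "pos": 100_000, "p": 1e-12},
    {"id": "B", "pos": 150_000, "p": 1e-8},
    {"id": "C", "pos": 180_000, "p": 1e-6},
    {"id": "D", "pos": 900_000, "p": 1e-5},
]
# Hypothetical pairwise r² values
r2 = {
    frozenset(("A", "B")): 0.80,
    frozenset(("A", "C")): 0.02,
    frozenset(("B", "C")): 0.05,
}

kept = prune_variants(variants, r2)
print(kept)
```

Here B is dropped because it sits close to A and is strongly correlated with it, while C survives despite being nearby because its correlation with A falls below the threshold. With a stricter threshold, a partially independent signal like C would be lost, which is the weakness described above.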

Criticism from biostatisticians prompted a shift towards a Bayesian approach, reducing over-counting while better accounting for partially independent signals. Implementation was challenged by the extensive computational resources needed to update the signal at each genetic location based on linkage disequilibrium of the surrounding SNPs. One program, called PRS-CS (Ge et al., 2019), implemented a method that could apply changes to a whole linkage block at once, addressing both the geneticist demand for a good system that can provide results using the computation tools we have and the biostatistician demand for accuracy and retained information.

Despite these advancements, accuracy challenges persisted, particularly when applying scoring systems across populations with different genetic ancestries. It turned out Linkage Disequilibrium was a pervasive problem. The patterns of Linkage Disequilibrium are different in people with different genetic ancestries. In fact, even statistics about the patterns themselves, like how big an average block size is, are different. Recognizing the need for improvement, ongoing efforts in refining PRSs aim to address these challenges, paving the way for more accurate and reliable applications. As researchers delve deeper into these complexities, the evolving landscape of PRSs continues to shape the future of clinical research.

Polygenic Risk Scores in Clinical Research Settings

To harness the full potential of PRS in clinical practice, a crucial shift is needed—from population-level insights to personalized predictions for individual patients. This transformation involves converting relative risks, which compare individuals across the PRS spectrum with a baseline group, into absolute risks for the specific disease (Lewis & Vassos, 2020). The current emphasis is on identifying individuals with a high genetic predisposition to disease, forming the foundation for effective risk stratification. This information guides decisions related to participation in screening programs, lifestyle modifications, or preventive treatments when deemed suitable.
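Converting a relative risk into an absolute risk can be sketched as follows. This is a minimal illustration, assuming the relative risk is a hazard ratio under proportional hazards and that the baseline group's absolute risk over the same time horizon is known; the function name and the numbers are hypothetical and are not taken from the cited studies.

```python
def absolute_risk(baseline_risk, hazard_ratio):
    """Absolute risk over the baseline's time horizon, assuming proportional hazards.

    Uses the relation S_1(t) = S_0(t) ** HR between the survival (disease-free)
    probabilities of the comparison group and the baseline group.
    """
    if not 0 <= baseline_risk < 1:
        raise ValueError("baseline_risk must be in [0, 1)")
    if hazard_ratio <= 0:
        raise ValueError("hazard_ratio must be positive")
    return 1 - (1 - baseline_risk) ** hazard_ratio

# Hypothetical inputs: 6% lifetime risk in the baseline group, hazard ratio of 3
risk_high_prs = absolute_risk(baseline_risk=0.06, hazard_ratio=3)
print(f"{risk_high_prs:.1%}")
```

Under these assumptions a threefold relative risk corresponds to an absolute risk somewhat below three times the baseline, because the probability of remaining disease-free compounds rather than scaling linearly.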

In practical applications, PRS demonstrates promise in patient populations with a high likelihood of disease. Consider a recent study in an East Asian population, where researchers developed a PRS for Coronary Artery Disease (CAD) using 540 genetic variants (Lu et al., 2022). Tested on 41,271 individuals, the top 20% had a three-fold higher risk of CAD compared to the bottom 20%, with lifetime risks of 15.9% and 5.8%, respectively. Adding PRS to clinical risk assessment slightly improved accuracy. Notably, individuals with intermediate clinical risk and high PRS reached risk levels similar to high clinical risk individuals with intermediate PRS, indicating the potential of PRS to refine risk assessment and identify those requiring targeted interventions for CAD.

Another application of PRS lies in improving screening for individuals with major disease risk alleles (Roberts et al., 2023). A recent breast cancer risk assessment study explored pathogenic variants in high and moderate-risk genes (Gao et al., 2021). Over 95% of BRCA1, BRCA2, and PALB2 carriers had a lifetime breast cancer risk exceeding 20%. Conversely, integrating PRS identified over 30% of CHEK2 and almost half of ATM carriers below the 20% threshold.

This trend extends to other diseases, such as prostate cancer, where a separate investigation focused on men with elevated levels of prostate-specific antigen (PSA) (Shi et al., 2023). Through the application of PRS, researchers pinpointed over 100 genetic variations linked to increased PSA levels. Ordinarily, such elevated PSA levels would prompt prostate biopsies to assess potential prostate cancer. By incorporating PRS into the screening process, doctors could have accounted for the natural variation in PSA level and prevented unnecessary escalation of clinical care. These two studies suggest that PRS integration into health screening enhances accuracy, preventing unnecessary tests and enabling more personalized risk management.

In the realm of pharmacogenetics, efforts to optimize treatment responses continue. While progress has been made in identifying rare high-risk variants linked to adverse drug events, predicting treatment effectiveness remains challenging. The evolving role of PRS in treatment response is particularly evident in statin use for reducing initial coronary events. In a real-world cohort without prior myocardial infarction, an investigation revealed that statin effectiveness varied based on CHD PRSs, with the highest impact in the high-risk group, intermediate in the intermediate-risk group, and the smallest effect in the low-risk group (Oni-Orisan et al., 2022). Post-hoc analyses like this for therapeutics could potentially allow for more targeted enrollment for clinical trial design, substantially reducing the number of participants needed to demonstrate trial efficacy (Fahed et al., 2022).


As the field of genetics continues to advance, PRSs emerge as a potent tool with the potential to aid clinical research. Validated PRSs show promise in enhancing the design and execution of clinical trials, refining disease screening, and developing personalized treatment strategies to improve the overall health and well-being of patients. However, it’s crucial to acknowledge that the majority of PRS studies heavily rely on biased datasets of European ancestry. To refine and improve PRS, a comprehensive understanding of population genetic traits for people of all backgrounds, such as linkage disequilibrium, is essential. Moving forward, the integration of PRS into clinical applications must prioritize datasets with diverse ancestry to ensure equitable and effective utilization across all patient backgrounds. As research in this field progresses, the incorporation of PRS is poised to become an indispensable tool for expediting the development of safer and more efficacious therapeutics.


Choi, S. W., & O’Reilly, P. F. (2019). PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience, 8(7).

Fahed, A. C., Philippakis, A. A., & Khera, A. V. (2022). The potential of polygenic scores to improve cost and efficiency of clinical trials. Nature Communications, 13(1), 2922.

Gao, C., Polley, E. C., Hart, S. N., Huang, H., Hu, C., Gnanaolivu, R., Lilyquist, J., Boddicker, N. J., Na, J., Ambrosone, C. B., Auer, P. L., Bernstein, L., Burnside, E. S., Eliassen, A. H., Gaudet, M. M., Haiman, C., Hunter, D. J., Jacobs, E. J., John, E. M., … Kraft, P. (2021). Risk of Breast Cancer Among Carriers of Pathogenic Variants in Breast Cancer Predisposition Genes Varies by Polygenic Risk Score. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 39(23), 2564–2573.

Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications, 10(1), 1776.

Khera, A. V., Chaffin, M., Aragam, K. G., Haas, M. E., Roselli, C., Choi, S. H., Natarajan, P., Lander, E. S., Lubitz, S. A., Ellinor, P. T., & Kathiresan, S. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50(9), 1219–1224.

Lewis, C. M., & Vassos, E. (2020). Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1), 44.

Lu, X., Liu, Z., Cui, Q., Liu, F., Li, J., Niu, X., Shen, C., Hu, D., Huang, K., Chen, J., Xing, X., Zhao, Y., Lu, F., Liu, X., Cao, J., Chen, S., Ma, H., Yu, L., Wu, X., … Gu, D. (2022). A polygenic risk score improves risk stratification of coronary artery disease: a large-scale prospective Chinese cohort study. European Heart Journal, 43(18), 1702–1711.

Oni-Orisan, A., Haldar, T., Cayabyab, M. A. S., Ranatunga, D. K., Hoffmann, T. J., Iribarren, C., Krauss, R. M., & Risch, N. (2022). Polygenic Risk Score and Statin Relative Risk Reduction for Primary Prevention of Myocardial Infarction in a Real-World Population. Clinical Pharmacology and Therapeutics, 112(5), 1070–1078.

Roberts, E., Howell, S., & Evans, D. G. (2023). Polygenic risk scores and breast cancer risk prediction. Breast (Edinburgh, Scotland), 67, 71–77.

Shi, M., Shelley, J. P., Schaffer, K. R., Tosoian, J. J., Bagheri, M., Witte, J. S., Kachuri, L., & Mosley, J. D. (2023). Clinical consequences of a genetic predisposition toward higher benign prostate-specific antigen levels. EBioMedicine, 97, 104838.


Bridging the Gap: How AI Companies in the TechBio Space are Revolutionizing Biopharma using Genomics and Drug Response Data

By | AI


Innovations in Artificial Intelligence (AI) have propelled pharmaceutical companies to revolutionize their approaches to designing, testing, and bringing precision medicine and healthcare solutions to the market. Two key elements in advancing precision medicine include early disease detection and understanding drug responders within distinct populations. By leveraging genomics and clinical notes, AI companies, specifically in the TechBio space, are transforming the way biopharma industries identify, understand, and cater to individuals rather than whole populations.

The Challenge: Precision Medicine and Drug Response

Traditional drug development methods often analyze the success of a drug treatment as its effect on a patient population, leading to highly variable outcomes and adverse effects among individual patients. This is despite the fact that for many diseases the underlying mechanisms driving symptoms can be quite different from person to person. This lack of individualization in treatment can hinder therapeutic efficacy at the group level despite effectiveness for certain individuals. If we hope to accelerate drug development to get cures into the hands of people faster, future research needs intelligent, cost-effective methods to stratify patients based on the contribution of different disease mechanisms and drug processing capabilities. AI companies are helping biopharma address this challenge by systematically incorporating genomics and insights garnered from individuals’ de-identified clinical charts.

Genomics: The Blueprint of Personalization

The genomic revolution has undoubtedly paved the way for precision medicine in Biopharma. By analyzing an individual’s genetic data, scientists can identify variations that may influence drug metabolism and response. This approach has already proven highly effective, particularly in the case of breast cancer patients. In some instances of breast cancer, there is an overexpression of the HER2/neu protein (Gutierrez & Schiff, 2011). When genomic markers for this overexpression are identified, anti-HER2 antibodies can be incorporated into the treatment regimen, significantly enhancing survival rates. AI companies are at the forefront of continuing this research by utilizing genomics for the creation of genetic sub-groups essential for biomarker discovery and predicting the most effective drug treatments for individual patients (Quazi, 2022).

Disease Detection and Monitoring with AI-Enhanced Biomarker Research

Early detection and monitoring of disease progression are paramount for improving patient survival rates. Traditionally, biomarker research has focused on identifying individual molecules or transcripts that can serve as early indicators of future severe illness. However, the field is evolving beyond the notion of a single-molecule biomarker diagnostic. Instead, it is turning to AI to examine the relationships between molecules and transcripts, offering a more comprehensive approach to identifying the onset of significant diseases (Vazquez-Levin et al., 2023). Over the past decade, cancer research and clinical decision-making have undergone a significant transformation, shifting from qualitative data to a wealth of quantitative digital information.

Universities and clinical institutions globally have contributed a vast trove of biomarkers and imaging data. This extensive dataset encompasses insights from genomics, proteomics, metabolomics, and various omics disciplines, as well as inputs from oncology clinics, epidemiology, and medical imaging. AI, uniquely positioned to integrate this diverse information, holds the potential to spearhead the development of pioneering predictive models for drug responses, paving the way for groundbreaking advancements in disease diagnosis, treatment prediction, and overall decision-making concerning novel therapies. 

With growing collections of data, it is becoming easier to model how a drug will shift an individual’s biology for worse or better. A recent example of this modelling is the Cancer Patient Digital Twin (CPDT) project, where multimodal temporal data collected from cancer patients can be used to build a Digital Twin (a virtual replica of a patient’s biological processes and health status), allowing for in silico experimentation, which may guide testing, treatment, or decision points (Stahlberg et al., 2022).

One example is how the detection of metastatic disease over time could be improved from radiology reports. Researchers exposed prediction models to historical information using Natural Language Processing (NLP) (Batch et al., 2022). The authors were able to extract and encode relevant features from medical text reports, and use these features to develop, train, and validate models. More than 700,000 radiology reports were used for model development to predict the presence of metastatic disease. Results from this study suggest that NLP models can extract cancer progression patterns from multiple consecutive reports and predict the presence of metastatic disease in multiple organs with higher performance than previous analytical techniques. Early knowledge of disease states or disease risk could lead to revised risk-benefit assessments for treatments and testing, potentially influencing patients’ choices. As a result, patients with otherwise comparable profiles may opt for treatments or tests they would not have otherwise considered. Even in cases where we do not have good biomarkers for disease (for example, Alzheimer’s disease, where most of the biomarkers are quite invasive to collect), knowing that a person has a higher disease risk earlier can enable important research that can lead to better biomarkers and, ultimately, better treatments.

AI-Driven Pharmacogenomics: Revolutionizing Precision Medicine and Clinical Trials

While traditional approaches have paved the way for tailored medical treatments, the integration of AI can accelerate these efforts by leveraging an individual’s genetic information. Consider the case of Warfarin, a widely prescribed anticoagulant. Accurate dosing of Warfarin is critical at the start of treatment, a period that carries higher risks of bleeding and clotting. Over decades, dose-response models have been developed to better understand how this drug affects the human body (Holford, 1986). To improve Warfarin anticoagulation therapy, dosing algorithms have incorporated genetic information to help identify factors that influence response, such as Warfarin clearance rate, and thereby improve dosing (Gong et al., 2011).

Now, with the power of AI, researchers can expedite the personalization of treatments for various disorders and medications, similar to what was accomplished with Warfarin but in a fraction of the time. AI algorithms are starting to analyze an individual’s genetic profile to predict their specific responses to various medications. This approach enables healthcare providers to fine-tune treatment plans, taking into account an individual’s unique genetic makeup, thus optimizing the effectiveness of therapies and reducing the potential for adverse effects. The integration of AI not only enhances the precision of pharmacogenomics but also streamlines the process, ultimately leading to safer and more efficient medical care tailored to each patient’s genetic characteristics.

The ultimate aspiration is to develop a sophisticated AI-driven system that can accurately forecast how each individual will react to specific medications, with the potential to bypass the conventional, time-consuming method of starting with the lowest effective dose and incrementally adjusting it upwards. This trial-and-error approach often leads to prolonged periods of uncertainty and potential adverse side effects for patients. Such advancements not only boost the precision of healthcare but also elevate the overall quality of life for patients seeking rapid relief and improved well-being.

Moreover, the integration of AI in pharmacogenomics has the potential to significantly expedite clinical trial programs. By helping tailor medication doses to specific genetic backgrounds, AI could support each phase of the clinical trial process. This approach could streamline trials and offer substantial time and cost savings. Tailoring treatments to different genetic subgroups can make clinical trials more efficient, bringing new therapies to market faster and ultimately benefiting patients in need.


The union of genomics and clinical notes, facilitated by AI, is ushering in a new era of precision medicine in biopharma. With the ability to predict individual drug responses and identify targeted therapies, this approach holds immense promise for improved treatment outcomes and a patient-centric view of medicine. As AI companies continue to advance their capabilities, the future of precision medicine for many diseases is looking closer than ever. The key to unlocking its full potential lies in the availability of high-quality data that comprehensively spans the entire patient journey. The integration of such diverse health-related data is central to driving valuable insights for drug development, making AI a driving force in the future of healthcare.



Batch, K. E., Yue, J., Darcovich, A., Lupton, K., Liu, C. C., Woodlock, D. P., El Amine, M. A. K., Causa-Andrieu, P. I., Gazit, L., Nguyen, G. H., Zulkernine, F., Do, R. K. G., & Simpson, A. L. (2022). Developing a Cancer Digital Twin: Supervised Metastases Detection From Consecutive Structured Radiology Reports. Frontiers in Artificial Intelligence, 5.

Gong, I. Y., Schwarz, U. I., Crown, N., Dresser, G. K., Lazo-Langner, A., Zou, G., Roden, D. M., Stein, C. M., Rodger, M., Wells, P. S., Kim, R. B., & Tirona, R. G. (2011). Clinical and genetic determinants of warfarin pharmacokinetics and pharmacodynamics during treatment initiation. PloS One, 6(11), e27808.

Gutierrez, C., & Schiff, R. (2011). HER2: biology, detection, and clinical implications. Archives of Pathology & Laboratory Medicine, 135(1), 55–62.

Holford, N. H. (1986). Clinical pharmacokinetics and pharmacodynamics of warfarin. Understanding the dose-effect relationship. Clinical Pharmacokinetics, 11(6), 483–504.

Quazi, S. (2022). Artificial intelligence and machine learning in precision and genomic medicine. Medical Oncology (Northwood, London, England), 39(8), 120.

Stahlberg, E. A., Abdel-Rahman, M., Aguilar, B., Asadpoure, A., Beckman, R. A., Borkon, L. L., Bryan, J. N., Cebulla, C. M., Chang, Y. H., Chatterjee, A., Deng, J., Dolatshahi, S., Gevaert, O., Greenspan, E. J., Hao, W., Hernandez-Boussard, T., Jackson, P. R., Kuijjer, M., Lee, A., … Zervantonakis, I. (2022). Exploring approaches for predictive cancer patient digital twins: Opportunities for collaboration and innovation. Frontiers in Digital Health, 4, 1007784.

Vazquez-Levin, M. H., Reventos, J., & Zaki, G. (2023). Editorial: Artificial intelligence: A step forward in biomarker discovery and integration towards improved cancer diagnosis and treatment. Frontiers in Oncology, 13, 1161118.

EHR Data

Claims Data vs EHRs: Distinct but United in Real-World Research

By | EHR

In healthcare research, real-world data from patients is invaluable for gaining insights into disease patterns, treatment effectiveness, and outcomes. Two major sources of real-world data are claims data and electronic health records (EHRs). Both data types have distinct advantages and limitations that impact their utility for different research applications. This article examines the key differences between claims data and EHR data and how researchers can leverage both data sources to answer critical healthcare questions.

What is Claims Data?

Health insurance claims data contains information submitted by healthcare providers to payers to receive reimbursement for services rendered to patients. Claims data includes patient demographics (such as age, gender, and location), insurance details, diagnosis codes, procedure codes, prescription details, and cost and reimbursement information. Claims data provides a longitudinal view of a patient’s interactions across healthcare systems, as it captures data each time a claim is filed over months or years of care (Pivovarov et al., 2019).

Large claims clearinghouses aggregate data from millions of patients across different payers, providing massive real-world datasets. For example, IBM MarketScan databases contain claims information on over 240 million US patients collected from employers, health plans and government health programs (IBM Watson Health, 2022). Other major claims aggregators include Optum Clinformatics, Premier Healthcare Database and PharMetrics Plus.

Key Details Captured in Claims Data

  • Patient demographics – age, gender, location, insurance details
  • Diagnoses – ICD diagnosis codes
  • Procedures – CPT and HCPCS procedure codes
  • Medications – NDC codes, dose, number of prescription refills
  • Costs – total and itemized costs, amount paid by payer and patient responsibility
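As a rough illustration of how the elements listed above fit together, the sketch below models a single claim line as a Python dataclass. The class and field names are hypothetical and do not correspond to any particular payer or vendor schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ClaimLine:
    """One billed service on a claim (illustrative field names only)."""
    patient_id: str
    service_date: date
    diagnosis_codes: tuple[str, ...]       # ICD diagnosis codes
    procedure_code: Optional[str] = None   # CPT or HCPCS code, if any
    ndc_code: Optional[str] = None         # National Drug Code for pharmacy claims
    days_supply: Optional[int] = None      # populated for prescription fills
    payer_paid: float = 0.0
    patient_paid: float = 0.0

    @property
    def total_paid(self) -> float:
        """Amount paid by the payer plus the patient's share."""
        return self.payer_paid + self.patient_paid

# A made-up office visit for a patient with type 2 diabetes.
office_visit = ClaimLine(
    patient_id="P001",
    service_date=date(2023, 3, 14),
    diagnosis_codes=("E11.9",),   # ICD-10-CM: type 2 diabetes without complications
    procedure_code="99213",       # CPT: established-patient office visit
    payer_paid=80.0,
    patient_paid=20.0,
)
```

Note what is absent: there is no field for a lab value, a blood pressure reading, or the clinician's reasoning, which is the gap EHR data fills.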

Claims data is extremely valuable for comparative effectiveness research, pharmacoepidemiology, health economics and outcomes research (Berger et al., 2017). The large sample sizes and longitudinal view make claims databases ideal for studying disease incidence, treatment patterns, medication adherence, healthcare costs and utilization across different patient demographics and therapeutic areas.

Limitations of Claims Data

While claims data offers unparalleled scale and longitudinal perspective, researchers must be aware of its limitations:

  • Diagnosis codes do not necessarily indicate a confirmed diagnosis; for example, a provider might submit a claim with a code for a diagnosis that is still being considered during a diagnostic workup.
  • Diagnosis and procedure codes may be inaccurate or incomplete if providers submit improper codes. Important clinical details are missing.
  • Prescription records lack information about whether the medication was taken as directed or refilled properly.
  • Available data elements are restricted to what is required for reimbursement. No additional clinical context is provided.
  • Inability to link family members or track individuals who change payers over time.
  • Variable data quality and completeness across different claims sources.
  • Biased sampling based on specific payer population characteristics. May not represent the general population.

Despite these limitations, claims data remains highly useful for epidemiologic studies, health economics research, population health analyses and other applications where large sample sizes are critical. Researchers should account for the nuances of claims data during study design and analysis.
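One common way to account for rule-out codes during study design is to require the same diagnosis on more than one service date before treating it as confirmed. The sketch below assumes a 30-day minimum gap and a simple list of (service date, codes) pairs; both the threshold and the function name are illustrative choices, not a standard taken from the sources cited here.

```python
from datetime import date

def has_confirmed_diagnosis(claims, code_prefix, min_gap_days=30):
    """Return True if a matching code appears on two service dates
    that are at least `min_gap_days` apart.

    `claims` is a list of (service_date, diagnosis_codes) pairs.
    """
    dates = sorted(
        {
            service_date
            for service_date, codes in claims
            if any(code.startswith(code_prefix) for code in codes)
        }
    )
    # If the earliest and latest matching dates are far enough apart,
    # at least one qualifying pair of claims exists.
    return bool(dates) and (dates[-1] - dates[0]).days >= min_gap_days

# A single coded visit, as might occur during a diagnostic workup.
rule_out = [(date(2023, 1, 5), ["E11.9"])]

# The same condition coded again at a later visit.
repeated = [
    (date(2023, 1, 5), ["E11.9"]),
    (date(2023, 3, 1), ["E11.65"]),
]

single_visit_confirmed = has_confirmed_diagnosis(rule_out, "E11")
repeat_visit_confirmed = has_confirmed_diagnosis(repeated, "E11")
```

A stricter rule reduces false positives from workup codes at the cost of missing patients with only one recorded encounter, so the threshold is a study-specific trade-off.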

What are Electronic Health Records?

Electronic health records (EHRs) are digital documentation of patient health information generated throughout clinical care. EHRs are maintained by healthcare organizations and contain various data elements documenting patient encounters, including (Hersh et al., 2013):

  • Demographics – age, gender, race, ethnicity, language
  • Medical history – conditions, diagnoses, allergies, immunizations, procedures
  • Medications – prescriptions, dosing instructions
  • Vital signs – blood pressure, heart rate, weight, height
  • Lab test results
  • Radiology images
  • Clinical notes – physician progress notes, discharge summaries

A key advantage of EHR data is its rich clinical context. While claims data only captures billing codes, EHRs include detailed narratives, quantitative measures, images and comprehensive documentation of each patient visit. This facilitates a better understanding of disease presentation and progression, treatment rationale and response, and patient complexity.

EHR databases aggregate records across large healthcare networks to compile real-world data on millions of patients. For instance, Vanderbilt University Medical Center’s Synthetic Derivative database contains de-identified medical records for over 3.7 million subjects and their BioVU® database contains over 310,000 DNA samples linked to de-identified medical records for genomics research (Roden et al., 2008).

Benefits of EHR Data

EHR data enables researchers to (Cowie et al., 2017):

  • Obtain granular clinical details beyond billing codes
  • Review physician notes and narratives for patient context
  • Link lab results, pathology reports, radiology images for additional phenotyping
  • Study unstructured data through natural language processing
  • Identify patient cohorts based on complex inclusion/exclusion criteria
  • Examine longitudinal disease patterns and treatment journeys

EHR data yields insights unattainable through claims data alone. The rich clinical details enable researchers to understand nuances in patient populations, disease manifestation and therapy response.
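As an illustration of cohort identification with criteria that billing codes alone cannot express, the sketch below selects adults with a type 2 diabetes code whose most recent HbA1c is at or above 7.0%. The record structure, lab name, and threshold are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LabResult:
    name: str
    value: float
    collected: date

@dataclass
class PatientRecord:
    patient_id: str
    age: int
    diagnoses: set[str]
    labs: list[LabResult] = field(default_factory=list)

def latest_lab(patient, name):
    """Return the most recently collected result for a lab, or None."""
    results = [lab for lab in patient.labs if lab.name == name]
    return max(results, key=lambda lab: lab.collected, default=None)

def in_cohort(patient):
    """Adults with a type 2 diabetes code and a latest HbA1c >= 7.0%."""
    if patient.age < 18:                                   # exclusion criterion
        return False
    if not any(code.startswith("E11") for code in patient.diagnoses):
        return False
    hba1c = latest_lab(patient, "hba1c")
    return hba1c is not None and hba1c.value >= 7.0

patients = [
    PatientRecord("A", 54, {"E11.9"}, [
        LabResult("hba1c", 8.1, date(2022, 6, 1)),
        LabResult("hba1c", 6.4, date(2023, 1, 10)),  # now below threshold
    ]),
    PatientRecord("B", 61, {"E11.9", "I10"}, [LabResult("hba1c", 7.9, date(2023, 2, 2))]),
    PatientRecord("C", 16, {"E11.9"}, [LabResult("hba1c", 9.0, date(2023, 1, 20))]),
    PatientRecord("D", 47, {"I10"}),
]
cohort = [p.patient_id for p in patients if in_cohort(p)]
```

Patient A shows why the lab value matters: the diagnosis code alone would include them, but the latest result indicates they no longer meet the criterion.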

Challenges with EHR Data

While valued for its clinical context, EHR data also has some inherent limitations:

  • Incomplete or missing records if providers fail to properly document encounters
  • Incomplete records if patient receives care at multiple, unlinked healthcare networks
  • Inconsistent use of structured fields vs free text notes across systems
  • Lack of national standards in data formats, terminologies and definitions
  • Biased datasets dependent on specific health system patient population
  • Difficulty normalizing data across disparate EHR systems
  • Requires data science skills to analyze unstructured notes and documents
  • Requires clinical background to appropriately interpret unstructured notes and documents
  • More resource intensive for data extraction and processing compared to claims data

EHR data analysis requires specialized skills and infrastructure, especially to interpret unstructured data. Despite limitations, EHRs remain an invaluable data source on their own or as complements to other data sources like claims for comprehensive real-world evidence generation.
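The normalization problem can be illustrated with a small sketch that maps site-specific field names and units onto a shared representation. The site names and local field names here are invented; real projects typically map to an established common data model rather than an ad hoc dictionary.

```python
LB_TO_KG = 0.45359237  # exact by definition
IN_TO_CM = 2.54        # exact by definition

# Hypothetical mapping: (site, local field) -> (canonical field, unit factor)
FIELD_MAP = {
    ("site_a", "WT_LBS"): ("weight_kg", LB_TO_KG),
    ("site_a", "HT_IN"): ("height_cm", IN_TO_CM),
    ("site_b", "weight_kg"): ("weight_kg", 1.0),
    ("site_b", "height_cm"): ("height_cm", 1.0),
}

def normalize(site, record):
    """Translate one site's raw vitals into canonical names and units.

    Unmapped fields are returned separately instead of being silently
    dropped, so they can be reviewed and added to the mapping.
    """
    clean, unmapped = {}, {}
    for name, value in record.items():
        mapping = FIELD_MAP.get((site, name))
        if mapping is None:
            unmapped[name] = value
        else:
            canonical, factor = mapping
            clean[canonical] = round(value * factor, 1)
    return clean, unmapped

clean_a, unmapped_a = normalize("site_a", {"WT_LBS": 150, "HT_IN": 70, "BP": "120/80"})
clean_b, unmapped_b = normalize("site_b", {"weight_kg": 68.0, "height_cm": 177.8})
```

After normalization the two sites' records are directly comparable, and the unmapped blood pressure field is surfaced for follow-up.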

Integrating Claims and EHR Datasets

Given the complementary strengths of claims data and EHRs, there is significant value in integrating these datasets to conduct robust real-world studies. This can be accomplished by (Maro et al., 2019):

  • Linking claims and EHR data at the patient level via unique identifiers
  • Building cohorts based on diagnosis codes from claims data, then reviewing clinical data for each patient in the EHR
  • Using natural language processing on EHR notes to extract additional details not available in claims
  • Applying claims analysis algorithms on EHR data to identify lapses in care, adverse events, etc.
  • Incorporating prescription fills from claims with medication orders in EHRs to assess adherence
  • Using cost data from claims combined with clinical data for health economic studies

Major research networks like PCORnet have developed infrastructure to integrate claims and EHR data to support large-scale patient-centered outcomes research. When thoughtfully combined, these complementary data sources enable multifaceted real-world studies not possible using either source alone.

Claims data and EHRs both provide invaluable real-world evidence on patient populations, but have distinct strengths and limitations. Claims data allows longitudinal analysis of diagnosis, procedure and prescription patterns at scale, but lacks clinical granularity. EHRs provide rich clinical context like physician notes, lab results and images, but lack continuity across health systems and data standardization. By integrating these sources, researchers can conduct robust real-world studies leveraging the advantages of both datasets. Careful consideration of the nuances of each data type allows generation of comprehensive real-world evidence to inform healthcare decisions and improve patient outcomes.

At NashBio, we use EHR data for most of our analytic activities because of its depth and additional clinical context, which helps us build the highest fidelity study populations for our clients.


Berger, M. L., Sox, H., Willke, R. J., Brixner, D. L., Eichler, H.-G., Goettsch, W., … Schneeweiss, S. (2017). Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety, 26(9), 1033–1039.

Cowie, M. R., Blomster, J. I., Curtis, L. H., Duclaux, S., Ford, I., Fritz, F., … Zalewski, A. (2017). Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106(1), 1–9.

Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R. O., Bernstam, E. V., Lehmann, H. P., Hripcsak, G., Hartzog, T. H., Cimino, J. J., & Saltz, J. H. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30–S37.

IBM Watson Health. (2022). IBM MarketScan Databases.

Maro, J. C., Platt, R., Holmes, J. H., Stang, P. E., Steiner, J. F., & Douglas, M. P. (2019). Design of a study to evaluate the comparative effectiveness of analytical methods to identify patients with irritable bowel syndrome using administrative claims data linked to electronic medical records. Pharmacoepidemiology and Drug Safety, 28(2), 149–157.

Pivovarov, R., Albers, D. J., Hripcsak, G., Sepulveda, J. L., & Elhadad, N. (2019). Temporal trends of hemoglobin A1c testing. Journal of the American Medical Informatics Association, 26(1), 41–48.

Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clinical Pharmacology and Therapeutics, 84(3), 362–369.