U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Curr Genomics
  • v.22(4); 2021 Dec 16

Machine Learning in Healthcare

Hafsa habehh.

1 Department of Health Informatics, Rutgers University School of Health Professions, 65 Bergen Street, Newark, NJ 07107, USA

Suril Gohel

Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML) technology have brought on substantial strides in predicting and identifying health emergencies, disease populations, and disease state and immune response, amongst a few. Although, skepticism remains regarding the practical application and interpretation of results from ML-based approaches in healthcare settings, the inclusion of these approaches is increasing at a rapid pace. Here we provide a brief overview of machine learning-based approaches and learning algorithms including supervised, unsupervised, and reinforcement learning along with examples. Second, we discuss the application of ML in several healthcare fields, including radiology, genetics, electronic health records, and neuroimaging. We also briefly discuss the risks and challenges of ML application to healthcare such as system privacy and ethical concerns and provide suggestions for future applications.

1. INTRODUCTION

The application of machine learning dates back to the 1950s when Alan Turing proposed the first machine that can learn and become artificially intelligent [ 1 ]. Since its advent, machine learning has been used in various applications, ranging from security services through face detection [ 2 ] to increasing efficiency and decreasing risk in public transportation [ 3 , 4 ], and recently in various aspects of healthcare and biotechnology [ 5 - 10 ]. Artificial intelligence and machine learning have brought significant changes in business processes and have transformed day-to-day lives, and comparable transformations are anticipated in healthcare and medicine. Recent advancements in this area have displayed incredible progress and opportunity to disburden physicians and improve accuracy, prediction, and quality of care. Current machine learning advancements in healthcare have primarily served as a supportive role in a physician or analyst's ability to fulfill their roles, identify healthcare trends, and develop disease prediction models. In large medical organizations, machine learning-based approaches have also been implemented to achieve increased efficiency in the organization of electronic health records [ 11 ], identification of irregularities in the blood samples [ 5 ], organs [ 6 - 8 ], and bones [ 12 ] using medical imaging and monitoring, as well as in robot-assisted surgeries [ 9 , 13 ]. Machine learning applications have recently enabled the acceleration of testing and hospital response in the battle against COVID-19. Hospitals have been able to organize, share, and track patients, beds, rooms, ventilators, EHRs, and even staff during the pandemic using a deep learning system by GE called the Clinical Command Center [ 14 ]. Researchers have also used artificial intelligence for the identification of genetic sequences of SARS-CoV2 and the creation of vaccines as well as for their monitoring [ 15 ].

Many new developments emerge as the field of healthcare grows into the new world of technology. Artificial intelligence and machine learning-based approaches and applications are vital for the field’s progression, including increased speed of diagnosis, accuracy, and simplicity. The purpose of this review is to highlight the advantages and disadvantages of machine learning-based approaches in the healthcare industry. As the application of new machine learning technology takes the healthcare industry by storm, we aim to provide a brief overview of the various approaches to machine learning and highlight the fields where these approaches are primarily applied. We discuss their widespread use and future advancement opportunities in healthcare. We also address the ethical and logistical risks and challenges that occur with their application.

2. OVERVIEW OF ARTIFICIAL INTELLIGENCE

Although the terms machine learning, deep learning, and artificial intelligence are typically used interchangeably, they represent different sets of algorithms and learning processes. Artificial Intelligence (AI) is the umbrella term that refers to any computerized intelligence that learns and imitates human intelligence [ 16 ]. AI is most regarded for autonomous machines such as robots and self-driving cars, but it also permeates everyday applications, such as personalized advertisements and web searches. In recent years, AI development and application have made incredible strides and have been applied to many areas due to their higher levels of decision-making, accuracy, problem-solving capability, and computational skills [ 17 ]. In generally all development of AI algorithms, the data obtained is split into two groups, a training and test data set, to ensure reliable learning, representative populations, and unbiased predictions. As the name suggests, the training set is used for algorithm training that includes sets of characterizing data points (features) and corresponding predictions (in the case of supervised learning). The testing data set is new to the algorithm and is solely used to test the algorithm's abilities. This measure is taken to eliminate biases in the algorithm's testing by the training dataset [ 18 ]. Once an algorithm passes through a training and testing phase with acceptable results, the algorithms are implemented in healthcare settings. The application of AI is broad and has many applied sub-regions; here, we provide an overview of machine learning and deep learning, two of the several sub-regions of AI.

Machine learning encompasses several different algorithmic models and statistical methods to solve problems without specialized programming [ 19 ]. Several machine learning models are single-layered, therefore, large components of feature extraction and data processing are performed prior to inputting the data into the algorithm [ 20 ]. Without the extra layers, these machine learning algorithms require intense data preprocessing in order for the algorithms to determine accurate predictions and to avoid over-fitting or under-fitting the training dataset. Deep learning is a more elaborate sub- form of machine learning that utilizes layered artificial neural networks and provides increased accuracy and specificity with decreased interpretability [21]. The neuronal network method is characterized as the multilayer network that supports the connection between the artificial neurons, or units, in each layer with that of the layer before and after it [ 22 ]. These networks can learn, discern, and deduce from data on their own using these multilevel links for data processing, and the data are processed until the specialized results are achieved [21].

2.1. Types of Learning Approaches

Most of the machine learning and AI-based algorithms are built on different learning approaches. One subtype is supervised learning, which is used in training classification and prediction algorithms based on previous examples, or outputs. An important distinction for this learning technique is that the training set involves features and corresponding predictions, or outcomes. Simply put, the supervised learning approach generalizes information from the training set's features to construct a model that can correctly predict training-set outcomes and then uses the learned model to make predictions using the new features in the testing data set [ 20 ]. Decision Trees, Random Forest, Support Vector Machines, and Artificial Neural Networks are a few types of ML algorithms that implement supervised learning approaches. Decision tree algorithms form a decision support tool that begins with a single node and identifies the possible outcomes of that decision. The tree continues with the product of that decision and the following decisions until it reaches a final product [ 23 ]. Support Vector Machines (SVM) are known as classification algorithms that use supervised learning to classify features in two group problems by finding the largest margin hyperplane to separate the data and providing the best fit to organize it [ 16 , 24 ]. Artificial Neural Networks (ANNs) consist of an input layer, one or more hidden layers, and output layers, where functional unities/neurons in one layer are connected to every neuron in the layer before and after [ 25 ]. In healthcare, supervised machine learning approaches are widely implemented in disease prediction [ 26 ], identifying hospital outcomes [ 14 ], and image detection [ 27 ] to name a few.

Another subtype of AI-based learning approaches is unsupervised learning, which is typically used to evaluate data and to cluster applications. Unsupervised machine learning is usually purposeful in data analysis, stratification, and reduction rather than prediction. In general, unsupervised clustering methods use algorithms to group data that has not been classified or categorized into independent clusters. Although data preprocessing and feature extraction are done before the input in most forms of machine learning, this method allows for the extraction of features and explores possibilities of data clusters by identifying the underlying relationships or features in the data, then grouping them by their similarities [ 18 ]. Some unsupervised learning approaches include the k-Means algorithm, Deep Belief Networks, and Convolutional Neural Networks. The most common unsupervised learning algorithm is the k-Means algorithm that is used as a clustering method to identify the mean between groups within unlabeled datasets and create groups based on the mean [ 18 ]. A Deep Belief Network (DBN) is a multi-layer network consisting of intra-level connections useful for data retrieval that typically uses unsupervised learning and has many hidden layers tasked with feature detection and finding correlations in the data [ 28 , 29 ]. A Convolutional Neural Network (CNN) is a multilayer network that relies on feature recognition and identification and is useful for anomaly detection, image recognition, and identification [ 25 ]. Many unsupervised algorithms are used for clustering due to the lack of predetermined results and homogeneity in the data, and although the unsupervised methods are useful and quick, they are only semipopular in healthcare.

Reinforcement learning is another learning method that is neither supervised nor unsupervised learning. Similar to the mechanisms of conditioning in psychology, this learning depends on the sequences of rewards, and it forms a strategy for operation in a specific problem space. Reinforcement learning methods have the potential to influence their environment, are geared towards optimizing the error criterion, and have been described as the closest form of learning as seen in humans and animals [ 30 ]. Given the types of learning approaches, the selection of learning methods is relatively less complicated than the selection of algorithms and is usually dictated by the implementation purpose. A commonly used neural network that uses reinforcement learning is the Recurrent Neural Network (RNN). An RNN is one of the neural networks in which every artificial neuron is connected; the artificial neurons can receive inputs with delays in time and can reuse outputs from previous steps as input for a future step. It is useful for time series prediction, translation, speech recognition, rhythm learning, and music composition [ 25 ]. Although healthcare applications of reinforcement learning remain limited due to its need for structure, heterogeneous data, definition and implementation of rewards, and extensive computational resources, it still possesses the significant potential to bring major strides in healthcare.

Given the several types of machine learning and deep learning approaches, it is highly imperative to identify and implement an approach suitable specific to the healthcare application. Several factors, including the number of features [ 28 ], sample size [ 31 , 32 ], and data distributions [ 33 ], can have significant effects on the learning and prediction processes and should be considered.

3. AI IN HEALTHCARE

In healthcare, common machine learning advances have been evolving for years. The application of AI has the capacity to assist with case triage and diagnoses [ 26 ], enhance image scanning and segmentation [ 34 ], support decision making [ 11 ], predict the risk of disease [ 35 , 36 ], and in neuroimaging [ 37 ]. Here we provide a brief overview of current advances in AI applications to specific aspects of health science. Inclusion criteria for the applications mentioned are based on the higher availability of digital data used in the ML-based approaches and their clear implementation of learning approaches with clinical applications and experiments. In the current review, we focused on ML application to healthcare in the fields of electronic health records, medical imaging, and genetic engineering. These areas also represent healthcare’s “BIG” data, or the structured and unstructured data of the field, and have shown significant promise in relation to clinical applications.

Our search strategy is as follows: articles between June and December 2020, online libraries and journal databases including, but not limited to, and Academic OneFile, Gale, Nature, Sage Journals, Science Direct, PsycNet, and Pubmed were used. The compilation of articles and papers focused on the use of machine learning and artificial intelligence in healthcare as well as current and potential applications. Search terms included machine learning in healthcare, artificial intelligence medical imaging, BIG data and machine learning, machine learning in genomics, electronic health records, challenges of AI in healthcare, and medical applications of AI. Variations of these terms were used to ensure exhaustive search results. Searches were not limited by year or journal Table ​ 1 1 .

List of primary references.

Applied is defined as an algorithm or application that is currently available on a public or private platform to healthcare professionals. It also refers to applications that are currently applied in medical practices such as clinics, hospitals, etc . An experiment is defined as an algorithm or application that has been used in a research study. EHR: Electronic Health Records, SVM: Support Vector Machine, LSTM: Long Short-Term Memory Neural Network, CNN: Convolutional Neural Network, MLP: Multi-Layer perceptron Neural Network, RNN: Recurrent Neural network, DBN: Deep Belief Network, ANN: Artificial Neural Network, ML: Machine Learning.

3.1. Electronic Health Records

Electronic Health Records (EHRs), originally known as clinical information systems, were first introduced by Lockheed in the 1960s [ 38 ]. Since then, the systems have been reconstructed many times to create an industry-wide standard system. In 2009, the US federal government invested billions in promoting EHR implementation in all practices in an effort to improve the quality and efficiency of the work; this ultimately resulted in nearly 87 percent of office-based practices nationwide implementing EHRs in their systems by 2015 [ 39 ]. BIG data collected from EHR systems with structured feature data have been instrumental in deep learning applications, including medication refills and using patient history for predicting diagnoses [ 11 ]. This has resulted in significant improvement in data organization, accessibility, and quality of care and has helped physicians with diagnoses and treatments. The standardization of features across datasets has also allowed for increased access to health records for research purposes.

Considering the vital role that prediction plays in providing treatment, scientists have developed deep learning models for the diagnosis and prediction of clinical conditions using EHRs. In a recent research study, Liu, Zhang, and Razavian developed a deep learning algorithm using LSTM networks (reinforcement learning) and CNNs (supervised learning) to predict the onset of diseases, such as heart failure, kidney failure, and stroke. Unlike other prediction models, this algorithm used both structured data obtained from EHR and unstructured data contained in progress and diagnosis notes. As explained by Liu and colleagues, the inclusion of unstructured data within the model resulted in significant improvements in all the baseline accuracy measures, further indicating the versatility and robustness of such algorithms [ 40 ]. In another research study using deep neural network approaches, Ge and colleagues built a model to predict post-stroke pneumonia within 7 and 14-day periods. The model returned an Area under the ROC curve (AUC, a measure of model performance by combining sensitivity and specificity of a model) value of 92.8 percent for the 7-day predictions and 90.5 percent for the 14-day predictions [ 35 ], providing a highly accurate model predicting pneumonia following a stroke. In addition, several ML-based models have also been implemented to predict mortality in ICU patients. In one of such models, Ahmad and colleagues have shown great ability to predict mortality in paralytic ileus (PI, incomplete blockage of the intestine that prohibits the passage of food, eventually leading to a build-up and complete blockage of the intestines) patients using EHRs. The algorithm, named Statistically Robust Machine Learning-based Mortality Predictor (SRML-Mortality Predictor), showed an 81.30% accuracy rate in predicting mortality in PI patients [ 41 ]. Providing patients and practitioners with predicted mortality, through the use of EHR prediction algorithms, can allow them to make more educated clinical treatment decisions.

3.2. Medical Imaging

Given the digital nature of data and the presence of structured data formats such as DICOM (Digital Imaging and Communications in Medicine), medical imaging has seen significant strides with the implementation of machine learning-based approaches to several imaging modalities, including Computed Tomography (CT), Magnetic Resonance Imaging (MRI), X-Ray, Positron Emission Tomography (PET), Ultrasound, and more. Several ML-based models have been developed to identify tumors [ 42 , 43 ], lesions [ 44 ], fractures [ 45 , 46 ], and tears [ 47 , 48 ].

In a recent study, McKinney and colleagues have implemented a deep learning algorithm to detect tumors based on Mammograms in earlier stages of growth. In comparison to traditional screening techniques used to identify tumors, these deep learning-based screen techniques allow for the identification and location of tumors in earlier stages of breast cancer, allowing for a better rate of resection. In a direct comparison, the deep learning-based approach was able to outperform experienced radiologists by an AUC score of 11.5% [ 49 ]. Several other studies have also implemented ML-based approaches for breast cancer detection with variable success, including models by Wang and colleagues [ 50 ], Amrane and colleagues [ 51 ], and Ahmad and colleagues [ 52 ].

Similarly, in a recent study, Esteva and colleagues used CNN (unsupervised learning) to classify 2032 different skin diseases using dermoscopic images. An objective comparison of CNN classification with that of 21 board-certified dermatologists resulted in “on par” performance, further confirming the veracity of the results [ 7 ]. When implemented in conjunction with the average consumer mobile platform, this approach can result in ease of use and early diagnosis. In parallel, studies have also implemented ML-based approaches to quantify the progression of retinal diseases [ 51 - 54 ]. In one such study, Arcadu and colleagues applied a deep learning CNN to detect the aneurysms that cause vision loss due to the progression of Diabetic Retinopathy (DR) [ 55 ]. The CNN was also able to detect small and low contrast microaneurysms, although it was not explicitly designed to accomplish that task [55, 56]. Given that diabetic retinopathy is a common eye condition that affects around 60 percent of type 1 diabetes patients [ 57 ], it is difficult to detect in its preliminary stages. Early prediction obtained using a CNN approach has the potential to prevent and delay irreversible damage to patients' vision. X-rays have been used for decades to identify abnormalities in the chest cavity and lung disease, though an in-depth careful examination by a training radiologist is often required. In a recent study, Rajpurkar and colleagues conducted a retrospective study to explore the capacities of a 121-layer convolutional neural network to examine a collection of chest x-rays with various thoracic diseases and identify irregularities in an attempt to mimic the detection by trained radiologists [8]. In comparison, CNN's performance in the accuracy of identification observed was 81%, which was 2% higher than that of the radiologists. Although applied retrospectively, this study, along with CNNs developed by Tsai and Tao [ 58 ], Asif and colleagues [ 59 ], Liang and colleagues [ 60 ], and Lee and colleagues [ 61 ], indicates incredible support that these approaches can provide in examining and diagnosing illnesses, further reducing the burden on healthcare professionals.

ML-based approaches have also been implemented to predict and diagnose disease progression of neurodegenerative diseases, including Alzheimer's disease [ 37 , 62 ], Parkinson's disease [ 63 , 64 ], serious mental disorders including Psychosis, [ 65 , 66 ], depression [ 27 , 67 ], PTSD [ 68 ], and developmental disorders, including autism [ 69 , 70 ] and ADHD [ 71 , 72 ]. In one such study, Faturrahman and colleagues presented a higher-level model using DBNs(unsupervised learning) for predicting Alzheimer's Disease (AD) progression using structural MRI images, resulting in 91.76% accuracy, 90.59% sensitivity, and 92.96% specificity [ 37 ]. Although there is no cure for AD, early diagnosis can help implement strategies to delay the symptoms and degeneration. Using decision tree models and feature-rich data sets consisting of functional MRI, cognitive behavior scores, and age, Patel and colleagues developed a model to predict the diagnosis and treatment response for depression. The model scored 87.27% accuracy for diagnosis and 89.47% accuracy for treatment response [ 27 ]. This predictive diagnosis can help identify patients with depression and develop personalized treatment plans based on their responses. With the current ML applications in medical imaging, it is evident that its use has valuable implications for advancing the medical field due to its pronounced advantages in accuracy, classification, sensitivity, and specificity in prediction and diagnoses.

3.3. Genetic Engineering and Genomics

The discovery of the adaptive DNA system known as CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) has cultivated the field of genetic engineering [ 73 ]. This exploration of “programmable endonucleases” has simplified genetic engineering and has helped make the process of genetic modification and diagnosis easier, as well as dropping the cost of the procedure dramatically [ 74 ]. The recent application of CRISPR to Cas (CRISPR-associated protein) editing, such as Cas-9 [ 75 ] and Cas-13a [ 73 ], has changed genetic editing, though the tool is not perfect. Recently, several machine learning techniques for predicting off-target mutations in Cas9 gene editing have emerged. A new program developed by Jiecong Lin and Ka-Chun Wong has improved the quality of these machine learning predictions by using deep CNNs (AUC score: 97.2%) and deep FFs (AUC score: 97%) [ 76 ]. Considering the space for error and off-target mutations using the Cas9 tool, scientists are using Cas9 for developing activity predictors and more reliable Cas9 variants to reduce error. These models include higher accuracy and fidelity Cas9 variants [ 77 - 79 ], hyper-accurate Cas9 variants [ 80 ], and guide RNA design tools using deep learning [ 81 - 85 ].

Outside of CRISPR gene editing, O'Brien and colleagues have developed a service to provide efficiency in nucleotide editing using random forest algorithms (supervised learning) to investigate how different nucleotide compositions influence the HDR (homology-directed repair) efficiency [ 86 ]. They developed the Computational Universal Nucleotide Editor (CUNE), used to find the most efficient method to identify a precise location to enter a specific point mutation and predict HDR efficiency. Additionally, Pan and colleagues have developed a model for prediction in gene editing named ToxDL that uses a CNN approach to predict protein toxicity in-vivo using only the sequence data [ 87 ]. Another branch of genetic engineering, pharmacogenomics, has also made significant strides in the use of AI and machine learning to determine stable doses of medications that have become popular [ 88 - 90 ]. In one such study, Tang and colleagues implemented an ML-based approach to determine a stable Tacrolimus dose (the immunosuppressive drug) for patients who received a renal transplant to reduce the risk of acute rejection [ 10 ]. The use of machine learning in pharmacogenomics has recently been applied in psychiatry [ 90 ], oncology [ 91 ], bariatrics [ 92 ], and neurology [ 93 ].

Machine learning applications of genetic engineering have also been instrumental in the fight against COVID 19. In a recent study, Malone and colleagues utilized software based on machine learning algorithms to “predict which antigens have the required features of HLA-binding, processing, presentation to the cell surface, and the potential to be recognized by T cells to be good clinical targets for immunotherapy” [ 15 ]. The use of immunogenicity predictions from this software, along with the presentation of antigen to infected host-cells, allowed the team to successfully profile the “entire SARS-CoV2 proteome” as well as epitope hotspots. These discoveries help predict blueprints for designing universal vaccines against the virus that can be adapted across the global population.

4. RISKS AND CHALLENGES

While machine learning-based applications in healthcare present unique and progressive opportunities, they also raise unique risk factors, challenges, and healthy skepticism. Here we discuss the main risk factors including the probability of error in prediction and its impact, the vulnerability of the systems' protection and privacy, and even the lack of data availability to obtain reproducible results. Some of the challenges include ethical concerns, loss of the personal element of healthcare, and the interpretability and practical application of the approaches to bedside setting.

One of the most important risks of machine learning-based algorithms is the reliance on the probabilistic distribution and the probability of error in diagnosis and prediction. This also gives rise to a healthy skepticism related to the validity and veracity of predictions from ML-based approaches. Even though the probability of error and reliance on probability is deep-rooted in the various aspects of health care, the implications of ML-based approaches resulting in a human fatality are severe. One solution is to subject these machine learning-based approaches to strict institutional and legal approval by several organizations before their application [ 94 , 95 ]. Another approach that can be implemented is the human intervention and oversight from an experienced healthcare worker in highly sensitive applications to avoid false-positive or false-negative diagnoses ( e.g. , diagnosis of depression or breast cancer). The inclusion of present healthcare professionals in developing and implementing these approaches may increase adaption rates and decrease concerns related to fewer employment opportunities for humans or the shrinking of the workforce [ 96 ].

Another risk associated with the application of ML and deep learning algorithms to health care is the availability of high-quality training and testing data with large enough sample sizes to ensure high reliability and reproducibility of the predictions. Given that the ML and deep learning-based approaches 'learn' from data, the importance of quality data cannot be stressed enough. In addition, the large amounts of feature-rich data required for these learning networks and approaches are not readily available and may also represent a narrow distribution of the population sample. Moreover, in several healthcare segments, data collected are incomplete, heterogeneous, and have a significantly higher number of features than the number of samples. These challenges should be taken into great consideration when developing and interpreting the results of ML-based approaches. The open science and recent push towards research data sharing may assist in overcoming such challenges. One should also consider the risk associated with privacy as well as ethical implications of the application of ML-based approaches to healthcare. With the understanding that these approaches require large-scale, easily expandable data storage, and significantly high computing power, several ML-based approaches are developed and implemented using cloud-based technologies. Given the sensitive nature of healthcare data along with privacy concerns, increased data security and accountability should be one of the first aspects to be considered well before model development.

With respect to ethical concerns, researchers working on applying ML-based approaches to healthcare can readily learn from the field of genetic engineering which has undergone extensive ethical debate. The controversy surrounding the use of genetic engineering to create long-lasting genetic advancements and treatments is a continuous discourse. Identification and editing of injurious genetic mutations, such as the HTT mutation that causes Huntington’s disease, may provide life-altering treatment for harmful diseases [ 97 ]. Contrarily, creating treatments that alter the individual’s genome, as well as that of their offspring, while it is still inaccessible due to costs, may worsen the socio-economic divide for populations that are unable to afford such care [ 98 ]. Recently, there has been an emergence of guidelines for the development of AI machinery. In 2019, Singapore proposed a Model Artificial Intelligence Governance Framework to guide private sector organizations on developing and using AI ethically [ 99 ]. The US Administration has also released an executive order to regulate AI development and “maintain American leadership in artificial intelligence” [ 100 ]. These guidelines and regulations, though strict, have been put forth to ensure ethical research conduct and development.

An important challenge with ML application to healthcare is associated with the interpretation and clinical applicability of the results. Given the complex structure of ML-based approaches, especially deep learning-based methods, it becomes incredibly complex to distinguish and identify the original features' contribution towards the prediction. Although this may not present significant concern in other applications of ML (such as web searches), lack of transparency has created a huge barrier for the adaptability of ML-based approaches in healthcare. As clearly understood in healthcare, the solution strategy is as important as the solution itself. There must be a systematic shift towards identifying and quantifying underlying data features used for prediction. The involvement of physicians and healthcare professionals in the development, implementation, and testing of ML-based approaches may also help improve the adoption rates. Additionally, although there is healthy skepticism related to the potential of a decreased personal relationship between a patient and PCP due to increased implementation of ML-based approaches, they represent a unique opportunity to increase engagement. Studies have shown that the physician-patient relationship has already become a fading concept, and nearly 25 percent of Americans do not have a PCP [ 101 ]. Here, ML can provide unique opportunities to increase engagement where patients discuss the results of potential diagnoses and increase the efficiency of outreach programs. Early prognosis due to ML-based approaches may also help patients develop a healthy lifestyle in consultations with their PCPs. Finally, a physician-focused survey found that 56 percent of physicians were spending 16 minutes or less with their patients, and 5 percent of them spent less than 9 minutes [ 102 ]. The application of AI approaches in diagnoses and symptom monitoring can ease stress and give physicians more personal time with their patients, thus improving patient satisfaction and outcomes.

While the overview demonstrates how much progress has been achieved with machine learning, there continues to be potential for widescale advancement in the future. Many of the current machine learning advancements in healthcare aim to support the physician’s or specialist's ability to provide a more effective treatment to patients with increased quality, speed, and precision. The challenges of developing ML algorithms can be solved by developing and implementing improvements in data collection, storage, and dissemination or by creating algorithms to process unstructured data to address the lack of data availability. Future applications can also bring forth inexpensive forms of medical imaging and affordable medical examinations, potentially ending health disparities and creating more accessible services for countries and lower-income populations. Scientists expect advancement in the prediction of personalized drug response, optimization of medication selection and dosage, and an application of genetic modification to provide treatment for genetic disorders and mutations [103]. With its application, ML can augment the role of physicians and redefine patient care. While the risks and challenges of the future application are addressed and corrected, the current ML algorithms can provide an excellent framework for future advancements and applications of ML in healthcare.

ACKNOWLEDGEMENTS

Declared none.

CONSENT FOR PUBLICATION

Not applicable.

This work was partly supported by NIH/NCATS UL1TR003017. This work is solely the responsibility of the authors and does not necessarily represent the official views of the NIH, NCATS.

CONFLICT OF INTEREST

The authors declare no conflict of interest, financial or otherwise.

Machine Learning in Healthcare

Affiliation.

  • 1 Department of Health Informatics, Rutgers University School of Health Professions, 65 Bergen Street, Newark, NJ 07107, USA.
  • PMID: 35273459
  • PMCID: PMC8822225
  • DOI: 10.2174/1389202922666210705124359

Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML) technology have brought on substantial strides in predicting and identifying health emergencies, disease populations, and disease state and immune response, amongst a few. Although, skepticism remains regarding the practical application and interpretation of results from ML-based approaches in healthcare settings, the inclusion of these approaches is increasing at a rapid pace. Here we provide a brief overview of machine learning-based approaches and learning algorithms including supervised, unsupervised, and reinforcement learning along with examples. Second, we discuss the application of ML in several healthcare fields, including radiology, genetics, electronic health records, and neuroimaging. We also briefly discuss the risks and challenges of ML application to healthcare such as system privacy and ethical concerns and provide suggestions for future applications.

Keywords: EHR; Machine learning; artificial intelligence; genomics; healthcare; support vector machine.

© 2021 Bentham Science Publishers.

Publication types

  • Open access
  • Published: 16 August 2021

The role of machine learning in clinical research: transforming the future of evidence generation

  • E. Hope Weissler   ORCID: orcid.org/0000-0002-8442-6150 1 ,
  • Tristan Naumann 2 ,
  • Tomas Andersson 3 ,
  • Rajesh Ranganath 4 ,
  • Olivier Elemento 5 ,
  • Yuan Luo 6 ,
  • Daniel F. Freitag 7 ,
  • James Benoit 8 ,
  • Michael C. Hughes 9 ,
  • Faisal Khan 3 ,
  • Paul Slater 10 ,
  • Khader Shameer 3 ,
  • Matthew Roe 11 ,
  • Emmette Hutchison 3 ,
  • Scott H. Kollins 1 ,
  • Uli Broedl 12 ,
  • Zhaoling Meng 13 ,
  • Jennifer L. Wong 14 ,
  • Lesley Curtis 1 ,
  • Erich Huang 1 , 15 &
  • Marzyeh Ghassemi 16 , 17 , 18 , 19  

Trials volume  22 , Article number:  537 ( 2021 ) Cite this article

41k Accesses

69 Citations

63 Altmetric

Metrics details

A Correction to this article was published on 06 September 2021

This article has been updated

Interest in the application of machine learning (ML) to the design, conduct, and analysis of clinical trials has grown, but the evidence base for such applications has not been surveyed. This manuscript reviews the proceedings of a multi-stakeholder conference to discuss the current and future state of ML for clinical research. Key areas of clinical trial methodology in which ML holds particular promise and priority areas for further investigation are presented alongside a narrative review of evidence supporting the use of ML across the clinical trial spectrum.

Conference attendees included stakeholders, such as biomedical and ML researchers, representatives from the US Food and Drug Administration (FDA), artificial intelligence technology and data analytics companies, non-profit organizations, patient advocacy groups, and pharmaceutical companies. ML contributions to clinical research were highlighted in the pre-trial phase, cohort selection and participant management, and data collection and analysis. A particular focus was paid to the operational and philosophical barriers to ML in clinical research. Peer-reviewed evidence was noted to be lacking in several areas.

Conclusions

ML holds great promise for improving the efficiency and quality of clinical research, but substantial barriers remain, the surmounting of which will require addressing significant gaps in evidence.

Peer Review reports

Interest in machine learning (ML) for healthcare has increased rapidly over the last 10 years. Though the academic discipline of ML has existed since the mid-twentieth century, improved computing resources, data availability, novel methods, and increasingly diverse technical talent have accelerated the application of ML to healthcare. Much of this attention has focused on applications of ML in healthcare delivery ; however, applications of ML that facilitate clinical research are less frequently discussed in the academic and lay press (Fig. 1 ). Clinical research is a wide-ranging field, with preclinical investigation and observational analyses leading to traditional trials and trials with pragmatic elements, which in turn spur clinical registries and further implementation work. While indispensable to improving healthcare and outcomes, clinical research as currently conducted is complex, labor intensive, expensive, and may be prone to unexpected errors and biases that can, at times, threaten its successful application, implementation, and acceptance.

figure 1

The number of clinical practice–related publications was determined by searching “(“machine learning” or “artificial intelligence”) and (“healthcare”).” The number of healthcare-related publications was determined by searching “(“machine learning” or “artificial intelligence”) and (“healthcare”)”, and the number of clinical research–related publications was determined by searching “(“machine learning” or “artificial intelligence”) and (“clinical research”).”

Machine learning has the potential to help improve the success, generalizability, patient-centeredness, and efficiency of clinical trials. Various ML approaches are available for managing large and heterogeneous sources of data, identifying intricate and occult patterns, and predicting complex outcomes. As a result, ML has value to add across the spectrum of clinical trials, from preclinical drug discovery to pre-trial planning through study execution to data management and analysis (Fig. 2 ). Despite the relative lack of academic and lay publications focused on ML-enabled clinical research (vìs-a-vìs the attention to ML in care delivery), the profusion of established and start-up companies devoting significant resources to the area indicates a high level of interest in, and burgeoning attempts to make use of, ML application to clinical research, and specifically clinical trials.

figure 2

Areas of machine learning contribution to clinical research. Machine learning has the potential to contribute to clinical research through increasing the power and efficiency of pre-trial basic/translational research and enhancing the planning, conduct, and analysis of clinical trials

Key ML terms and principles may be found in Table 1 . Many of the ML applications discussed in this article rely on deep neural networks, a subtype of ML in which interactions between multiple (sometimes many) hidden layers of the mathematical model enable complex, high-dimensional tasks, such as natural language processing, optical character recognition, and unsupervised learning. In January 2020, a diverse group of stakeholders, including leading biomedical and ML researchers, along with representatives from the US Food and Drug Administration (FDA), artificial intelligence technology and data analytics companies, non-profit organizations, patient advocacy groups, and pharmaceutical companies convened in Washington, DC, to discuss the role of ML in clinical research. In the setting of relatively scarce published data about ML application to clinical research, the attendees at this meeting offered significant personal, institutional, corporate, and regulatory experience pertaining to ML for clinical research. Attendees gave presentations in their areas of expertise, and effort was made to invite talks covering the entire spectrum of clinical research with presenters from multiple stakeholder groups for each topic. Subjects about which presentations were elicited in advance were intentionally broad and included current and planned applications of ML to clinical research, guidelines for the successful integration of ML into clinical research, and approaches to overcoming the barriers to implementation. Regular discussion periods generated additional areas of interest and concern and were moderated jointly by experts in ML, clinical research, and patient care. During the discussion periods, attendees focused on current issues in ML, including data biases, logistics of prospective validation, and the ethical issues associated with machines making decisions in a research context. This article provides a summary of the conference proceedings, outlining ways in which ML is currently being used for various clinical research applications in addition to possible future opportunities. It was generated through a collaborative writing process in which drafts were iterated through continued debate about unresolved issues from the conference itself. For many of the topics covered, no consensus about best practices was reached, and a diversity of opinions is conveyed in those instances. This article also serves as a call for collaboration between clinical researchers, ML experts, and other stakeholders from academia and industry in order to overcome the significant remaining barriers to its use, helping ML in clinical research to best serve all stakeholders.

The role of ML in preclinical drug discovery and development research

Successful clinical trials require significant preclinical investigation and planning, during which promising candidate molecules and targets are identified and the investigational strategy to achieve regulatory approval is defined. Missteps in this phase can delay the identification of promising drugs or doom clinical trials to eventual failure. ML can help researchers leverage previous and ongoing research to decrease the inefficiencies of the preclinical process.

Drug target identification, candidate molecule generation, and mechanism elucidation

ML can streamline the process and increase the success of drug target identification and candidate molecule generation through synthesis of massive amounts of existing research, elucidation of drug mechanisms, and predictive modeling of protein structures and future drug target interactions [ 1 ]. Fauqueur et al. demonstrated the ability to identify specific types of gene-disease relationships from large databases even when relevant data-points were sparse [ 2 ], while Jia et al. were able to extract drug-gene-mutation interactions from the text of scientific manuscripts [ 3 ]. This work, along with other efforts to render extremely large amounts of biomedical data interpretable by humans [ 4 , 5 ], helps researchers leverage and avoid duplicating prior work in order to target more promising avenues for further investigation. Once promising areas of investigation have been identified, ML also has a role to play in the generation of possible candidate molecules, for instance through use of a gated graph neural network to optimize molecules within the constraints of a target biological system [ 6 ]. In situations in which a drug candidate performs differently in vivo than expected, ML can synthesize and analyze enormous amounts of data to better elucidate the drug’s mechanism, as Madhukar et al. showed by applying a Bayesian ML approach to an anti-cancer compound [ 7 ]. This type of work helps increase the chance that drugs are tested in populations most likely to benefit from them. In the case of the drug evaluated by Madhukar et al., a better understanding of its mechanism facilitated new clinical trials in a cancer type (pheochromocytoma) more likely to respond to the drug (rather than prostate and endometrial cancers, among others).

Interpretation of large amounts of highly dimensional data generated during in vitro translational research (including benchtop biological, chemical, and biochemical investigation) informs the choice of certain next steps over others, but this process of interpretation and integration is complex and prone to bias and error. Aspuru-Guzik has led several successful efforts to use experimental output as input for autonomous ML-powered laboratories, integrating ML into the planning, interpretation, and synthesis phases of drug development [ 8 , 9 ]. More recently, products of ML-enabled drug development have approached human testing. For example, an obsessive-compulsive personality disorder drug purportedly developed using AI-based methods is scheduled to begin phase I trials this year. The lay press reports that the drug was selected from among only 250 candidates and developed in only 12 months compared with the 2000+ candidates and nearly five years of development more typically required [ 10 ]. However, due to the lack of peer-reviewed publications about the development of this drug, the details of its development cannot be confirmed or leveraged for future work.

Clinical study protocol optimization

As therapeutic compounds approach human trials, ML has a role to play in maximizing the success and efficiency of trials during the planning phase through application of simulation techniques to large amounts of data from prior trials in order to facilitate trial protocol development. For instance, study simulation may optimize the choice of treatment regimens for testing, as shown in a reinforcement learning approaches to Alzheimer’s disease and to non-small cell lung cancer [ 11 , 12 ]. A start-up company called Trials.AI allows investigators to upload protocols and uses natural language processing to identify potential pitfalls and barriers to successful trial completion (such as inclusion/exclusion criteria or outcome measures) [ 13 ]. Unfortunately, performance of these example models has not been evaluated in a peer-reviewed manner, and they therefore offer only conceptual promise that ML in research planning can help ensure that a given trial design is optimally suited to the stakeholders’ needs.

In summary, there are clear opportunities to use ML to improve the efficiency and yield of preclinical investigation and clinical trial planning. However, most peer-reviewed reports of ML use in this capacity focus on preclinical research and development rather than clinical trial planning. This may be due to the greater availability of suitable large, highly dimensional datasets in translational settings in addition to greater potential costs, risks, and regulatory hurdles associated with ML use in clinical trial settings. Peer-reviewed evidence of ML application to clinical trial planning is needed in order to overcome these hurdles.

The role of ML in clinical trial participant management

Clinical trial participant management includes the selection of target patient populations, patient recruiting, and participant retention. Unfortunately, despite significant resources generally being devoted to participant management, including time, planning, and trial coordinator effort, patient drop-out and non-adherence often cause studies to exceed allowable time or cost or fail to produce useable data. In fact, it has been estimated that between 33.6 and 52.4% of phase 1–3 clinical trials that support drug development fail to proceed to the next trial phase, leading to a 13.8% overall chance that a drug tested in phase I reaches approval [ 14 ]. ML approaches can facilitate more efficient and fair participant identification, recruitment, and retention.

Selection of patient populations for investigation

Improved selection of specific patient populations for trials may decrease the sample size required to observe a significant effect. Put another way, improvements to patient population selection may decrease the number of patients exposed to interventions from which they are unlikely to derive benefit. This area remains challenging as prior work has discovered that for every 1 intended response, there are 3 to 24 non-responders for the top medications, resulting in a large number of patients who receive harmful side effects over the intended effect [ 15 ]. In addition to facilitating patient population selection through the rapid analysis of large databases of prior research (as discussed above), unsupervised ML of patient populations can identify patterns in patient features that can be used to select patient phenotypes that are most likely to benefit from the proposed drug or intervention [ 16 ]. Unstructured data is critical to phenotyping and identifying representative cohorts, indicating that considering additional data for patients is a crucial step toward identifying robust, representative cohorts [ 17 ]. For example, unsupervised learning of electronic health record (EHR) and genetic data from 11,210 patients elucidated three different subtypes of diabetes mellitus type II with distinct phenotypic expressions, each of which may have a different need for and response to a candidate therapy [ 18 ]. Bullfrog AI is a start-up that has sought to capitalize on the promise of targeted patient population selection, analyzing clinical trial data sets “to predict which patients will respond to a particular therapy in development, thereby improving inclusion/exclusion criteria and ensuring primary study outcomes are achieved” [ 19 ]. Though appealing in principle, this unsupported claim conflates outcome prediction (which is unlikely to succeed and runs counter to the intent of clinical research) with cohort selection (which would ideally identify patients on the basis of therapeutically relevant subtypes). Successfully identifying more selective patient populations does carry potential pitfalls: first, trials may be less likely to generate important negative data about subgroups that would not benefit from the intervention; and second, trials may miss subgroups who would have benefitted from the intervention, but whom the ML model missed. These potential pitfalls may be more likely to affect rural, remote, or underserved patient subgroups with more limited healthcare interactions. These two pitfalls carry possible implications for drug/device development regulatory approval and commercialization, as pivotal trials in more highly selected, and less representative, patient subgroups may require balancing the benefits of greater trial success with the drawbacks of more limited indications for drug/device use.

Participant identification and recruitment

Once the specific cohort has been selected, natural language processing (NLP) has shown promise in identification of patients matching the desired phenotype, which is otherwise a labor-intensive process. For instance, a cross-modal inference learning model algorithm jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space, matching patients to trials using EHR data in a significantly more efficient manner than other machine learning approaches [ 20 ]. Some commercial entities offer similar services, including Mendel.AI and Deep6AI, though peer-reviewed evidence of their development and performance metrics is unavailable, raising questions about how these approaches perform [ 21 , 22 ]. A potential opportunity of this approach is that it allows trialists to avoid relying on the completeness of structured data fields for participant identification, which has been shown to significantly bias trial cohorts [ 23 , 24 ]. Unfortunately, to the extent that novel ML approaches to patient identification rely on EHRs, biases in the EHR data may affect the algorithms’ performances, leading to replacement of one source of bias (underlying the completeness of structured data) with another (underlying the generation of EHR documentation).

Participant retention, monitoring, and protocol adherence

Two broad approaches are available to improve participant retention and protocol adherence using ML-assisted methods. The first is to use ML to collect and analyze large amounts of data to identify and intervene upon participants at high risk of study non-compliance. The second approach is to use ML to decrease participant study burden and thereby improve participants’ experiences.

AiCure is a commercial entity focused on protocol adherence using facial recognition technology to ensure patients take the assigned medication. AiCure was demonstrated to be more effective than a modified directly observed therapy strategy at detecting and improving patient adherence in both a schizophrenia trial and an anticoagulation trial among patients with a history of recent stroke [ 25 , 26 ]. Unfortunately, AiCure’s model development and validation process has not been published, heightening concerns that it may perform differently in different patient subgroups, as has been demonstrated in other areas of computer vision [ 27 ]. Furthermore, these approaches, though promising, may encounter a potential barrier to implementation because their perceived invasiveness of privacy may not be acceptable to all research participants and because selecting patients with access to and comfort with the necessary devices and technology may introduce bias.

The other approach to improving participant retention uses ML to reduce the trial burden for participants using passive data collection techniques (methods will be discussed further in the “Data collection and management” section) and by extracting more information from available data generated during clinical practice and/or by study activities. Information created during routine clinical care can be processed using ML methods to yield data for investigational purposes. For instance, generative adversarial network modeling of slides stained with hematoxylin and eosin in the standard clinical fashion can detect which patients require more intensive and expensive multiplexed imaging, rather than subjecting all participants to that added burden [ 28 ]. NLP can also facilitate repurposing of clinical documentation for study use, such as auto-populating study case report forms, often through reliance on the Unified Medical Language System [ 29 , 30 ]. Patients also create valuable content outside of the clinical trial context that ML can process into study data to reduce the burden of data collection for trial participants, such as natural language processing of social media posts to identify serious drug reactions with high fidelity [ 31 ]. Patient data from wearable devices have proven to be able to correlate participant activity with the International Parkinson and Movement Disorders Society Unified Parkinson’s Disease Rating Scale, distinguish between neuropsychiatric symptomatology patterns, and identify patient falls [ 32 , 33 , 34 ].

In summary, although ML and NLP have shown promise across a broad range of activities related to improving the management of participants in clinical trials, the implications of these applications of ML/NLP in regard to clinical trial quality and participant experience are unclear. Studies comparing different approaches to participant management are a necessary next step toward identifying best practices.

Data collection and management

The use of ML in clinical trials can change the data collection, management, and analysis techniques required. However, ML methods can help address some of the difficulties associated with missing data and collecting real-world data.

Collection, processing, and management of data from wearable and other smart devices

Patient-generated health data from wearable and other mobile/electronic devices can supplement or even replace study visits and their associated traditional data collection in certain situations. Wearables and other devices may enable the validation and use of new, patient-centered biomarkers. Developing new “digital biomarkers” from the data collected by a mobile device’s various sensors (such as cameras, audio recorders, accelerometers, and photoplethysmograms) often requires ML processing to derive actionable insights because the data yielded from these devices can be sparse as well as variable in quality, availability, and synchronicity. Using the relatively large and complex data yielded by wearables and other devices for research purposes therefore requires specialized data collection, storage, validation, and analysis techniques [ 34 , 35 , 36 , 37 ]. For instance, a deep neural network was used to process input from a mobile single-lead electrocardiogram platform [ 38 ], a random forest model was used to process audio output from patients with Parkinson’s disease [ 39 ], and a recurrent neural network was used to process accelerometer data from patients with atopic dermatitis [ 40 ]. These novel digital biomarkers may facilitate the efficient conduct and patient-centeredness of clinical trials, but this approach carries potential pitfalls. As has been shown to occur with an electrocardiogram classification model, ML processing of wearable sensor output to derive research endpoints introduces the possibility of corrupt results if the ML model is subverted by intentionally or unintentionally modified sensor data (though this risk exists with any data regardless of processing technique) [ 41 ]. Because of the complexity involved, software intended to diagnose, monitor, or treat medical conditions is regulated by the FDA, and the FDA has processes and guidance related to biomarker validation and qualification for use in regulatory trials.

Beyond the development of novel digital biomarkers, other device-related opportunities in patient centricity include the ability to export data and analytics back to participants to facilitate education and insight. Barriers to implementation of ML processing of device data include better defining how previously validated clinical endpoints and patient-centric digital biomarkers overlap as well as understanding participant opinions about privacy in relation to the sharing and use of device data. FDA approval of novel biomarkers will also be required. Researchers interested in leveraging the power of these devices must explain to patients their risks and benefits both for ethical and privacy-related reasons and because implementation without addressing participant concerns has the potential to worsen participant recruitment and retention [ 42 ].

Study data collection, verification, and surveillance

An appealing application of ML, specifically NLP, to study data management is to automate data collection into case report forms, decreasing the time, expense, and potential for error associated with human data extraction, whether in prospective trials or retrospective reviews. Though this use requires overcoming variable data structures and provenances, it has shown early promise in cancer [ 43 , 44 ], epilepsy [ 30 ], and depression [ 45 ], among other areas [ 29 ]. Regardless of how data have been collected, ML can power risk-based monitoring approaches to clinical trial surveillance, enabling the prevention and/or early detection of site failure, fraud, and data inconsistencies or incompleteness that may delay database lock and subsequent analysis. For instance, even when humans collect data into case report forms (often transmitted in PDF form), the adequacy of the collected data for outcome ascertainment can be assessed by combining optical character recognition with NLP [ 46 ]. Suspicious data patterns in clinical trials, or incorrect data in observational studies, can be identified by applying auto-encoders to distinguish plausible from implausible data [ 47 ].

Endpoint identification, adjudication, and detection of safety signals

ML can also be applied to data processing. Semi-automated endpoint identification and adjudication offers the potential to reduce time, cost, and complexity compared with the current approach of manual adjudication of events by a committee of clinicians, because while endpoint adjudication has traditionally been a labor-intensive process, sorting and classifying events lies well within the capabilities of ML. For instance, IQVIA Inc. has described the ability to automatically process some adverse events related to drug therapies using a combination of optical character recognition and NLP, though this technique has not been described in peer-reviewed publications [ 48 ]. A potential barrier to implementation of semi-automated event adjudication is that endpoint definitions and the data required to support them often change from trial to trial, which theoretically requires re-training a classification model for each new trial (which is not a viable approach). More recently, efforts have been made to standardize outcomes in the field of cardiovascular research, though not all trials adhere to these outcomes. Trial data have not been pooled to facilitate model training for cardiovascular endpoints, and most fields have not yet undertaken similar efforts [ 49 ]. Further efforts in this area will require true consensus about event definitions, use of consensus definitions, and a willingness of stakeholders to share adequate data for model training from across multiple trials.

Approaches to missing data

ML can be used in several different ways to address the problem of missing data, across multiple causes for data missingness, data-related assumptions and goals, and data collection and intended analytic methods. Possible goals may be to impute specific estimates of the missing covariate values directly or to average over many possible values from some learned distribution to compute other quantities of interest. While the latest methods are evolving and more systematic comparisons are needed, some early evidence suggests more complex ML methods may not always be of benefit over simpler imputation methods, such as population mean imputation [ 50 ]. Applications of missing value techniques include analysis of sparse datasets, such as registries, EHR data, ergonomic data, and data from wearable devices [ 51 , 52 , 53 , 54 ]. Although these techniques can help mitigate the negative effects of data missingness or scarcity, over-reliance on data augmentation methods may lead to the development of models with limited applicability to new, imperfect datasets. Therefore, a more meaningful approach would be to apply ML to improve data collection during the conduct of research itself.

Data analysis

Data collected in clinical trials, registries, and clinical practices are fertile sources for hypothesis generation, risk modeling, and counterfactual simulation, and ML is well suited for these efforts. For instance, unsupervised learning can identify phenotypic clusters in real-world data that can be further explored in clinical trials [ 55 , 56 ]. Furthermore, ML can potentially improve the ubiquitous practice of secondary trial analyses by more powerfully identifying treatment heterogeneity while still providing some protection (although incomplete) against false-positive discoveries, uncovering more promising avenues for future study [ 57 , 58 ]. Additionally, ML is effectively used to generate risk predictions in retrospective datasets that can subsequently be prospectively validated. For instance, using a random forest model in COMPANION trial data, researchers were able to improve discrimination between patients who would do better or worse following cardiac resynchronization therapy compared with a multivariable logistic regression [ 59 ]. This demonstrates the ability of random forests to model interactions between features that are not captured by simpler models.

While predictive modeling is an important and necessary task, the derivation of real-world evidence from real-world data (i.e., making causal inferences) remains a highly sought-after (and very difficult) goal toward which ML offers some promise. Proposed techniques include optimal discriminant analysis, targeted maximum likelihood estimation, and ML-powered propensity score weighting [ 60 , 61 , 62 , 63 , 64 ]. A particularly intriguing technique involves use of ML to enable counterfactual policy estimation, in which existing data can be used to make predictions about outcomes under circumstances that do not yet, or could not, exist [ 65 ]. For instance, trees of predictors can offer survival estimates for heart failure patients under the conditions of receiving or not receiving a heart transplant and reinforcement learning suggests improved treatment policies on the basis of prior sub-optimal treatments and outcomes [ 66 , 67 ]. Unfortunately, major barriers to implementation are a lack of interoperability between EHR data structures and fraught data sharing agreements that limit the amount of data available for model training [ 68 ].

In summary, there are many effective ML approaches to clinical trial data management, processing, and analysis but fewer techniques for improving the quality of data as they are generated and collected. As data availability and quality are the foundations of ML approaches, the conduct of high-quality trials remains of utmost importance to enable higher-level ML processing.

Barriers to the integration of ML techniques in clinical research

Both operational and philosophical barriers limit the harnessing of the full potential of ML for clinical research. ML in clinical research is a high-risk proposition due to the potential to propagate errors or biases through multiple research contexts and into the corpus of biomedical evidence due to the use of flawed models; however, as previously discussed, ML offers promising ways to improve the quality and efficiency of clinical research for patients and other stakeholders. Both the operational and philosophical barriers to ML integration require attention at each stage of model development and use to overcome hurdles while maximizing stakeholder confidence in the process and its results. Operational barriers to ML integration in clinical research can aggravate and reinforce philosophical concerns if not managed in a robust and transparent manner. For instance, inadequate training data and poor model calibration can lead to racial bias in model application, such as has been noted in ML for melanoma identification [ 27 ]. Stakeholders, including regulatory agencies, funding sources, researchers, participants, and industry partners, must collaborate to fully integrate ML into clinical research. The wider ML community espouses “FAT (fairness, accountability, and transparency) ML” principles that also include responsibility, explainability, accuracy, auditability, and fairness and that should be applied to ML in clinical research, as discussed further.

Operational barriers to ML in clinical research

The development of ML algorithms and their deployment for clinical research use is a multi-stage, multi-disciplinary process. The first step is to assemble a team with the clinical and ML domain expertise necessary for success. Failing to assemble such a team and to communicate openly within the team increases the risks of either developing a model that distorts clinical reality or using an ML technique that is inappropriate to the available data and research question at hand [ 69 ]. For instance, a model to predict mortality created without any clinical team members may identify intubation as predictive of mortality, which is certainly true but likely clinically useless. Collaboration is necessary and valuable for both the data science and clinical science components of the team but may require additional up-front, cross-disciplinary training, transparency, and trust to fully operationalize.

The choice and availability of data for algorithm development and validation is both a stubborn and highly significant barrier to ML integration into clinical research, though its full discussion is outside the scope of this manuscript. Many recent ML models, especially deep neural networks, require large amounts of data to train and validate. To ensure generalizability beyond the training data set, developers should use multiple data sources during this process because a number of documented cases demonstrated that algorithms performed significantly differently in validation data sets compared with training data sets [ 70 ]. Because data used in clinical research are often patient related and generated by institutions (in the case of EHR data) or companies (in the case of clinical trial data) at a significant cost, owners of data may be reluctant to share. Even when they are willing to share data, variation in data collection and storage techniques can hamper interoperability. Large datasets, such as MIMIC, eICU, and the UK Biobank, are good resources when other real-world data cannot be obtained [ 71 , 72 , 73 ], but any single data source is inadequate to yield a model that is ready for use, especially because training on retrospective data (such as MIMIC and UK Biobank) does not always translate well to prospective applications. For example, Nestor et al. demonstrated the importance of considering year of care in MIMIC due to temporal drift, and Gong et al. demonstrated methods for feature aggregation across large temporal changes, such as EHR transitions [ 70 , 74 ]. Furthermore, certain disease states and patient types are less likely to be well represented in data generated for the purpose of clinical care. For example, while MIMIC is widely used because of its public availability, models trained on its ICU population are unlikely to generalize to many applications outside critical care. These issues with data availability and quality are intimately associated with problems surrounding reproducibility and replicability [ 75 ], which are more difficult to achieve in ML-driven clinical research for a number of reasons in addition to data availability, including the role of randomness in many ML techniques and the computational expense of model replication. The ongoing difficulties with reproducibility and replicability of ML-driven clinical research threaten to undermine stakeholder confidence in ML integration into clinical research.

Philosophical barriers to ML in clinical research

Explainability refers to the concept that the processes underlying algorithmic output should be explainable to algorithm users in terms they understand. A large amount of research has been devoted to techniques to accomplish this, including attention scores and saliency maps, but concerns about the performance and suitability of these techniques persist [ 76 , 77 , 78 , 79 ]. Though an appealing principle, a significant debate exists about whether the concept of explainability interferes unnecessarily with the ability of ML to positively contribute to clinical care and research. Explainability may lead researchers to incorrectly trust fundamentally flawed models. Proponents of this argument instead champion trustworthiness . Advocates of trustworthiness are of the opinion that many aspects of clinical medicine (and of clinical research)—such as laboratory assays, the complete mechanisms of certain medications, and statistical tests—that are not well or widely understood continue to be used because they have been shown to work reliably and well, even if how or why remains opaque to many end users [ 80 ]. This philosophical barrier has more recently become an operational barrier as well with the passage of the European Union’s General Data Protection Regulation, which requires that automated decision-making algorithms provide “meaningful information about the logic involved.”

Part of the focus on explainability and trustworthiness is due to a desire to understand whether ML algorithms are introducing bias into model output, as was notably shown to be the case in a highly publicized series of ProPublica articles about recidivism prediction algorithms [ 81 ]. Bias in clinical research–focused algorithms has the potential to be equally devastating, for instance, by theoretically suggesting non-representative study cohorts on the basis of a lower predicted participant drop-out.

Guidelines toward overcoming operational and philosophical barriers to ML in clinical research

Because the operational problems previously detailed can potentiate the philosophical tangles of ML use in clinical research, many of the ways to overcome these hurdles overlap. The first and foremost approach to many of these issues includes data provenance, quality, and access. The open-access data sources previously discussed (MIMIC, UK Biobank) are good places to start, but inadequate on their own. Enhanced access to data and the technical expertise required to analyze it is needed. Attempts to render health data interoperable have been ongoing for decades, yielding data standard development initiatives and systems, such as the PCORnet Common Data Model [ 82 ], FHIR [ 83 ], i2b2 [ 84 ], and OMOP [ 85 ]. Recently, regulation requiring health data interoperability through use of core data classes and elements has been enacted by the US Department of Health and Human Services and Centers for Medicare and Medicaid Services on the basis of the 21st Century Cures Act [ 85 , 86 ]. Where barriers to data sharing persist, other options to improve the amount of data available include federated data and cloud-based data access, in which developers can train and validate models on data that they do not own or directly interact with [ 87 , 88 , 89 ]. This has become increasingly common in certain fields, such as genomics and informatics, as evidenced by large consortia, such as eMERGE and OHDSI [ 90 , 91 ].

Recently, a group of European universities and pharmaceutical companies have joined to create “MELODDY,” in which large amounts of drug development data will be shared while protecting companies’ proprietary information, though no academic publications have yet been produced [ 91 ]. “Challenges” in which teams compete to accomplish ML tasks often yield useful models, such as early sepsis prediction or more complete characterization of breast cancer cell lines, which can then be distributed to participating health institutions for validation in their local datasets [ 92 , 93 , 94 , 95 ].

Algorithm validation can both help ensure that ML models are appropriate for their intended clinical research use while also increasing stakeholder confidence in the use of ML in clinical research. Though the specifics continue to be debated, published best practices for specific use cases are emerging [ 96 ]; recent suggestions to standardize such reporting in a one-page “model card” are notable [ 97 ]. For instance, possible model characteristics that could be reported include the intended use cohort, intended outcome of interest, required input data structure and necessary transformations, model type and structure, training cohort specifics, consequences of model application outside of intended use, and algorithm management of uncertainty. Performance metrics that are useful for algorithm evaluation in clinical contexts include receiver-operating characteristic and precision-recall curves, calibration, net benefit, and c-statistic for benefit [ 92 ]. Depending on the intended use case, the most appropriate metrics to report or to optimize will differ. For instance, a model intended to identify patients at high risk for protocol non-adherence may have a higher tolerance for false-positives than one intended to simulate study drug dosages for trial planning. Consensus decisions about obligatory metrics for certain model structures and use cases are required to ensure that models with similar intended uses can be compared with one another. Developers will need to specify how often these metrics should be re-evaluated to assess for model drift. Ideally, evaluation of high-stakes clinical research models should be overseen by a neutral third party, such as a regulatory agency.

To foster trustworthiness even in the absence of explainability, it is essential that the model development and validation processes be transparent , including the reporting of model uncertainty. This may allow more advanced consumers to evaluate the model from a technical standpoint while at the very least helping less-advanced users to identify situations in which a model’s output should be approached with caution. For instance, understanding the source, structure, and drawbacks of the data used for model training and validation will provide insight into how the model’s output might be affected by the quality of the underlying data. However, trustworthiness may be built by running ML models in clinical research contexts in parallel with traditional research methods to show that the ML methods perform at least as well as traditional approaches. Though the importance of these principles may appear self-evident, the large number of ML models being used commercially for clinical research without reporting of the models’ development and performance characteristics suggests more work is needed to align stakeholders in this regard. Even while writing this manuscript, in which peer-reviewed publications were used whenever available, we encountered many cases in which the only “evidence” supporting a model’s performance was a commercial entity’s promotional material. In several other instances, the peer-reviewed articles available to support a commercial model’s performance offered no information at all about the model’s development or validation, which, as discussed earlier, is crucial to engendering trustworthiness. Another concerning aspect of commercial ML-enabled clinical research solutions is private companies’ and health care systems’ practice of training, validating, and applying models using patient data under the guise of quality improvement initiatives, thereby avoiding the need for ethical/institutional review board approval or patient consent [ 93 ]. This practice puts the entire field of ML development at risk of generating biased models and/or losing stakeholder buy-in (as occurred in dramatic fashion with the UK’s “Care.data” initiative) [ 94 ] and illustrates the need to build a more reasonable path toward ethical data sharing and more stringent processes surrounding model development and validation.

Although no FDA guidance is yet available specific to ML in clinical research, guidance on ML in clinical care and commentary from FDA representatives suggest several possible features of a regulatory approach to ML in clinical research. For instance, the FDA’s proposed ML-specific modifications to the “Software as a Medical Device” Regulations (SaMD) draw a distinction between fixed algorithms that were trained using ML techniques but frozen prior to deployment and those that continue to learn “in the wild.” These latter algorithms may more powerfully take advantage of the large amounts of data afforded by ongoing use but also pose additional risks of model drift with the potential need for iterative updates to the algorithm. In particular, model drift should often be expected because models that are incorporated into the decision-making process will inherently change the data they are exposed to in the future. The proposed ML-specific modifications to SaMD guidance outline an institution or organization-level approval pathway that would facilitate these ongoing algorithm updates within pre-approved boundaries (Fig. 3 ).

figure 3

FDA-proposed workflow to regulate machine learning algorithms under the Software as a Medical Device framework. From: Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device: Discussion paper and request for feedback. https://www.fda.gov/media/122535/download . Accessed 17 May 2020

The optimal frequency of model re-evaluation by the FDA has yet to be determined (and may vary based off the model type, training set, and intended use), but clearly some form of recurrent review will be needed, prompted either by a certain time period, certain events (for instance, a global pandemic), or both. Discussion with representatives from the FDA indicates that ML in clinical research is viewed as a potentially high-risk use case due to the potential to propagate errors or biases through the algorithm into research studies; however, its potential opportunities were widely appreciated. Until formalized guidance about ML in clinical research is released, the FDA has clearly stated a willingness to work with sponsors and stakeholders on a case-by-case basis to determine the appropriate role of ML in research intended to support a regulatory application. However, this regulatory uncertainty could potentially stifle sponsors’ and stakeholders’ willingness to invest in ML for clinical research until guidance is drafted. This, in turn, may require additional work at a legislative level to provide a framework for further FDA guidance.

Concerns of bias are central to clinical research even when ML is not involved: clinical research and care have long histories of gender, racial, and socioeconomic bias [ 95 , 96 ]. The ability of ML to potentiate and perpetuate bias in clinical research, possibly without study teams’ awareness, must be actively managed. To the extent that bias can be identified, it can often be addressed and reduced; a worst-case scenario is application of a model with unknown bias in a new cohort with high-stakes results. As with much of ML in clinical research, data quality and quantity are critical in combating bias. No single perfect dataset exists, especially as models trained on real-world data will replicate the intentional or unintentional biases of the clinicians and researchers who generated those data [ 97 ]. Therefore, training models on more independent and diverse datasets decreases the likelihood of occult bias [ 98 ]. Additionally, bias reduction can be approached through the model construction itself, such as by de-biasing word embeddings and using counterfactual fairness [ 99 , 100 , 101 , 102 ]. Clinical research teams may pre-specify certain subgroups of interest in which the algorithm must perform equally well [ 103 ]. Finally, while ML raises the specter of reinforcing and more efficiently operationalizing historical discrimination, ML may help us de-bias clinical research and care by monitoring and drawing attention to bias [ 98 ]. Bias reduction is an area of ML in clinical research in which multi-disciplinary collaboration is especially vital and powerful: clinical scientists may be able to share perspective on long-standing biases in their domains of expertise, while more diverse teams may offer innovative insights into de-biasing ML models.

While traditional double-blinded, randomized, controlled clinical trials with their associated statistical methodologies remain the gold standard for biomedical evidence generation, augmentation with ML techniques offers the potential to improve the success and efficiency of clinical research, increasing its positive impact for all stakeholders. To the extent that ML-enabled clinical research can improve the efficiency and quality of biomedical evidence, it may save human lives and reduce human suffering, introducing an ethical imperative to explore this possibility. Realizing this potential will require overcoming issues with data structure and access, definitions of outcomes, transparency of development and validation processes, objectivity of certification, and the possibility of bias. The potential applications of ML to clinical research currently outstrip its actual use, both because few prospective studies are available about the relative effectiveness of ML versus traditional approaches and because change requires time, energy, and cooperation. Stakeholder willingness to integrate ML into clinical research relies in part on robust responses to issues of data provenance, bias, and validation as well as confidence in the regulatory structure surrounding ML in clinical research. The use of ML algorithms whose development has been opaque and without peer-reviewed publication must be addressed. The attendees of the January 2020 conference on ML in clinical research represent a broad swath of stakeholders with differing priorities and clinical research–related challenges, but all in attendance agreed that communication and collaboration are essential to implementation of this promising technology. Transparent discussion about the potential benefits and drawbacks of ML for clinical research and the sharing of best practices must continue not only in the academic community but in the lay press and government as well to ensure that ML in clinical research is applied in a fair, ethical, and open manner that is acceptable to all.

Availability of data and materials

Not applicable

Change history

06 september 2021.

A Correction to this paper has been published: https://doi.org/10.1186/s13063-021-05571-4

Abbreviations

Electronic health record

US Food and Drug Administration

Machine learning

Natural language processing

Software as a Medical Device

Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–10. https://doi.org/10.1038/s41586-019-1923-7 .

Article   PubMed   CAS   Google Scholar  

Fauqueur JTA, Togia T. Constructing large scale biomedical knowledge bases from scratch with rapid annotation of interpretable patterns. In: Proceedings of the 18th BioNLP Workshop and Shared Task; 2019. https://doi.org/10.18653/v1/w19-5016 .

Chapter   Google Scholar  

Jia R, Wong C, Poon H. Document-level N-ary relation extraction with multiscale representation learning. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019; Minneapolis: Association for Computational Linguistics.  https://ui.adsabs.harvard.edu/abs/2019arXiv190402347J .

Dezso Z, Ceccarelli M. Machine learning prediction of oncology drug targets based on protein and network properties. BMC Bioinformatics. 2020;21(1):104. https://doi.org/10.1186/s12859-020-3442-9 .

Article   PubMed   PubMed Central   CAS   Google Scholar  

Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug-target interaction: a survey paper. Brief Bioinform. 2021;22(1):247–69. https://doi.org/10.1093/bib/bbz157 .

Liu QAM, Brockschmidt M, Gaunt AL. Constrained graph variational autoencoders for molecule design. NeurIPS 2018. 2018;arXiv:1805.09076:7806–15.

Madhukar NS, Khade PK, Huang L, Gayvert K, Galletti G, Stogniew M, et al. A Bayesian machine learning approach for drug target identification using diverse data types. Nat Commun. 2019;10(1):5221. https://doi.org/10.1038/s41467-019-12928-6 .

Langner S, Hase F, Perea JD, Stubhan T, Hauch J, Roch LM, et al. Beyond ternary OPV: high-throughput experimentation and self-driving laboratories optimize multicomponent systems. Adv Mater. 2020;32(14):e1907801. https://doi.org/10.1002/adma.201907801 .

Granda JM, Donina L, Dragone V, Long DL, Cronin L. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature. 2018;559(7714):377–81. https://doi.org/10.1038/s41586-018-0307-8 .

Koh D. Sumitomo Dainippon Pharma and Exscientia achieve breakthrough in AI drug discovery: Healthcare IT News - Portland, ME: Healthcare IT News; 2020.

Romero K, Ito K, Rogers JA, Polhamus D, Qiu R, Stephenson D, et al. The future is now: model-based clinical trial design for Alzheimer's disease. Clin Pharmacol Ther. 2015;97(3):210–4. https://doi.org/10.1002/cpt.16 .

Zhao Y, Zeng D, Socinski MA, Kosorok MR. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics. 2011;67(4):1422–33. https://doi.org/10.1111/j.1541-0420.2011.01572.x .

Article   PubMed   PubMed Central   Google Scholar  

trials.ai 2019 [cited 2021 February 2]. Available from: trials.ai .

Wong CH, Siah KW, Lo AW. Estimation of clinical trial success rates and related parameters. Biostatistics. 2019;20(2):273–86. https://doi.org/10.1093/biostatistics/kxx069 .

Article   PubMed   Google Scholar  

Schork NJ. Personalized medicine: time for one-person trials. Nature. 2015;520(7549):609–11. https://doi.org/10.1038/520609a .

Glicksberg BS, Miotto R, Johnson KW, Shameer K, Li L, Chen R, et al. Automated disease cohort selection using word embeddings from electronic health records. Pac Symp Biocomput. 2018;23:145–56.

PubMed   PubMed Central   Google Scholar  

Liao KP, Cai T, Savova GK, Murphy SN, Karlson EW, Ananthakrishnan AN, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ. 2015;350(apr24 11):h1885. https://doi.org/10.1136/bmj.h1885 .

Li L, Cheng WY, Glicksberg BS, Gottesman O, Tamler R, Chen R, et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci Transl Med. 2015;7(311):311ra174.

Article   CAS   Google Scholar  

Our Solution 2021 [cited 2021 February 2]. Available from: https://www.bullfrogai.com/our-solution/ .

Zhang X, Xiao C, Glass LM, Sun J. DeepEnroll: patient-trial matching with deep embedding and entailment prediction. In: Proceedings of the Web Conference 2020. Taipei: Association for Computing Machinery; 2020. p. 1029–37.

Calaprice-Whitty D, Galil K, Salloum W, Zariv A, Jimenez B. Improving clinical trial participant prescreening with artificial intelligence (AI): a comparison of the results of AI-assisted vs standard methods in 3 oncology trials. Ther Innov Regul Sci. 2020;54(1):69–74. https://doi.org/10.1007/s43441-019-00030-4 .

How it works 2019 [cited 2021 February 2]. Available from: https://deep6.ai/how-it-works/ .

Vassy JL, Ho YL, Honerlaw J, Cho K, Gaziano JM, Wilson PWF, et al. Yield and bias in defining a cohort study baseline from electronic health record data. J Biomed Inform. 2018;78:54–9. https://doi.org/10.1016/j.jbi.2017.12.017 .

Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, et al. Biases introduced by filtering electronic health records for patients with “complete data”. J Am Med Inform Assoc. 2017;24(6):1134–41. https://doi.org/10.1093/jamia/ocx071 .

Bain EE, Shafner L, Walling DP, Othman AA, Chuang-Stein C, Hinkle J, et al. Use of a novel artificial intelligence platform on mobile devices to assess dosing compliance in a phase 2 clinical trial in subjects with schizophrenia. JMIR Mhealth Uhealth. 2017;5(2):e18. https://doi.org/10.2196/mhealth.7030 .

Labovitz DL, Shafner L, Reyes Gil M, Virmani D, Hanina A. Using artificial intelligence to reduce the risk of nonadherence in patients on anticoagulation therapy. Stroke. 2017;48(5):1416–9. https://doi.org/10.1161/STROKEAHA.116.016281 .

Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247–8. https://doi.org/10.1001/jamadermatol.2018.2348 .

Burlingame EA, Margolin AA, Gray JW, Chang YH. SHIFT: speedy histopathological-to-immunofluorescent translation of whole slide images using conditional generative adversarial networks. Proc SPIE Int Soc Opt Eng. 2018;10581.  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6166432/ .

Han J, Chen K, Fang L, Zhang S, Wang F, Ma H, et al. Improving the efficacy of the data entry process for clinical research with a natural language processing-driven medical information extraction system: quantitative field research. JMIR Med Inform. 2019;7(3):e13331. https://doi.org/10.2196/13331 .

Fonferko-Shadrach B, Lacey AS, Roberts A, Akbari A, Thompson S, Ford DV, et al. Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system. BMJ Open. 2019;9(4):e023232. https://doi.org/10.1136/bmjopen-2018-023232 .

Gavrielov-Yusim N, Kurzinger ML, Nishikawa C, Pan C, Pouget J, Epstein LB, et al. Comparison of text processing methods in social media-based signal detection. Pharmacoepidemiol Drug Saf. 2019;28(10):1309–17. https://doi.org/10.1002/pds.4857 .

Barnett I, Torous J, Staples P, Sandoval L, Keshavan M, Onnela JP. Relapse prediction in schizophrenia through digital phenotyping: a pilot study. Neuropsychopharmacology. 2018;43(8):1660–6. https://doi.org/10.1038/s41386-018-0030-z .

Chaudhuri S, Oudejans D, Thompson HJ, Demiris G. Real-world accuracy and use of a wearable fall detection device by older adults. J Am Geriatr Soc. 2015;63(11):2415–6. https://doi.org/10.1111/jgs.13804 .

Chen R, Jankovic F, Marinsek N, Foschini L, Kourtis L, Signorini A, et al. Developing measures of cognitive impairment in the real world from consumer-grade multimodal sensor streams. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage: Association for Computing Machinery; 2019. p. 2145–55.

Yurtman A, Barshan B, Fidan B. Activity recognition invariant to wearable sensor unit orientation using differential rotational transformations represented by quaternions. Sensors (Basel). 2018;18(8):2725. https://pubmed.ncbi.nlm.nih.gov/30126235/ .

Lu K, Yang L, Seoane F, Abtahi F, Forsman M, Lindecrantz K. Fusion of heart rate, respiration and motion measurements from a wearable sensor system to enhance energy expenditure estimation. Sensors (Basel). 2018;18(9):3092. https://pubmed.ncbi.nlm.nih.gov/30223429/ .

Cheung YK, Hsueh PS, Ensari I, Willey JZ, Diaz KM. Quantile coarsening analysis of high-volume wearable activity data in a longitudinal observational study. Sensors (Basel). 2018;18(9):3056. https://pubmed.ncbi.nlm.nih.gov/30213093/ .

Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65–9. https://doi.org/10.1038/s41591-018-0268-3 .

Ozkanca Y, Ozturk MG, Ekmekci MN, Atkins DC, Demiroglu C, Ghomi RH. Depression screening from voice samples of patients affected by Parkinson's disease. Digit Biomark. 2019;3(2):72–82. https://doi.org/10.1159/000500354 .

Moreau A, Anderer P, Ross M, Cerny A, Almazan TH, Peterson B, et al. Detection of nocturnal scratching movements in patients with atopic dermatitis using accelerometers and recurrent neural networks. IEEE J Biomed Health Inform. 2018;22(4):1011–8. https://doi.org/10.1109/JBHI.2017.2710798 .

Han X, Hu Y, Foschini L, Chinitz L, Jankelson L, Ranganath R. Deep learning models for electrocardiograms are susceptible to adversarial attack. Nat Med. 2020;26(3):360–3. https://doi.org/10.1038/s41591-020-0791-x Epub 2020/03/11. PubMed PMID: 32152582.

Doerr M, Maguire Truong A, Bot BM, Wilbanks J, Suver C, Mangravite LM. Formative evaluation of participant experience with mobile econsent in the app-mediated Parkinson mPower study: a mixed methods study. JMIR Mhealth Uhealth. 2017;5(2):e14. https://doi.org/10.2196/mhealth.6521 .

Savova GK, Danciu I, Alamudun F, Miller T, Lin C, Bitterman DS, et al. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Cancer Res. 2019;79(21):5463–70. https://doi.org/10.1158/0008-5472.CAN-19-0579 .

Malke JC, Jin S, Camp SP, Lari B, Kell T, Simon JM, et al. Enhancing case capture, quality, and completeness of primary melanoma pathology records via natural language processing. JCO Clin Cancer Inform. 2019;3:1–11. https://doi.org/10.1200/CCI.19.00006 .

Vaci N, Liu Q, Kormilitzin A, De Crescenzo F, Kurtulmus A, Harvey J, et al. Natural language processing for structuring clinical text data on depression using UK-CRIS. Evid Based Ment Health. 2020;23(1):21–6. https://doi.org/10.1136/ebmental-2019-300134 .

Tian Q, Liu M, Min L, An J, Lu X, Duan H. An automated data verification approach for improving data quality in a clinical registry. Comput Methods Programs Biomed. 2019;181:104840. https://doi.org/10.1016/j.cmpb.2019.01.012 .

Estiri H, Murphy SN. Semi-supervised encoding for outlier detection in clinical observation data. Comput Methods Programs Biomed. 2019;181:104830. https://doi.org/10.1016/j.cmpb.2019.01.002 .

Glass, LMS G; Patil, R. AI in clinical development: improving safety and accelerating results. [White paper]. In press 2019.

Google Scholar  

Hicks KA, Mahaffey KW, Mehran R, Nissen SE, Wiviott SD, Dunn B, et al. 2017 Cardiovascular and stroke endpoint definitions for clinical trials. Circulation. 2018;137(9):961–72. https://doi.org/10.1161/CIRCULATIONAHA.117.033502 .

Liu Y, Gopalakrishnan V. An overview and evaluation of recent machine learning imputation methods using cardiac imaging data. Data (Basel). 2017;2(1):8. https://pubmed.ncbi.nlm.nih.gov/28243594/ .

Phung S, Kumar A, Kim J. A deep learning technique for imputing missing healthcare data. Conf Proc IEEE Eng Med Biol Soc. 2019;2019:6513–6. https://doi.org/10.1109/EMBC.2019.8856760 Epub 2020/01/18PubMed PMID: 31947333.

Article   Google Scholar  

Qiu YL, Zheng H, Gevaert OJ. A deep learning framework for imputing missing values in genomic data; 2018.

Book   Google Scholar  

Feng T, Narayanan S. Imputing missing data in large-scale multivariate biomedical wearable recordings using bidirectional recurrent neural networks with temporal activation regularization. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2019.

Luo Y, Szolovits P, Dighe AS, Baron JM. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc. 2018;25(6):645–53. https://doi.org/10.1093/jamia/ocx133 .

Ngufor C, Warner MA, Murphree DH, Liu H, Carter R, Storlie CB, et al. Identification of Clinically meaningful plasma transfusion subgroups using unsupervised random forest clustering. AMIA Annu Symp Proc. 2017;2017:1332–41.

PubMed   Google Scholar  

Tomic A, Tomic I, Rosenberg-Hasson Y, Dekker CL, Maecker HT, Davis MM. SIMON, an automated machine learning system, reveals immune signatures of influenza vaccine responses. J Immunol. 2019;203(3):749–59. https://doi.org/10.4049/jimmunol.1900033 .

Watson JA, Holmes CC. Machine learning analysis plans for randomised controlled trials: detecting treatment effect heterogeneity with strict control of type I error. Trials. 2020;21(1):156. https://doi.org/10.1186/s13063-020-4076-y .

Rigdon J, Baiocchi M, Basu S. Preventing false discovery of heterogeneous treatment effect subgroups in randomized trials. Trials. 2018;19(1):382. https://doi.org/10.1186/s13063-018-2774-5 .

Kalscheur MM, Kipp RT, Tattersall MC, Mei C, Buhr KA, DeMets DL, et al. Machine learning algorithm predicts cardiac resynchronization therapy outcomes: lessons from the companion trial. Circ Arrhythm Electrophysiol. 2018;11(1):e005499. https://doi.org/10.1161/CIRCEP.117.005499 .

Linden A, Yarnold PR. Combining machine learning and propensity score weighting to estimate causal effects in multivalued treatments. J Eval Clin Pract. 2016;22(6):871–81. https://doi.org/10.1111/jep.12610 .

Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol. 2017;185(1):65–73. https://doi.org/10.1093/aje/kww165 .

Wendling T, Jung K, Callahan A, Schuler A, Shah NH, Gallego B. Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Stat Med. 2018;37(23):3309–24. https://doi.org/10.1002/sim.7820 .

Schomaker M, Luque-Fernandez MA, Leroy V, Davies MA. Using longitudinal targeted maximum likelihood estimation in complex settings with dynamic interventions. Stat Med. 2019;38(24):4888–911. https://doi.org/10.1002/sim.8340 .

Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol. 2015;181(2):108–19. https://doi.org/10.1093/aje/kwu253 .

Gottesman O, Johansson F, Komorowski M, Faisal A, Sontag D, Doshi-Velez F, et al. Guidelines for reinforcement learning in healthcare. Nat Med. 2019;25(1):16–8. https://doi.org/10.1038/s41591-018-0310-5 .

Yoon J, Zame WR, Banerjee A, Cadeiras M, Alaa AM, van der Schaar M. Personalized survival predictions via trees of predictors: an application to cardiac transplantation. PLoS One. 2018;13(3):e0194985. https://doi.org/10.1371/journal.pone.0194985 .

Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med. 2018;24(11):1716–20. https://doi.org/10.1038/s41591-018-0213-5 .

Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. Practical guidance on artificial intelligence for health-care data. Lancet Digit Health. 2019;1(4):e157–9. https://doi.org/10.1016/S2589-7500(19)30084-6 .

Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337–40. https://doi.org/10.1038/s41591-019-0548-6 Epub 2019/08/21. PubMed PMID: 31427808.

Nestor B, McDermott M, Chauhan G, et al. Rethinking clinical prediction: why machine learning must consider year of care and feature aggregation. arXiv preprint 2018;arXiv:181112583.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):160035. https://doi.org/10.1038/sdata.2016.35 .

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018;5(1):180178. https://doi.org/10.1038/sdata.2018.178 .

UK Biobank. www.ukbiobank.ac.uk . Accessed 22 Mar 2021.

Gong JJ, Naumann T, Szolovits P, Guttag JV. Predicting clinical outcomes across changing electronic health record systems. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax: Association for Computing Machinery; 2017. p. 1497–505.

Beam AL, Manrai AK, Ghassemi M. Challenges to the reproducibility of machine learning models in health care. JAMA. 2020;323(4):305–6. https://doi.org/10.1001/jama.2019.20866 .

Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B. Sanity checks for saliency maps. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc.; 2018. p. 9525–36.

Wiegreffe S, Pinter Y. Attention is not not explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019.

Jain S, Wallace BC. Attention is not explanation: NAACL-HLT; 2019.

Serrano S, Smith NA. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Stroudsburg, PA, USA, 2019. Association for Computational Linguistics.

Sendak M, Elish MC, Gao M, Futoma J, Ratliff W, Nichols M, et al. “The human body is a black box”: supporting clinical decision-making with deep learning. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. Barcelona: Association for Computing Machinery; 2020. p. 99–109.

Angwin J LJ, Mattu S, Kirchner L. Machine bias. ProPublica. 2016 13 May 2020. Available from: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing .

Qualls LG, Phillips TA, Hammill BG, Topping J, Louzao DM, Brown JS, et al. Evaluating foundational data quality in the National Patient-Centered Clinical Research Network (PCORnet(R)). EGEMS (Wash DC). 2018;6(1):3.

PubMed Central   Google Scholar  

Bosca D, Moner D, Maldonado JA, Robles M. Combining archetypes with fast health interoperability resources in future-proof health information systems. Stud Health Technol Inform. 2015;210:180–4.

Klann JG, Abend A, Raghavan VA, Mandl KD, Murphy SN. Data interchange using i2b2. J Am Med Inform Assoc. 2016;23(5):909–15. https://doi.org/10.1093/jamia/ocv188 .

Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54–60. https://doi.org/10.1136/amiajnl-2011-000376 .

21st Century Cures Act: Interoperability, information blocking, and the ONC Health IT Certification Program [updated 1 May 2020]. Available from: https://www.federalregister.gov/documents/2020/05/01/2020-07419/21st-century-cures-act-interoperability-information-blocking-and-the-onc-health-it-certification . Accessed 16 May 2020.

Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform. 2020. Epub 2020/04/01. https://doi.org/10.1093/bib/bbaa032 .

Czeizler E, Wiessler W, Koester T, Hakala M, Basiri S, Jordan P, et al. Using federated data sources and Varian Learning Portal framework to train a neural network model for automatic organ segmentation. Phys Med. 2020;72:39–45. https://doi.org/10.1016/j.ejmp.2020.03.011 .

Zerka F, Barakat S, Walsh S, Bogowicz M, Leijenaar RTH, Jochems A, et al. Systematic review of privacy-preserving distributed machine learning from federated databases in health care. JCO Clin Cancer Inform. 2020;4:184–200. https://doi.org/10.1200/CCI.19.00047 .

McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4(1):13. https://doi.org/10.1186/1755-8794-4-13 .

Boyce RD, Ryan PB, Noren GN, Schuemie MJ, Reich C, Duke J, et al. Bridging islands of information to establish an integrated knowledge base of drugs and health outcomes of interest. Drug Saf. 2014;37(8):557–67. https://doi.org/10.1007/s40264-014-0189-0 .

van Klaveren D, Steyerberg EW, Serruys PW, Kent DM. The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects. J Clin Epidemiol. 2018;94:59–68. https://doi.org/10.1016/j.jclinepi.2017.10.021 .

Robbins RBE. An invisible hand: patients aren’t being told about the AI systems advising their care. STAT; 2020.

Sterckx S, Rakic V, Cockbain J, Borry P. “You hoped we would sleep walk into accepting the collection of our data”: controversies surrounding the UK care.data scheme and their wider relevance for biomedical research. Med Health Care Philos. 2016;19(2):177–90. https://doi.org/10.1007/s11019-015-9661-6 .

Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Confronting racial and ethnic disparities in health care. Washington (DC): National Academies Press; 2003.

Criado PC. Invisible women. New York: Harry N. Abrams; 2019.

Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M. Hurtful words: quantifying biases in clinical contextual word embeddings. In: Proceedings of the ACM Conference on Health, Inference, and Learning. Toronto: Association for Computing Machinery; 2020. p. 110–20.

Chen IY, Joshi S, Ghassemi M. Treating health disparities with artificial intelligence. Nat Med. 2020;26(1):16–7. https://doi.org/10.1038/s41591-019-0649-2 .

Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: Curran Associates Inc.; 2016. p. 4356–64.

Kusner, Matt, Loftus, Joshua, Russell, Chris and Silva, Ricardo. Counterfactual fairness Conference. Proceedings of the 31st International Conference on Neural Information Processing Systems Conference. Long Beach, California, USA Publisher: Curran Associates Inc; 2017:4069–4079.

Hardt M, Price E, Srebro N. Equality of opportunity in supervised learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: Curran Associates Inc.; 2016. p. 3323–31.

Ustun B, Liu Y, Parkes D. Fairness without harm: decoupled classifiers with preference guarantees. In: Kamalika C, Ruslan S, editors. Proceedings of the 36th International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR %J Proceedings of Machine Learning Research; 2019. p. 6373–82.

Noseworthy PA, Attia ZI, Brewer LC, Hayes SN, Yao X, Kapa S, et al. Assessing and Mitigating bias in medical artificial intelligence: the effects of race and ethnicity on a deep learning model for ECG analysis. Circ Arrhythm Electrophysiol. 2020;13(3):e007988. https://doi.org/10.1161/CIRCEP.119.007988 .

Download references

Acknowledgements

The authors would like to acknowledge the contributions of Peter Hoffmann and Brooke Walker to the editing and preparation of this manuscript.

Funding support for the meeting was provided through registration fees from Amgen Inc., AstraZeneca, Bayer AG, Boehringer-Ingelheim, Cytokinetics, Eli Lilly & Company, Evidation, IQVIA, Janssen, Microsoft, Pfizer, Sanofi, and Verily. No government funds were used for this meeting.

Author information

Authors and affiliations.

Duke Clinical Research Institute, Duke University School of Medicine, Box 2834, Durham, NC, 27701, USA

E. Hope Weissler, Scott H. Kollins, Lesley Curtis & Erich Huang

Microsoft Research, Cambridge, MA, USA

Tristan Naumann

AstraZeneca, Gothenburg, Sweden

Tomas Andersson, Faisal Khan, Khader Shameer & Emmette Hutchison

Courant Institute of Mathematical Science, New York University, New York, NY, USA

Rajesh Ranganath

Englander Institute for Precision Medicine, Weill Cornell Medical College, New York, NY, USA

Olivier Elemento

Northwestern University Clinical and Translational Sciences Institute, Northwestern University, Chicago, IL, USA

Division Pharmaceuticals, Open Innovation and Digital Technologies, Bayer AG, Wuppertal, Germany

Daniel F. Freitag

University of Alberta, Edmonton, Alberta, Canada

James Benoit

Department of Computer Science, Tufts University, Medford, MA, USA

Michael C. Hughes

Billion Minds, Inc., Seattle, WA, USA

Paul Slater

Verana Health, San Francisco, CA, USA

Matthew Roe

Boehringer-Ingelheim, Burlington, Canada

Sanofi, Cambridge, MA, USA

Zhaoling Meng

Sanofi, Washington, DC, USA

Jennifer L. Wong

Duke Forge, Durham, NC, USA

Erich Huang

Vector Institute, University of Toronto, Toronto, Ontario, Canada

Marzyeh Ghassemi

Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, USA

Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, USA

CIFAR AI Chair, Vector Institute, Toronto, Ontario, Canada

You can also search for this author in PubMed   Google Scholar

Contributions

All authors contributed to the conception and design of the work and the analysis and interpretation of the data consisting of reports (peer-reviewed and otherwise) concerning the development, performance, and use of ML in clinical research. EHW drafted the work. All authors substantively revised the work. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to E. Hope Weissler .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

HW has nothing to disclose.

TN has nothing to disclose.

TA is an employee of AstraZeneca.

RR has nothing to disclose.

OE is a co-founder of and holds equity in OneThree Biotech and Volastra Therapeutics and is scientific advisor for and holds equity in Freenome and Owkin,

YL has nothing to disclose.

DF is an employee of Bayer AG, Germany.

JB has nothing to disclose.

MH reports personal fees from Duke Clinical Research Institute, non-financial support from RGI Informatics, LLC, and grants from Oracle Labs.

FK is an employee of AstraZeneca.

PS has nothing to disclose.

SK is an employee of AstraZeneca; has served as an advisor for Kencor Health and OccamzRazor; has received consulting fees from Google Cloud (Alphabet), McKinsey, and LEK Consulting; was an employee of Philips Healthcare; and has a patent (Diagnosis and Classification of Left Ventricular Diastolic Dysfunction Using a Computer) issued to MSIP.

Dr. Roe reports grants from the American College of Cardiology, American Heart Association, Bayer Pharmaceuticals, Familial Hypercholesterolemia Foundation, Ferring Pharmaceuticals, Myokardia, and Patient Centered Outcomes Research Institute; grants and personal fees from Amgen, AstraZeneca, and Sanofi Aventis; personal fees from Janssen Pharmaceuticals, Elsevier Publishers, Regeneron, Roche-Genetech, Eli Lilly, Novo Nordisk, Pfizer, and Signal Path; and is an employee of Verana Health.

EH is an employee of AstraZeneca.

SK reports personal fees from Holmusk.

UB is an employee of Boehringer-Ingelheim.

ZM has nothing to disclose.

JW reports being an employee of Sanofi US.

LC has nothing to disclose.

EH reports personal fees from Valo Health and is a founder of (with equity in) kelaHealth and Clinetic.

MG has nothing to disclose.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised:Following the publication of the original article, we were notified that current affiliations 17, 18 and 19 were erroneously added to the first author rather than the senior author (Marzyeh Ghassemi).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Weissler, E.H., Naumann, T., Andersson, T. et al. The role of machine learning in clinical research: transforming the future of evidence generation. Trials 22 , 537 (2021). https://doi.org/10.1186/s13063-021-05489-x

Download citation

Received : 30 April 2021

Accepted : 26 July 2021

Published : 16 August 2021

DOI : https://doi.org/10.1186/s13063-021-05489-x

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Clinical trials as topic; Machine learning
  • Artificial intelligence
  • Research design
  • Research ethics

ISSN: 1745-6215

  • Submission enquiries: Access here and click Contact Us
  • General enquiries: [email protected]

machine learning in healthcare research papers pdf

Application of Machine Learning in Healthcare: An Analysis

Ieee account.

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Advertisement

Advertisement

Systematic Mapping Study of AI/Machine Learning in Healthcare and Future Directions

  • Survey Article
  • Published: 16 September 2021
  • Volume 2 , article number  461 , ( 2021 )

Cite this article

  • Gaurav Parashar   ORCID: orcid.org/0000-0003-4869-1819 1 ,
  • Alka Chaudhary 1 &
  • Ajay Rana 1  

4154 Accesses

9 Citations

3 Altmetric

Explore all metrics

This study attempts to categorise research conducted in the area of: use of machine learning in healthcare , using a systematic mapping study methodology. In our attempt, we reviewed literature from top journals, articles, and conference papers by using the keywords use of machine learning in healthcare . We queried Google Scholar, resulted in 1400 papers, and then categorised the results on the basis of the objective of the study, the methodology adopted, type of problem attempted and disease studied. As a result we were able to categorize study in five different categories namely, interpretable ML, evaluation of medical images, processing of EHR, security/privacy framework, and transfer learning. In the study we also found that most of the authors have studied cancer, and one of the least studied disease was epilepsy, evaluation of medical images is the most researched and a new field of research, Interpretable ML/Explainable AI, is gaining momentum. Our basic intent is to provide a fair idea to future researchers about the field and future directions.

Similar content being viewed by others

machine learning in healthcare research papers pdf

Deep learning in radiology: ethics of data and on the value of algorithm transparency, interpretability and explainability

Alvaro Fernandez-Quilez

machine learning in healthcare research papers pdf

Designing User-Centric Explanations for Medical Imaging with Informed Machine Learning

machine learning in healthcare research papers pdf

Interpretable AI in Healthcare: Enhancing Fairness, Safety, and Trust

Avoid common mistakes on your manuscript.

Introduction

Artificial intelligence (AI) can be defined as a field in which the machine demonstrates intelligence by learning itself. It can be done by deploying various techniques & algorithms to understand human intelligence but does not confine to—John McCarthy. Even though if we do not specifically program the machine and still it can automatically learn and improve itself this defines an intelligent behaviour of machine. Machine learning (ML) is a specific field of AI which relates to techniques that can automatically learn from experience.

The use of machine learning in healthcare had shown many promising solutions which had created confidence in the field. Researchers had used ICT tools with ML in developing solutions for increasing the effectiveness of the earlier methods or procedures. The field of healthcare had also shown tremendous improvement after the use of Big Data, ICT, and AI/machine learning (ML) in precision and speed. These tools have greatly helped physicians and healthcare professionals in their day-to-day working, research, testing the effect of biomedicine on humans using simulations. Every single detail of the patient gets recorded by the doctors with the other information like clinical notes, prescriptions, medical test results, diagnosis, X-rays, MRI scan, sonographic images, etc. This data becomes huge repository of information, which if churned, could give us better insights of treatment, fruitful suggestions and recommendations in diagnosis, progressive pattern of one disease could be correlated to another disease and may lead to new procedure for treatment of a disease and many more. There may be a chance that healthcare professional overlooked a symptom, which if not addressed early could lead to loss of life. Therefore, tools like AI/ML could help in better healthcare services.

The use of tools like IBM Watson Footnote 1 and Google DeepMind [ 1 ] have shown impressive results in healthcare. On top of these tools researchers and developers have designed applications, which harness the capabilities of these tools, to provide personalised patient care, better drug discovery, and improved healthcare organisational performance. According to Wired Footnote 2 Google DeepMind was used to identify protein structures associated with SARS-CoV-2 and understand how the virus functions. One of the oldest scientific puzzle of ’protein folding problem’ was also solved and paved the way for faster development of drugs, better treatment by Google DeepMind. Other contributions in the field of healthcare are use of association rules(AR) which helped analyse malaria in Brazil [ 2 ], according to [ 3 ] diagnosing images of X-rays revealed respiratory condition of the patient and helped in better healthcare services.

ML can be applied to varied fields like defence, automation, finance, automobile, and manufacturing in performing tasks like classification, clustering, and forecasting. It can be categorised into three types supervised, unsupervised, and reinforcement learning.

In supervised learning, algorithms learn from the labeled datasets and prepare a model. After training, we give data to the model, which the model has not seen earlier and belongs to the same category so that it can correctly classify it. In unsupervised learning, algorithms themselves learn, by analysing data and the model then prepares a model, which can be used to correctly cluster the elements. Lastly, in reinforcement learning, machine learns itself from its mistakes or maximising rewards and by reducing penalties.

In this paper, we aim to categorise papers based on the healthcare and machine learning.With the use of AI and ML in healthcare, there have been significant changes in the life of healthcare professionals. The accuracy of medical diagnostic has increased, healthcare professionals have an assistant on which they can rely, they can predict diseases like pneumonia, cancer, heart diseases, Tumour, COVID-19, and many more with better accuracy & precision than before.

In this paper, we attempt to categorise research done in use of machine learning in healthcare. According to best of our knowledge this type of categorisation has not been done earlier . This attempt will become the basis of future research in the field. We attempt to categorise them on the basis of the objective of the study, methodology adopted, type of problem attempted, etc. We discussed in section “ Research Methodology ”, in section “ Literature Survey ”, in section “ Results ” and section “ Conclusion ”.

Research Methodology

This section describes the systematic mapping procedure adopted to study the use of AI/ML in the healthcare domain. This study was conducted using the keywords “Machine Learning” OR ”Healthcare” . The search was conducted on Google scholar and considered only results from Nature, Wiley periodicals, Elsevier, Taylor and Francis, IEEE transactions, ACM, SVN, IET, and ArXiv. Following steps were carried out: (1) Definition of research questions (2) conduct search for primary studies (3) screening of papers for inclusion and exclusion (4) keywording using abstracts (5) data extraction and mapping of studies. The above steps were proposed by [ 4 ].

figure 1

The systematic mapping process

The Systematic Mapping Process

We have adopted the systematic mapping process from [ 4 ] and applied it to the study conducted on use of ML in the healthcare .

Systematic Mapping Process is a well defined, comprehensive overview study done on a particular research topic. According to [ 5 ] it helps researchers do verifiable, unbiased literature review, find research gaps by critical examination of research literature, helps collate evidence, reduce reviewer selection bias & publication bias with transparent inclusion and exclusion criteria.

The process is described here:

We first define research questions and scope of the study.

With respect to the questions framed from the previous step now search is conducted and literature is collected.

Proper screening is done to check whether the selected literature is related to the research question and scope of the research.

Abstract and keywords are scanned for critical survey of the content

In a spreadsheet, collected data is mapped with the RQs.

In the following (see Fig.  1 ) we had shown the process which we had implemented in the study.

Definition of Research Questions

The main intent of the study is to find out the use of ML in the field of healthcare. To start with we have formulated three research questions(see Table  1 ) which are based on the topic of the study. Major goals of systematic mapping study are:-

Find review of the research area

Find the quantity, type of research, and result

Find journals of the published research topic

Therefore, on the basis of the above goals following research questions have been formed.

What type of research has been conducted in the field of use of AI/ML in healthcare? Rationale: This question aims to find the type of research, which has been conducted under the field of the healthcare domain. We need to find out papers published under the topic.

What are the broad categories of papers published under the topic? Rationale: The rationale for this question arises from the outcome of the RQ1. FQ1 gives research papers, then we need to find out the broad category under which the paper lie.

What are the different diseases which have been studied and total number of total paper published under it? Rationale: After categorising the papers, we need to find out different diseases which are being studied in the research done by other researchers. The main intent is to find the least studied disease, which can become a starting point for new research.

Conduct Search for Primary Studies

To conduct the search we followed the steps:

Prepare the search string w.r.t to different databases(as described in Table 2 ). Since we have used only Google scholar therefore we had used a broad search string to cover all papers containing the keywords healthcare and machine learning .

Execute the search and collect the results (see Fig.  2 ).

Categorise the results by studying the papers and grouping them together on the basis of the disease studied (As mentioned in the Table  6 ) & intent of the paper (As mentioned in the Table  5 ).

figure 2

Raw text from Google scholar results

We took 1400 search results from google scholar and transferred them to spreadsheet based on the query mentioned in Table  2 . This data of around 1400 entries will be further drilled down in next section by excluding the entries which are not related to the study.

Screening of Papers for Inclusion and Exclusion (Relevant Papers)

In this step we exclude all the papers that are not relevant in the study. By this we also mean that the papers which are not related to the RQs (Refer Table  5 ), papers which do not from Nature, Wiley periodicals, Elsevier, Taylor and Francis, IEEE transactions, ACM, SVN, IET, and ArXiv are excluded from the final list.

Using the above criteria we retained those entries, which are based on Inclusion criteria (Refer Table  3 ). After using the above exclusion criteria we drilled down the entries which we finally considered were 42 .

Keywording Using Abstracts

For our study we followed the systematic process of classifying the results from Google Scholar. For Keywording we followed these steps:

The result collected from the previous step are analysed by surveying abstract.

Abstract are surveyed for keywords and content. Then context of the study is evaluated.

Group the result on the basis of context and keywords (Refer Table  4 ).

Data Extraction and Mapping of Studies

We collected all the information in a spreadsheet with the information like s.no., paper title, abstract, keywords, year of publication, authors, name of publisher, name of periodical/journal/conference, major findings, major shortcomings. After that we mapped the RQs (see Table 1 ) to each entry.

Literature Survey

Interpretable machine learning.

Interpretable models are those which explains itself. Interpretable models are linear regression, logistic regression and decision trees. For instance, if we use decision tree model then we can easily extract decision rules as explanations for the model.

In [ 6 ] authors referred to the use of ML in healthcare with an emphasis on Interpretability. Interpretable ML refers to models which can provide rationale on predictions made by the model. The basic impediment in the adoption of ML in healthcare is its BlackBox nature. Since we have to develop ML as a tool that can act as an assistant to physicians, therefore, we need to make its output more explainable. Mere providing metrics like AUC, recall, precision, F-Score may not suffice. We need to develop more interpretable models that themselves can provide explanations of their predictions. The authors [ 7 ] proposed a model which adds important value to features and make the output interpretable. Authors [ 8 ] developed reasoning through the use of visual indicators making the model interpretable. In [ 9 ] authors proposed an interpretable tree from a decision forest making understandable by humans. As proposed in [ 10 , 11 ] interpretable ML models helps develop a reasonable and data-driven decision support system that results in personalised decisions.

Authors [ 12 ] applied deep learning on medical data of patients for developing interpretable predictions for decision support.

Evaluation of Medical Images

In this category, the authors discussed evaluation of medical images for better diagnosis using machine learning models.

In [ 13 , 14 , 15 , 16 , 17 ] authors used deep learning models, Neural Network to classify different diseases, organ segmentation and compared it with the diagnosis of health care professionals for diagnostic accuracy. In [ 18 ] authors proposed a novel colour deconvolution for stain separation and colour normalisation of images. In [ 19 ] authors performed a comparison of five colour normalisation algorithms and found stain colour normalisation algorithms performed better, which had high stain segmentation accuracy and low computational complexity. In their review paper authors [ 20 ] did a comparison of different image segmentation methods and related studies with medical imaging for AI-assisted diagnosis of COVID-19. In [ 21 ] authors explained AI, ML, DL, and CNN and the use of these techniques in imaging. [ 22 ] discussed image enhancements method with noise suppression by enhancing low light regions.

Processing of Electronic Health Record (EHR)

In this category, we had compiled papers that had processed electronic health records of patients.

In the paper [ 17 ] authors proposed diagnostic of pneumonia in a resource-constrained environment. The authors of [ 23 , 24 , 25 ] discussed the processing of electronic health records and used ML algorithms to categorise disease. The authors [ 26 ] trained their proposed model on large dataset and performed regression and classification to check their effectiveness and accuracy. In [ 27 ], a medical recommendation system was proposed using Fast Fourier transformation coupled with a machine learning ensemble model. The model uses this model for disease risk prediction to provide medical recommendations like medical tests and other recommendations for chronic heart disease patients. In [ 28 ] authors proposed the use of graphical structure of electronic health records and find hidden structure from it. In [ 29 ] proposed a model that provides help to physicians to evaluate the quality of evidence for better decision making. Authors used risk of bias assessment in textual data using Natural Language Processing.

Security/Privacy Framework

Under this category, we will summarised papers related to the security and privacy framework for safeguarding health records transferred over network or internet.

Authors of [ 30 ] researched on novel design of smart and secure healthcare information system by adopting machine learning. It also employed advanced security mechanism to handle the big data of the healthcare industry. This framework used many security tools to secure the data like encryption, monitoring the activity, access control, and many other mechanism. This paper [ 31 ] discussed the privacy-preserving collaborative model using ML tools for medical diagnosis systems.

Most of the privacy protection methods are centralized. There is a need for a decentralized system that can help in mitigating several challenges like single-point-of-failure, modification of records, privacy preservation, improper information exchange that may lead to risk of patient’s life. To protect, many researchers have proposed different algorithms [ 32 , 33 , 34 , 35 ]. Models like VERTIGO, GLORE, and WebDISCO were designed for privacy preservation and predictive modelling. These models aimed to preserve privacy by sending partially-trained machine learning models rather than patient data. This way the information is preserved, and develop trust between different parties.

Many other distributed privacy-preserving models were developed those were based on Blockchain technology. They use the technology to update models as in Blockchain like ModelChain, EXPLORER, Distributed Autonomous Online Learning sequentially.

Secure multiparty computation(SMC) for privacy preservation that do computations on encrypted data with personally identifiable information had opened a new dimension. Data is a very precious commodity, therefore techniques like privacy preserving scoring of Tree Ensembles [ 36 ] are designed to provide a framework that provides cryptographic protocols for sending data securely.

Transfer Learning

In this category, we summarised research papers related to transfer learning. Transfer learning is a technique in which we gain knowledge from one problem and use the same knowledge to solve different but related problem. In [ 37 ] authors proposed a technique for handling missing data using transfer learning perspective. The proposed classifier learn weights and then complete portion of the dataset and then transfer it to the target domain. In [ 38 ] authors used transfer learning approach to predict breast cancer using model trained on a task some other task. A model trained on ImageNet databases containing 1.2 million images is used as feature extractor. The model is combined with other components to perform classification. [ 39 , 40 ] uses data generated by different wearable devices using federated learning, and then builds machine learning model by transfer learning. The study was applied to diagnose Parkinson’s disease.

After the systematic mapping process we got categories of research literature as mentioned in Table  5 . This table describes category and total number of papers under the category.

From Table  5 we can clearly observe that most research is being done in Processing of Medical Images this might be due to the availability of the dataset for research purpose. In case of Processing of EHR, which is second most researched category, might be again due to availability of the dataset. In case of Interpretable ML, since it is a new field and slowly gaining momentum therefore, researchers are taking interest because this gives a rationale to the outcome of the model result. This is a vey important attribute, when it comes to certain domains where high stakes are at risk. Like for example in healthcare, defence, and finance. Lastly, In case of Transfer Learning, it is a field which talks about using the domain knowledge of one domain and use it in another related domain. So according to us researchers use this technique to apply for testing the results. Therefore it has a very limited number of research.

From Table  6 it is clearly evident that most researched disease is Cancer, which is 38 and Pneumonia with frequency as 4, Alzheimer as 3, Parkinson as 2 and Epilepsy as 2 are the least researched diseases. These results have been extracted from 1400 papers downloaded from Google Scholar.

In this paper, we have provided a brief overview of the directions of research in the healthcare domain using Machine Leaning. As described earlier, these papers can show researchers path where they can work. The result is based on literature review done on around 1400 papers and filtered down to 42 papers. As described in section “ Literature Survey ”, we categorised the research into 5 broad areas and found most of the research is done in the field of Evaluation of Medical Images in which authors researched many diseases like cancer, heart disease, COVID-19, Parkinson, etc. Authors used different kinds of dataset like images, voice, and electronic health records. Using these dataset they predicted these diseases using machine learning/AI. As described in section “ Processing of Electronic Health Record (EHR) ” second major contribution is done in this category. We would like to conclude that in section “ Interpretable Machine Learning ” very little research has been done so this area can be chosen for further research.

https://www.healthcareglobal.com/technology-and-ai-3/four-ways-which-watson-transforming-healthcare-sector

https://www.wired.co.uk/article/ai-healthcare-boom-deepmind

Powles J, Hodson H. Google DeepMind and healthcare in an age of algorithms. Health Technol. 2017;7(4):351.

Article   Google Scholar  

Baroni L, Salles R, Salles S, Guedes G, Porto F, Bezerra E, Barcellos C, Pedroso M, Ogasawara E. An analysis of malaria in the Brazilian Legal Amazon using divergent association rules. J Biomed Inf. 2020;108:103512.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, et al. 2017. arXiv:1711.05225 .

Petersen K, Feldt R, Mujtaba S, Mattsson M. In: 12th international conference on evaluation and assessment in software engineering (EASE) 12; 2008. pp. 1–10.

Haddaway NR, Westgate MJ. Predicting the time needed for environmental systematic reviews and systematic maps. Conserv Biol. 2019;33(2):434.

Ahmad MA, Eckert C, Teredesai A. In: Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics; 2018. pp. 559–60.

Lundberg S, Lee SI. 2017. arXiv:1705.07874 .

Yu F, Ip HH. Semantic content analysis and annotation of histological images. Comput Biol Med. 2008;38(6):635.

Sagi O, Rokach L. Explainable decision forest: transforming a decision forest into an interpretable tree. Inf Fusion. 2020;61:124.

Stiglic G, Kocbek P, Fijacko N, Zitnik M, Verbert K, Cilar L. Interpretability of machine learning-based prediction models in healthcare. Wiley Interdiscipl Rev Data Min Knowl Disc. 2020;10(5):e1379.

Google Scholar  

Ribeiro MT, Singh S, Guestrin C. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. pp. 1135–44.

Rebane J, Samsten I, Papapetrou P. Exploiting complex medical data with interpretable deep learning for adverse drug event predictio. Artif Intell Med. 2020;109:101942.

Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, Mahendiran T, Moraes G, Shamdas M, Kern C, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Dig Health. 2019;1(6):e271.

Panayides AS, Amini A, Filipovic ND, Sharma A, Tsaftaris SA, Young A, Foran D, Do N, Golemati S, Kurc T, et al. AI in medical imaging informatics: current challenges and future directions. IEEE J Biomed Health Inf. 2020;24(7):1837.

Oktay O, Ferrante E, Kamnitsas K, Heinrich M, Bai W, Caballero J, Cook SA, De Marvao A, Dawes T, Oregan DP, et al. Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Trans Med Imaging. 2017;37(2):384.

Jeyaraj PR, Nadar ERS. Deep Boltzmann machine algorithm for accurate medical image analysis for classification of cancerous region. Cogn Comput Syst. 2019;1(3):85.

Harmon SA, Sanford TH, Xu S, Turkbey EB, Roth H, Xu Z, Yang D, Myronenko A, Anderson V, Amalou A, et al. A systematic review of antibody mediated immunity to coronaviruses: kinetics, correlates of protection, and association with severity. Nature Commun. 2020;11(1):1.

Zheng Y, Jiang Z, Zhang H, Xie F, Shi J, Xue C. Adaptive color deconvolution for histological WSI normalization. Comput Methods Progr Biomed. 2019;170:107.

Hoffman RA, Kothari S, Wang MD. In: 2014 36th annual international conference of the IEEE engineering in medicine and biology society, IEEE; 2014. pp. 194–7.

Feng S, et al. 2020. arXiv:2004.02731 .

Currie G, Hawk KE, Rohren E, Vial A, Klein R. Machine learning and deep learning in medical imaging: intelligent imaging. J Med Imaging Radiat Sci. 2019;50(4):477.

Xia W, Chen EC, Peters T. Endoscopic image enhancement with noise suppression. Healthcare Technol Lett. 2018;5(5):154.

Capotorti A. Probabilistic inconsistency correction for misclassification in statistical matching, with an example in health care. Int J Gener Syst. 2020;49(1):32.

Article   MathSciNet   Google Scholar  

Li JP, Haq AU, Din SU, Khan J, Khan A, Saboor A. Heart disease identification method using machine learning classification in e-healthcare. IEEE Access. 2020;8:107562.

Naydenova E, Tsanas A, Casals-Pascual C, De Vos M. In: 2015 IEEE global humanitarian technology conference (GHTC), IEEE; 2015. pp. 377–84.

Haq AU, Li JP, Memon MH, Malik A, Ahmad T, Ali A, Nazir S, Ahad I, Shahid M, et al. Feature selection based on L1-norm support vector machine and effective recognition system for Parkinson disease using voice recordings. IEEE Access. 2019;7:37718.

Zhang J, Lafta RL, Tao X, Li Y, Chen F, Luo Y, Zhu X. Coupling a fast fourier transformation with a machine learning ensemble model to support recommendations for heart disease patients in a telehealth environment. IEEE Access. 2017;5:10674.

Choi E, Xu Z, Li Y, Dusenberry M, Flores G, Xue E, Dai A. In: Proceedings of the AAAI conference on artificial intelligence, vol. 34; 2020. pp. 606–13.

Pereira RG, Castro GZ, Azevedo P, Tôrres L, Zuppo I, Rocha T, Júnior AAG. In: 2020 IEEE 33rd international symposium on computer-based medical systems (CBMS), IEEE; 2020. pp. 1–6.

Kaur P, Sharma M, Mittal M. Big data and machine learning based secure healthcare framework. Procedia Comput Sci. 2018;132:1049.

Wang F, Zhu H, Liu X, Lu R, Hua J, Li H, Li H. Privacy-preserving collaborative model learning scheme for E-healthcare. IEEE Access. 2019;7:166054.

Wu Y, Jiang X, Kim J, Ohno-Machado L. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. J Am Med Inf Assoc. 2012;19(5):758.

Kuo TT, Ohno-Machado L. Modelchain: decentralized privacy-preserving healthcare predictive modeling framework on private blockchain networks. 2018.

Wang S, Jiang X, Wu Y, Cui L, Cheng S, Ohno-Machado L. Expectation propagation logistic regression (explorer): distributed privacy-preserving online model learning. J Biomed Inf. 2013;46(3):480.

Li Y, Jiang X, Wang S, Xiong H, Ohno-Machado L. Vertical grid logistic regression (vertigo). J Am Med Inf Assoc. 2016;23(3):570.

Fritchman K, Saminathan K, Dowsley R, Hughes T, De Cock M, Nascimento A, Teredesai A. In: 2018 IEEE international conference on big data (Big Data); 2018. pp. 2413–22. https://doi.org/10.1109/BigData.2018.8622627 .

Wang G, Lu J, Choi KS, Zhang G. A transfer-based additive LS-SVM classifier for handling missing data. IEEE Trans Cybern. 2018;50(2):739.

Dey N, Das H, Naik B, Behera HS. Big data analytics for intelligent healthcare management. Cambridge: Academic Press; 2019.

Chen Y, Qin X, Wang J, Yu C, Gao W, Chen Y, Qin X, Wang J, Yu C, Gao W. Fedhealth: a federated transfer learning framework for wearable healthcare. IEEE Intell Syst. 2020;35(4):83.

Download references

Author information

Authors and affiliations.

AIIT, AMITY University, Noida, Uttar Pradesh, India

Gaurav Parashar, Alka Chaudhary & Ajay Rana

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Gaurav Parashar .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Intelligent Systems” guest edited by Geetha Ganesan, Lalit Garg, Renu Dhir, Vijay Kumar and Manik Sharma.

Rights and permissions

Reprints and permissions

About this article

Parashar, G., Chaudhary, A. & Rana, A. Systematic Mapping Study of AI/Machine Learning in Healthcare and Future Directions. SN COMPUT. SCI. 2 , 461 (2021). https://doi.org/10.1007/s42979-021-00848-6

Download citation

Received : 04 August 2021

Accepted : 01 September 2021

Published : 16 September 2021

DOI : https://doi.org/10.1007/s42979-021-00848-6

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Machine learning (ML)
  • Transfer learning (TL)
  • Interpretable ML
  • Electronic health records (EHR)
  • Security framework
  • Privacy framework
  • Find a journal
  • Publish with us
  • Track your research

medRxiv

Identifying Psychosis Episodes in Psychiatric Admission Notes via Rule-based Methods, Machine Learning, and Pre-Trained Language Models

  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Yining Hua
  • For correspondence: [email protected]
  • Info/History
  • Preview PDF

Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and the individuals' often diminished insight into their condition. Existing efforts leveraging Electronic Health Records (EHRs) to retrospectively identify psychosis typically rely on structured data, such as medical codes and patient demographics, which frequently lack essential information. Addressing these challenges, our study leverages Natural Language Processing (NLP) algorithms to analyze psychiatric admission notes for the diagnosis of psychosis, providing a detailed evaluation of rule-based algorithms, machine learning models, and pre-trained language models. Additionally, the study investigates the effectiveness of employing keywords to streamline extensive note data before training and evaluating the models. Analyzing 4,617 initial psychiatric admission notes (1,196 cases of psychosis versus 3,433 controls) from 2005 to 2019, we discovered that the XGBoost classifier employing Term Frequency-Inverse Document Frequency (TF-IDF) features derived from notes pre-selected by expert-curated keywords, attained the highest performance with an F1 score of 0.8881 (AUROC [95% CI]: 0.9725 [0.9717, 0.9733]). BlueBERT demonstrated comparable efficacy an F1 score of 0.8841 (AUROC [95% CI]: 0.97 [0.9580, 0.9820]) on the same set of notes. Both models markedly outperformed traditional International Classification of Diseases (ICD) code-based detection methods from discharge summaries, which had an F1 score of 0.7608, thus improving the margin by 0.12. Furthermore, our findings indicate that keyword pre-selection markedly enhances the performance of both machine learning and pre-trained language models. This study illustrates the potential of NLP techniques to improve psychosis detection within admission notes and aims to serve as a foundational reference for future research on applying NLP for psychosis identification in EHR notes.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

NIMH grant R01MH122427.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study was conducted at McLean Hospital, a psychiatric hospital in Belmont, Massachusetts, and a member of the Mass General Brigham (MGB) integrated healthcare system. All study activities were conducted with the approval of the MGB Human Research Committee (IRB) with a waiver of informed consent according to 54 U.S. Code of Federal Regulations 46.116.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Data Availability

We received a data sharing exemption for the NIH grant funding this study, as the Mass General Brigham Institutional Review Board deemed that individual subject-level data could not be shared as patients did not provide informed consent in accordance with waiver of consent for study, as well as increased protection of information on vulnerable populations, including patients with psychiatric disorders.

https://github.com/ningkko/psychosis_identification/tree/main

View the discussion thread.

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Reddit logo

Citation Manager Formats

  • EndNote (tagged)
  • EndNote 8 (xml)
  • RefWorks Tagged
  • Ref Manager
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Psychiatry and Clinical Psychology
  • Addiction Medicine (313)
  • Allergy and Immunology (614)
  • Anesthesia (157)
  • Cardiovascular Medicine (2225)
  • Dentistry and Oral Medicine (275)
  • Dermatology (199)
  • Emergency Medicine (366)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (789)
  • Epidemiology (11509)
  • Forensic Medicine (10)
  • Gastroenterology (673)
  • Genetic and Genomic Medicine (3511)
  • Geriatric Medicine (336)
  • Health Economics (609)
  • Health Informatics (2260)
  • Health Policy (906)
  • Health Systems and Quality Improvement (855)
  • Hematology (332)
  • HIV/AIDS (738)
  • Infectious Diseases (except HIV/AIDS) (13094)
  • Intensive Care and Critical Care Medicine (747)
  • Medical Education (355)
  • Medical Ethics (98)
  • Nephrology (383)
  • Neurology (3294)
  • Nursing (189)
  • Nutrition (502)
  • Obstetrics and Gynecology (643)
  • Occupational and Environmental Health (643)
  • Oncology (1735)
  • Ophthalmology (516)
  • Orthopedics (205)
  • Otolaryngology (283)
  • Pain Medicine (219)
  • Palliative Medicine (65)
  • Pathology (430)
  • Pediatrics (994)
  • Pharmacology and Therapeutics (415)
  • Primary Care Research (394)
  • Psychiatry and Clinical Psychology (3026)
  • Public and Global Health (5945)
  • Radiology and Imaging (1210)
  • Rehabilitation Medicine and Physical Therapy (706)
  • Respiratory Medicine (803)
  • Rheumatology (363)
  • Sexual and Reproductive Health (343)
  • Sports Medicine (307)
  • Surgery (381)
  • Toxicology (50)
  • Transplantation (169)
  • Urology (140)

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Perspective
  • Published: 06 March 2024

Artificial intelligence and illusions of understanding in scientific research

  • Lisa Messeri   ORCID: orcid.org/0000-0002-0964-123X 1   na1 &
  • M. J. Crockett   ORCID: orcid.org/0000-0001-8800-410X 2 , 3   na1  

Nature volume  627 ,  pages 49–58 ( 2024 ) Cite this article

19k Accesses

3 Citations

699 Altmetric

Metrics details

  • Human behaviour
  • Interdisciplinary studies
  • Research management
  • Social anthropology

Scientists are enthusiastically imagining ways in which artificial intelligence (AI) tools might improve research. Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings. But proposed AI solutions can also exploit our cognitive limitations, making us vulnerable to illusions of understanding in which we believe we understand more about the world than we actually do. Such illusions obscure the scientific community’s ability to see the formation of scientific monocultures, in which some types of methods, questions and viewpoints come to dominate alternative approaches, making science less innovative and more vulnerable to errors. The proliferation of AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less. By analysing the appeal of these tools, we provide a framework for advancing discussions of responsible knowledge production in the age of AI.

This is a preview of subscription content, access via your institution

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

24,99 € / 30 days

cancel any time

Subscribe to this journal

Receive 51 print issues and online access

185,98 € per year

only 3,65 € per issue

Rent or buy this article

Prices vary by article type

Prices may be subject to local taxes which are calculated during checkout

machine learning in healthcare research papers pdf

Similar content being viewed by others

machine learning in healthcare research papers pdf

Nobel Turing Challenge: creating the engine for scientific discovery

Hiroaki Kitano

machine learning in healthcare research papers pdf

Accelerating science with human-aware artificial intelligence

Jamshid Sourati & James A. Evans

machine learning in healthcare research papers pdf

On scientific understanding with artificial intelligence

Mario Krenn, Robert Pollice, … Alán Aspuru-Guzik

Crabtree, G. Self-driving laboratories coming of age. Joule 4 , 2538–2541 (2020).

Article   CAS   Google Scholar  

Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620 , 47–60 (2023). This review explores how AI can be incorporated across the research pipeline, drawing from a wide range of scientific disciplines .

Article   CAS   PubMed   ADS   Google Scholar  

Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 27 , 597–600 (2023).

Article   PubMed   Google Scholar  

Grossmann, I. et al. AI and the transformation of social science research. Science 380 , 1108–1109 (2023). This forward-looking article proposes a variety of ways to incorporate generative AI into social-sciences research .

Gil, Y. Will AI write scientific papers in the future? AI Mag. 42 , 3–15 (2022).

Google Scholar  

Kitano, H. Nobel Turing Challenge: creating the engine for scientific discovery. npj Syst. Biol. Appl. 7 , 29 (2021).

Article   PubMed   PubMed Central   Google Scholar  

Benjamin, R. Race After Technology: Abolitionist Tools for the New Jim Code (Oxford Univ. Press, 2020). This book examines how social norms about race become embedded in technologies, even those that are focused on providing good societal outcomes .

Broussard, M. More Than a Glitch: Confronting Race, Gender, and Ability Bias in Tech (MIT Press, 2023).

Noble, S. U. Algorithms of Oppression: How Search Engines Reinforce Racism (New York Univ. Press, 2018).

Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: can language models be too big? in Proc. 2021 ACM Conference on Fairness, Accountability, and Transparency 610–623 (Association for Computing Machinery, 2021). One of the first comprehensive critiques of large language models, this article draws attention to a host of issues that ought to be considered before taking up such tools .

Crawford, K. Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence (Yale Univ. Press, 2021).

Johnson, D. G. & Verdicchio, M. Reframing AI discourse. Minds Mach. 27 , 575–590 (2017).

Article   Google Scholar  

Atanasoski, N. & Vora, K. Surrogate Humanity: Race, Robots, and the Politics of Technological Futures (Duke Univ. Press, 2019).

Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s large language models. Proc. Natl Acad. Sci. USA 120 , e2215907120 (2023).

Kidd, C. & Birhane, A. How AI can distort human beliefs. Science 380 , 1222–1223 (2023).

Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev. Phys. 5 , 277–280 (2023).

Kapoor, S. & Narayanan, A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4 , 100804 (2023).

Hullman, J., Kapoor, S., Nanayakkara, P., Gelman, A. & Narayanan, A. The worst of both worlds: a comparative analysis of errors in learning from data in psychology and machine learning. In Proc. 2022 AAAI/ACM Conference on AI, Ethics, and Society (eds Conitzer, V. et al.) 335–348 (Association for Computing Machinery, 2022).

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1 , 206–215 (2019). This paper articulates the problems with attempting to explain AI systems that lack interpretability, and advocates for building interpretable models instead .

Crockett, M. J., Bai, X., Kapoor, S., Messeri, L. & Narayanan, A. The limitations of machine learning models for predicting scientific replicability. Proc. Natl Acad. Sci. USA 120 , e2307596120 (2023).

Article   CAS   PubMed   PubMed Central   Google Scholar  

Lazar, S. & Nelson, A. AI safety on whose terms? Science 381 , 138 (2023).

Article   PubMed   ADS   Google Scholar  

Collingridge, D. The Social Control of Technology (St Martin’s Press, 1980).

Wagner, G., Lukyanenko, R. & Paré, G. Artificial intelligence and the conduct of literature reviews. J. Inf. Technol. 37 , 209–226 (2022).

Hutson, M. Artificial-intelligence tools aim to tame the coronavirus literature. Nature https://doi.org/10.1038/d41586-020-01733-7 (2020).

Haas, Q. et al. Utilizing artificial intelligence to manage COVID-19 scientific evidence torrent with Risklick AI: a critical tool for pharmacology and therapy development. Pharmacology 106 , 244–253 (2021).

Article   CAS   PubMed   Google Scholar  

Müller, H., Pachnanda, S., Pahl, F. & Rosenqvist, C. The application of artificial intelligence on different types of literature reviews – a comparative study. In 2022 International Conference on Applied Artificial Intelligence (ICAPAI) https://doi.org/10.1109/ICAPAI55158.2022.9801564 (Institute of Electrical and Electronics Engineers, 2022).

van Dinter, R., Tekinerdogan, B. & Catal, C. Automation of systematic literature reviews: a systematic literature review. Inf. Softw. Technol. 136 , 106589 (2021).

Aydın, Ö. & Karaarslan, E. OpenAI ChatGPT generated literature review: digital twin in healthcare. In Emerging Computer Technologies 2 (ed. Aydın, Ö.) 22–31 (İzmir Akademi Dernegi, 2022).

AlQuraishi, M. AlphaFold at CASP13. Bioinformatics 35 , 4862–4865 (2019).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Article   CAS   PubMed   PubMed Central   ADS   Google Scholar  

Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Computat. Sci. 3 , 382–392 (2023).

Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15 , 1120–1127 (2016).

Krenn, M. et al. On scientific understanding with artificial intelligence. Nat. Rev. Phys. 4 , 761–769 (2022).

Extance, A. How AI technology can tame the scientific literature. Nature 561 , 273–274 (2018).

Hastings, J. AI for Scientific Discovery (CRC Press, 2023). This book reviews current and future incorporation of AI into the scientific research pipeline .

Ahmed, A. et al. The future of academic publishing. Nat. Hum. Behav. 7 , 1021–1026 (2023).

Gray, K., Yam, K. C., Zhen’An, A. E., Wilbanks, D. & Waytz, A. The psychology of robots and artificial intelligence. In The Handbook of Social Psychology (eds Gilbert, D. et al.) (in the press).

Argyle, L. P. et al. Out of one, many: using language models to simulate human samples. Polit. Anal. 31 , 337–351 (2023).

Aher, G., Arriaga, R. I. & Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 337–371 (JMLR.org, 2023).

Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120 , e2218523120 (2023).

Ornstein, J. T., Blasingame, E. N. & Truscott, J. S. How to train your stochastic parrot: large language models for political texts. Github , https://joeornstein.github.io/publications/ornstein-blasingame-truscott.pdf (2023).

He, S. et al. Learning to predict the cosmological structure formation. Proc. Natl Acad. Sci. USA 116 , 13825–13832 (2019).

Article   MathSciNet   CAS   PubMed   PubMed Central   ADS   Google Scholar  

Mahmood, F. et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans. Med. Imaging 39 , 3257–3267 (2020).

Teixeira, B. et al. Generating synthetic X-ray images of a person from the surface geometry. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 9059–9067 (Institute of Electrical and Electronics Engineers, 2018).

Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11 , 166 (2020).

Watts, D. J. A twenty-first century science. Nature 445 , 489 (2007).

boyd, d. & Crawford, K. Critical questions for big data. Inf. Commun. Soc. 15 , 662–679 (2012). This article assesses the ethical and epistemic implications of scientific and societal moves towards big data and provides a parallel case study for thinking about the risks of artificial intelligence .

Jolly, E. & Chang, L. J. The Flatland fallacy: moving beyond low–dimensional thinking. Top. Cogn. Sci. 11 , 433–454 (2019).

Yarkoni, T. & Westfall, J. Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12 , 1100–1122 (2017).

Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10 , 221–227 (2013).

Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40 , 932–937 (2022).

Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16 , 695–698 (2019).

Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2 , 688–701 (2023).

Karjus, A. Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence. Preprint at https://arxiv.org/abs/2309.14379 (2023).

Davies, A. et al. Advancing mathematics by guiding human intuition with AI. Nature 600 , 70–74 (2021).

Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D. & Griffiths, T. L. Using large-scale experiments and machine learning to discover theories of human decision-making. Science 372 , 1209–1214 (2021).

Ilyas, A. et al. Adversarial examples are not bugs, they are features. Preprint at https://doi.org/10.48550/arXiv.1905.02175 (2019)

Semel, B. M. Listening like a computer: attentional tensions and mechanized care in psychiatric digital phenotyping. Sci. Technol. Hum. Values 47 , 266–290 (2022).

Gil, Y. Thoughtful artificial intelligence: forging a new partnership for data science and scientific discovery. Data Sci. 1 , 119–129 (2017).

Checco, A., Bracciale, L., Loreti, P., Pinfield, S. & Bianchi, G. AI-assisted peer review. Humanit. Soc. Sci. Commun. 8 , 25 (2021).

Thelwall, M. Can the quality of published academic journal articles be assessed with machine learning? Quant. Sci. Stud. 3 , 208–226 (2022).

Dhar, P. Peer review of scholarly research gets an AI boost. IEEE Spectrum spectrum.ieee.org/peer-review-of-scholarly-research-gets-an-ai-boost (2020).

Heaven, D. AI peer reviewers unleashed to ease publishing grind. Nature 563 , 609–610 (2018).

Conroy, G. How ChatGPT and other AI tools could disrupt scientific publishing. Nature 622 , 234–236 (2023).

Nosek, B. A. et al. Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73 , 719–748 (2022).

Altmejd, A. et al. Predicting the replicability of social science lab experiments. PLoS ONE 14 , e0225826 (2019).

Yang, Y., Youyou, W. & Uzzi, B. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proc. Natl Acad. Sci. USA 117 , 10762–10768 (2020).

Youyou, W., Yang, Y. & Uzzi, B. A discipline-wide investigation of the replicability of psychology papers over the past two decades. Proc. Natl Acad. Sci. USA 120 , e2208863120 (2023).

Rabb, N., Fernbach, P. M. & Sloman, S. A. Individual representation in a community of knowledge. Trends Cogn. Sci. 23 , 891–902 (2019). This comprehensive review paper documents the empirical evidence for distributed cognition in communities of knowledge and the resultant vulnerabilities to illusions of understanding .

Rozenblit, L. & Keil, F. The misunderstood limits of folk science: an illusion of explanatory depth. Cogn. Sci. 26 , 521–562 (2002). This paper provided an empirical demonstration of the illusion of explanatory depth, and inspired a programme of research in cognitive science on communities of knowledge .

Hutchins, E. Cognition in the Wild (MIT Press, 1995).

Lave, J. & Wenger, E. Situated Learning: Legitimate Peripheral Participation (Cambridge Univ. Press, 1991).

Kitcher, P. The division of cognitive labor. J. Philos. 87 , 5–22 (1990).

Hardwig, J. Epistemic dependence. J. Philos. 82 , 335–349 (1985).

Keil, F. in Oxford Studies In Epistemology (eds Gendler, T. S. & Hawthorne, J.) 143–166 (Oxford Academic, 2005).

Weisberg, M. & Muldoon, R. Epistemic landscapes and the division of cognitive labor. Philos. Sci. 76 , 225–252 (2009).

Sloman, S. A. & Rabb, N. Your understanding is my understanding: evidence for a community of knowledge. Psychol. Sci. 27 , 1451–1460 (2016).

Wilson, R. A. & Keil, F. The shadows and shallows of explanation. Minds Mach. 8 , 137–159 (1998).

Keil, F. C., Stein, C., Webb, L., Billings, V. D. & Rozenblit, L. Discerning the division of cognitive labor: an emerging understanding of how knowledge is clustered in other minds. Cogn. Sci. 32 , 259–300 (2008).

Sperber, D. et al. Epistemic vigilance. Mind Lang. 25 , 359–393 (2010).

Wilkenfeld, D. A., Plunkett, D. & Lombrozo, T. Depth and deference: when and why we attribute understanding. Philos. Stud. 173 , 373–393 (2016).

Sparrow, B., Liu, J. & Wegner, D. M. Google effects on memory: cognitive consequences of having information at our fingertips. Science 333 , 776–778 (2011).

Fisher, M., Goddu, M. K. & Keil, F. C. Searching for explanations: how the internet inflates estimates of internal knowledge. J. Exp. Psychol. Gen. 144 , 674–687 (2015).

De Freitas, J., Agarwal, S., Schmitt, B. & Haslam, N. Psychological factors underlying attitudes toward AI tools. Nat. Hum. Behav. 7 , 1845–1854 (2023).

Castelo, N., Bos, M. W. & Lehmann, D. R. Task-dependent algorithm aversion. J. Mark. Res. 56 , 809–825 (2019).

Cadario, R., Longoni, C. & Morewedge, C. K. Understanding, explaining, and utilizing medical artificial intelligence. Nat. Hum. Behav. 5 , 1636–1642 (2021).

Oktar, K. & Lombrozo, T. Deciding to be authentic: intuition is favored over deliberation when authenticity matters. Cognition 223 , 105021 (2022).

Bigman, Y. E., Yam, K. C., Marciano, D., Reynolds, S. J. & Gray, K. Threat of racial and economic inequality increases preference for algorithm decision-making. Comput. Hum. Behav. 122 , 106859 (2021).

Claudy, M. C., Aquino, K. & Graso, M. Artificial intelligence can’t be charmed: the effects of impartiality on laypeople’s algorithmic preferences. Front. Psychol. 13 , 898027 (2022).

Snyder, C., Keppler, S. & Leider, S. Algorithm reliance under pressure: the effect of customer load on service workers. Preprint at SSRN https://doi.org/10.2139/ssrn.4066823 (2022).

Bogert, E., Schecter, A. & Watson, R. T. Humans rely more on algorithms than social influence as a task becomes more difficult. Sci Rep. 11 , 8028 (2021).

Raviv, A., Bar‐Tal, D., Raviv, A. & Abin, R. Measuring epistemic authority: studies of politicians and professors. Eur. J. Personal. 7 , 119–138 (1993).

Cummings, L. The “trust” heuristic: arguments from authority in public health. Health Commun. 29 , 1043–1056 (2014).

Lee, M. K. Understanding perception of algorithmic decisions: fairness, trust, and emotion in response to algorithmic management. Big Data Soc. 5 , https://doi.org/10.1177/2053951718756684 (2018).

Kissinger, H. A., Schmidt, E. & Huttenlocher, D. The Age of A.I. And Our Human Future (Little, Brown, 2021).

Lombrozo, T. Explanatory preferences shape learning and inference. Trends Cogn. Sci. 20 , 748–759 (2016). This paper provides an overview of philosophical theories of explanatory virtues and reviews empirical evidence on the sorts of explanations people find satisfying .

Vrantsidis, T. H. & Lombrozo, T. Simplicity as a cue to probability: multiple roles for simplicity in evaluating explanations. Cogn. Sci. 46 , e13169 (2022).

Johnson, S. G. B., Johnston, A. M., Toig, A. E. & Keil, F. C. Explanatory scope informs causal strength inferences. In Proc. 36th Annual Meeting of the Cognitive Science Society 2453–2458 (Cognitive Science Society, 2014).

Khemlani, S. S., Sussman, A. B. & Oppenheimer, D. M. Harry Potter and the sorcerer’s scope: latent scope biases in explanatory reasoning. Mem. Cognit. 39 , 527–535 (2011).

Liquin, E. G. & Lombrozo, T. Motivated to learn: an account of explanatory satisfaction. Cogn. Psychol. 132 , 101453 (2022).

Hopkins, E. J., Weisberg, D. S. & Taylor, J. C. V. The seductive allure is a reductive allure: people prefer scientific explanations that contain logically irrelevant reductive information. Cognition 155 , 67–76 (2016).

Weisberg, D. S., Hopkins, E. J. & Taylor, J. C. V. People’s explanatory preferences for scientific phenomena. Cogn. Res. Princ. Implic. 3 , 44 (2018).

Jerez-Fernandez, A., Angulo, A. N. & Oppenheimer, D. M. Show me the numbers: precision as a cue to others’ confidence. Psychol. Sci. 25 , 633–635 (2014).

Kim, J., Giroux, M. & Lee, J. C. When do you trust AI? The effect of number presentation detail on consumer trust and acceptance of AI recommendations. Psychol. Mark. 38 , 1140–1155 (2021).

Nguyen, C. T. The seductions of clarity. R. Inst. Philos. Suppl. 89 , 227–255 (2021). This article describes how reductive and quantitative explanations can generate a sense of understanding that is not necessarily correlated with actual understanding .

Fisher, M., Smiley, A. H. & Grillo, T. L. H. Information without knowledge: the effects of internet search on learning. Memory 30 , 375–387 (2022).

Eliseev, E. D. & Marsh, E. J. Understanding why searching the internet inflates confidence in explanatory ability. Appl. Cogn. Psychol. 37 , 711–720 (2023).

Fisher, M. & Oppenheimer, D. M. Who knows what? Knowledge misattribution in the division of cognitive labor. J. Exp. Psychol. Appl. 27 , 292–306 (2021).

Chromik, M., Eiband, M., Buchner, F., Krüger, A. & Butz, A. I think I get your point, AI! The illusion of explanatory depth in explainable AI. In 26th International Conference on Intelligent User Interfaces (eds Hammond, T. et al.) 307–317 (Association for Computing Machinery, 2021).

Strevens, M. No understanding without explanation. Stud. Hist. Philos. Sci. A 44 , 510–515 (2013).

Ylikoski, P. in Scientific Understanding: Philosophical Perspectives (eds De Regt, H. et al.) 100–119 (Univ. Pittsburgh Press, 2009).

Giudice, M. D. The prediction–explanation fallacy: a pervasive problem in scientific applications of machine learning. Preprint at PsyArXiv https://doi.org/10.31234/osf.io/4vq8f (2021).

Hofman, J. M. et al. Integrating explanation and prediction in computational social science. Nature 595 , 181–188 (2021). This paper highlights the advantages and disadvantages of explanatory versus predictive approaches to modelling, with a focus on applications to computational social science .

Shmueli, G. To explain or to predict? Stat. Sci. 25 , 289–310 (2010).

Article   MathSciNet   Google Scholar  

Hofman, J. M., Sharma, A. & Watts, D. J. Prediction and explanation in social systems. Science 355 , 486–488 (2017).

Logg, J. M., Minson, J. A. & Moore, D. A. Algorithm appreciation: people prefer algorithmic to human judgment. Organ. Behav. Hum. Decis. Process. 151 , 90–103 (2019).

Nguyen, C. T. Cognitive islands and runaway echo chambers: problems for epistemic dependence on experts. Synthese 197 , 2803–2821 (2020).

Breiman, L. Statistical modeling: the two cultures. Stat. Sci. 16 , 199–215 (2001).

Gao, J. & Wang, D. Quantifying the benefit of artificial intelligence for scientific research. Preprint at arxiv.org/abs/2304.10578 (2023).

Hanson, B. et al. Garbage in, garbage out: mitigating risks and maximizing benefits of AI in research. Nature 623 , 28–31 (2023).

Kleinberg, J. & Raghavan, M. Algorithmic monoculture and social welfare. Proc. Natl Acad. Sci. USA 118 , e2018340118 (2021). This paper uses formal modelling methods to demonstrate that when companies all rely on the same algorithm to make decisions (an algorithmic monoculture), the overall quality of those decisions is reduced because valuable options can slip through the cracks, even when the algorithm performs accurately for individual companies .

Article   MathSciNet   CAS   PubMed   PubMed Central   Google Scholar  

Hofstra, B. et al. The diversity–innovation paradox in science. Proc. Natl Acad. Sci. USA 117 , 9284–9291 (2020).

Hong, L. & Page, S. E. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proc. Natl Acad. Sci. USA 101 , 16385–16389 (2004).

Page, S. E. Where diversity comes from and why it matters? Eur. J. Soc. Psychol. 44 , 267–279 (2014). This article reviews research demonstrating the benefits of cognitive diversity and diversity in methodological approaches for problem solving and innovation .

Clarke, A. E. & Fujimura, J. H. (eds) The Right Tools for the Job: At Work in Twentieth-Century Life Sciences (Princeton Univ. Press, 2014).

Silva, V. J., Bonacelli, M. B. M. & Pacheco, C. A. Framing the effects of machine learning on science. AI Soc. https://doi.org/10.1007/s00146-022-01515-x (2022).

Sassenberg, K. & Ditrich, L. Research in social psychology changed between 2011 and 2016: larger sample sizes, more self-report measures, and more online studies. Adv. Methods Pract. Psychol. Sci. 2 , 107–114 (2019).

Simon, A. F. & Wilder, D. Methods and measures in social and personality psychology: a comparison of JPSP publications in 1982 and 2016. J. Soc. Psychol. https://doi.org/10.1080/00224545.2022.2135088 (2022).

Anderson, C. A. et al. The MTurkification of social and personality psychology. Pers. Soc. Psychol. Bull. 45 , 842–850 (2019).

Latour, B. in The Social After Gabriel Tarde: Debates and Assessments (ed. Candea, M.) 145–162 (Routledge, 2010).

Porter, T. M. Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton Univ. Press, 1996).

Lazer, D. et al. Meaningful measures of human society in the twenty-first century. Nature 595 , 189–196 (2021).

Knox, D., Lucas, C. & Cho, W. K. T. Testing causal theories with learned proxies. Annu. Rev. Polit. Sci. 25 , 419–441 (2022).

Barberá, P. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Polit. Anal. 23 , 76–91 (2015).

Brady, W. J., McLoughlin, K., Doan, T. N. & Crockett, M. J. How social learning amplifies moral outrage expression in online social networks. Sci. Adv. 7 , eabe5641 (2021).

Article   PubMed   PubMed Central   ADS   Google Scholar  

Barnes, J., Klinger, R. & im Walde, S. S. Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In Proc. 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (eds Balahur, A. et al.) 2–12 (Association for Computational Linguistics, 2017).

Gitelman, L. (ed.) “Raw Data” is an Oxymoron (MIT Press, 2013).

Breznau, N. et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc. Natl Acad. Sci. USA 119 , e2203150119 (2022). This study demonstrates how 73 research teams analysing the same dataset reached different conclusions about the relationship between immigration and public support for social policies, highlighting the subjectivity and uncertainty involved in analysing complex datasets .

Gillespie, T. in Media Technologies: Essays on Communication, Materiality, and Society (eds Gillespie, T. et al.) 167–194 (MIT Press, 2014).

Leonelli, S. Data-Centric Biology: A Philosophical Study (Univ. Chicago Press, 2016).

Wang, A., Kapoor, S., Barocas, S. & Narayanan, A. Against predictive optimization: on the legitimacy of decision-making algorithms that optimize predictive accuracy. ACM J. Responsib. Comput. , https://doi.org/10.1145/3636509 (2023).

Athey, S. Beyond prediction: using big data for policy problems. Science 355 , 483–485 (2017).

del Rosario Martínez-Ordaz, R. Scientific understanding through big data: from ignorance to insights to understanding. Possibility Stud. Soc. 1 , 279–299 (2023).

Nussberger, A.-M., Luo, L., Celis, L. E. & Crockett, M. J. Public attitudes value interpretability but prioritize accuracy in artificial intelligence. Nat. Commun. 13 , 5821 (2022).

Zittrain, J. in The Cambridge Handbook of Responsible Artificial Intelligence: Interdisciplinary Perspectives (eds. Voeneky, S. et al.) 176–184 (Cambridge Univ. Press, 2022). This article articulates the epistemic risks of prioritizing predictive accuracy over explanatory understanding when AI tools are interacting in complex systems.

Shumailov, I. et al. The curse of recursion: training on generated data makes models forget. Preprint at arxiv.org/abs/2305.17493 (2023).

Latour, B. Science In Action: How to Follow Scientists and Engineers Through Society (Harvard Univ. Press, 1987). This book provides strategies and approaches for thinking about science as a social endeavour .

Franklin, S. Science as culture, cultures of science. Annu. Rev. Anthropol. 24 , 163–184 (1995).

Haraway, D. Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem. Stud. 14 , 575–599 (1988). This article acknowledges that the objective ‘view from nowhere’ is unobtainable: knowledge, it argues, is always situated .

Harding, S. Objectivity and Diversity: Another Logic of Scientific Research (Univ. Chicago Press, 2015).

Longino, H. E. Science as Social Knowledge: Values and Objectivity in Scientific Inquiry (Princeton Univ. Press, 1990).

Daston, L. & Galison, P. Objectivity (Princeton Univ. Press, 2007). This book is a historical analysis of the shifting modes of ‘objectivity’ that scientists have pursued, arguing that objectivity is not a universal concept but that it shifts alongside scientific techniques and ambitions .

Prescod-Weinstein, C. Making Black women scientists under white empiricism: the racialization of epistemology in physics. Signs J. Women Cult. Soc. 45 , 421–447 (2020).

Mavhunga, C. What Do Science, Technology, and Innovation Mean From Africa? (MIT Press, 2017).

Schiebinger, L. The Mind Has No Sex? Women in the Origins of Modern Science (Harvard Univ. Press, 1991).

Martin, E. The egg and the sperm: how science has constructed a romance based on stereotypical male–female roles. Signs J. Women Cult. Soc. 16 , 485–501 (1991). This case study shows how assumptions about gender affect scientific theories, sometimes delaying the articulation of what might be considered to be more accurate descriptions of scientific phenomena .

Harding, S. Rethinking standpoint epistemology: What is “strong objectivity”? Centen. Rev. 36 , 437–470 (1992). In this article, Harding outlines her position on ‘strong objectivity’, by which clearly articulating one’s standpoint can lead to more robust knowledge claims .

Oreskes, N. Why Trust Science? (Princeton Univ. Press, 2019). This book introduces the reader to 20 years of scholarship in science and technology studies, arguing that the tools the discipline has for understanding science can help to reinstate public trust in the institution .

Rolin, K., Koskinen, I., Kuorikoski, J. & Reijula, S. Social and cognitive diversity in science: introduction. Synthese 202 , 36 (2023).

Hong, L. & Page, S. E. Problem solving by heterogeneous agents. J. Econ. Theory 97 , 123–163 (2001).

Sulik, J., Bahrami, B. & Deroy, O. The diversity gap: when diversity matters for knowledge. Perspect. Psychol. Sci. 17 , 752–767 (2022).

Lungeanu, A., Whalen, R., Wu, Y. J., DeChurch, L. A. & Contractor, N. S. Diversity, networks, and innovation: a text analytic approach to measuring expertise diversity. Netw. Sci. 11 , 36–64 (2023).

AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nat. Commun. 9 , 5163 (2018).

Campbell, L. G., Mehtani, S., Dozier, M. E. & Rinehart, J. Gender-heterogeneous working groups produce higher quality science. PLoS ONE 8 , e79147 (2013).

Nielsen, M. W., Bloch, C. W. & Schiebinger, L. Making gender diversity work for scientific discovery and innovation. Nat. Hum. Behav. 2 , 726–734 (2018).

Yang, Y., Tian, T. Y., Woodruff, T. K., Jones, B. F. & Uzzi, B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proc. Natl Acad. Sci. USA 119 , e2200841119 (2022).

Kozlowski, D., Larivière, V., Sugimoto, C. R. & Monroe-White, T. Intersectional inequalities in science. Proc. Natl Acad. Sci. USA 119 , e2113067119 (2022).

Fehr, C. & Jones, J. M. Culture, exploitation, and epistemic approaches to diversity. Synthese 200 , 465 (2022).

Nakadai, R., Nakawake, Y. & Shibasaki, S. AI language tools risk scientific diversity and innovation. Nat. Hum. Behav. 7 , 1804–1805 (2023).

National Academies of Sciences, Engineering, and Medicine et al. Advancing Antiracism, Diversity, Equity, and Inclusion in STEMM Organizations: Beyond Broadening Participation (National Academies Press, 2023).

Winner, L. Do artifacts have politics? Daedalus 109 , 121–136 (1980).

Eubanks, V. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (St. Martin’s Press, 2018).

Littmann, M. et al. Validity of machine learning in biology and medicine increased through collaborations across fields of expertise. Nat. Mach. Intell. 2 , 18–24 (2020).

Carusi, A. et al. Medical artificial intelligence is as much social as it is technological. Nat. Mach. Intell. 5 , 98–100 (2023).

Raghu, M. & Schmidt, E. A survey of deep learning for scientific discovery. Preprint at arxiv.org/abs/2003.11755 (2020).

Bishop, C. AI4Science to empower the fifth paradigm of scientific discovery. Microsoft Research Blog www.microsoft.com/en-us/research/blog/ai4science-to-empower-the-fifth-paradigm-of-scientific-discovery/ (2022).

Whittaker, M. The steep cost of capture. Interactions 28 , 50–55 (2021).

Liesenfeld, A., Lopez, A. & Dingemanse, M. Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators. In Proc. 5th International Conference on Conversational User Interfaces 1–6 (Association for Computing Machinery, 2023).

Chu, J. S. G. & Evans, J. A. Slowed canonical progress in large fields of science. Proc. Natl Acad. Sci. USA 118 , e2021636118 (2021).

Park, M., Leahey, E. & Funk, R. J. Papers and patents are becoming less disruptive over time. Nature 613 , 138–144 (2023).

Frith, U. Fast lane to slow science. Trends Cogn. Sci. 24 , 1–2 (2020). This article explains the epistemic risks of a hyperfocus on scientific productivity and explores possible avenues for incentivizing the production of higher-quality science on a slower timescale .

Stengers, I. Another Science is Possible: A Manifesto for Slow Science (Wiley, 2018).

Lake, B. M. & Baroni, M. Human-like systematic generalization through a meta-learning neural network. Nature 623 , 115–121 (2023).

Feinman, R. & Lake, B. M. Learning task-general representations with generative neuro-symbolic modeling. Preprint at arxiv.org/abs/2006.14448 (2021).

Schölkopf, B. et al. Toward causal representation learning. Proc. IEEE 109 , 612–634 (2021).

Mitchell, M. AI’s challenge of understanding the world. Science 382 , eadm8175 (2023).

Sartori, L. & Bocca, G. Minding the gap(s): public perceptions of AI and socio-technical imaginaries. AI Soc. 38 , 443–458 (2023).

Download references

Acknowledgements

We thank D. S. Bassett, W. J. Brady, S. Helmreich, S. Kapoor, T. Lombrozo, A. Narayanan, M. Salganik and A. J. te Velthuis for comments. We also thank C. Buckner and P. Winter for their feedback and suggestions.

Author information

These authors contributed equally: Lisa Messeri, M. J. Crockett

Authors and Affiliations

Department of Anthropology, Yale University, New Haven, CT, USA

Lisa Messeri

Department of Psychology, Princeton University, Princeton, NJ, USA

M. J. Crockett

University Center for Human Values, Princeton University, Princeton, NJ, USA

You can also search for this author in PubMed   Google Scholar

Contributions

The authors contributed equally to the research and writing of the paper.

Corresponding authors

Correspondence to Lisa Messeri or M. J. Crockett .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Cameron Buckner, Peter Winter and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article.

Messeri, L., Crockett, M.J. Artificial intelligence and illusions of understanding in scientific research. Nature 627 , 49–58 (2024). https://doi.org/10.1038/s41586-024-07146-0

Download citation

Received : 31 July 2023

Accepted : 31 January 2024

Published : 06 March 2024

Issue Date : 07 March 2024

DOI : https://doi.org/10.1038/s41586-024-07146-0

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

This article is cited by

Ai is no substitute for having something to say.

Nature Reviews Physics (2024)

Perché gli scienziati si fidano troppo dell'intelligenza artificiale - e come rimediare

Nature Italy (2024)

Why scientists trust AI too much — and what to do about it

Nature (2024)

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

machine learning in healthcare research papers pdf

This paper is in the following e-collection/theme issue:

Published on 18.3.2024 in Vol 10 (2024)

Predicting COVID-19 Vaccination Uptake Using a Small and Interpretable Set of Judgment and Demographic Variables: Cross-Sectional Cognitive Science Study

Authors of this article:

Author Orcid Image

Original Paper

  • Nicole L Vike 1 , PhD   ; 
  • Sumra Bari 1 , PhD   ; 
  • Leandros Stefanopoulos 2, 3 * , MSc   ; 
  • Shamal Lalvani 2 * , MSc   ; 
  • Byoung Woo Kim 1 * , MSc   ; 
  • Nicos Maglaveras 3 , PhD   ; 
  • Martin Block 4 , PhD   ; 
  • Hans C Breiter 1, 5 , MD   ; 
  • Aggelos K Katsaggelos 2, 6, 7 , PhD  

1 Department of Computer Science, University of Cincinnati, Cincinnati, OH, United States

2 Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States

3 School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece

4 Integrated Marketing Communications, Medill School, Northwestern University, Evanston, IL, United States

5 Department of Psychiatry, Massachusetts General Hospital, Harvard School of Medicine, Boston, MA, United States

6 Department of Computer Science, Northwestern University, Evanston, IL, United States

7 Department of Radiology, Northwestern University, Evanston, IL, United States

*these authors contributed equally

Corresponding Author:

Hans C Breiter, MD

Department of Computer Science

University of Cincinnati

2901 Woodside Drive

Cincinnati, OH, 45219

United States

Phone: 1 617 413 0953

Email: [email protected]

Background: Despite COVID-19 vaccine mandates, many chose to forgo vaccination, raising questions about the psychology underlying how judgment affects these choices. Research shows that reward and aversion judgments are important for vaccination choice; however, no studies have integrated such cognitive science with machine learning to predict COVID-19 vaccine uptake .

Objective: This study aims to determine the predictive power of a small but interpretable set of judgment variables using 3 machine learning algorithms to predict COVID-19 vaccine uptake and interpret what profile of judgment variables was important for prediction.

Methods: We surveyed 3476 adults across the United States in December 2021. Participants answered demographic, COVID-19 vaccine uptake (ie, whether participants were fully vaccinated), and COVID-19 precaution questions. Participants also completed a picture-rating task using images from the International Affective Picture System. Images were rated on a Likert-type scale to calibrate the degree of liking and disliking. Ratings were computationally modeled using relative preference theory to produce a set of graphs for each participant (minimum R 2 >0.8). In total, 15 judgment features were extracted from these graphs, 2 being analogous to risk and loss aversion from behavioral economics. These judgment variables, along with demographics, were compared between those who were fully vaccinated and those who were not. In total, 3 machine learning approaches (random forest, balanced random forest [BRF], and logistic regression) were used to test how well judgment, demographic, and COVID-19 precaution variables predicted vaccine uptake . Mediation and moderation were implemented to assess statistical mechanisms underlying successful prediction.

Results: Age, income, marital status, employment status, ethnicity, educational level, and sex differed by vaccine uptake (Wilcoxon rank sum and chi-square P <.001). Most judgment variables also differed by vaccine uptake (Wilcoxon rank sum P <.05). A similar area under the receiver operating characteristic curve (AUROC) was achieved by the 3 machine learning frameworks, although random forest and logistic regression produced specificities between 30% and 38% (vs 74.2% for BRF), indicating a lower performance in predicting unvaccinated participants. BRF achieved high precision (87.8%) and AUROC (79%) with moderate to high accuracy (70.8%) and balanced recall (69.6%) and specificity (74.2%). It should be noted that, for BRF, the negative predictive value was <50% despite good specificity. For BRF and random forest, 63% to 75% of the feature importance came from the 15 judgment variables. Furthermore, age, income, and educational level mediated relationships between judgment variables and vaccine uptake .

Conclusions: The findings demonstrate the underlying importance of judgment variables for vaccine choice and uptake, suggesting that vaccine education and messaging might target varying judgment profiles to improve uptake. These methods could also be used to aid vaccine rollouts and health care preparedness by providing location-specific details (eg, identifying areas that may experience low vaccination and high hospitalization).

Introduction

In early 2020, the COVID-19 pandemic wreaked havoc worldwide, triggering rapid vaccine development efforts. Despite federal, state, and workplace vaccination mandates, many individuals made judgments against COVID-19 vaccination, leading researchers to study the psychology underlying individual vaccination preferences and what might differentiate the framework for judgment between individuals who were fully vaccinated against COVID-19 and those who were not (henceforth referred to as vaccine uptake ). A better understanding of these differences in judgment may highlight targets for public messaging and education to increase the incidence of choosing vaccination.

Multiple studies have sought to predict an individual’s intention to receive a COVID-19 vaccine or specific variables underlying vaccination choices or mitigation strategies [ 1 - 7 ], but few have predicted vaccine uptake . One such study used 83 sociodemographic variables (with education, ethnicity, internet access, income, longitude, and latitude being the most important predictors) to predict vaccine uptake with 62% accuracy [ 8 ], confirming both the importance and limitations of these variables in prediction models. Other studies have compared demographic groups between vaccinated and nonvaccinated persons; Bulusu et al [ 9 ] found that young adults (aged 18-35 years), women, and those with higher levels of education had higher odds of being vaccinated. In a study of >12 million persons, the largest percentage of those who initiated COVID-19 vaccination were White, non-Hispanic women between the ages of 50 and 64 years [ 10 ]. Demographic variables are known to affect how individuals judge what is rewarding or aversive [ 11 , 12 ] yet are not themselves variables quantifying how individuals make judgments that then frame decisions.

Judgment reflects an individual’s preferences, or the variable extent to which they approach or avoid events in the world based on the rewarding or aversive effects of these events [ 13 - 15 ]. The definition of preference in psychology differs from that in economics. In psychology, preferences are associated with “wanting” and “liking” and are framed by judgments that precede decisions, which can be quantified through reinforcement reward or incentive reward tasks [ 12 , 16 - 21 ]. In economics, preferences are relations derived from consumer choice data (refer to the axioms of revealed preference [ 22 ]) and reflect choices or decisions based on judgments that place value on behavioral options. Economist Paul Samuelson noted that decisions are “assumed to be correlative to desire or want” [ 23 ]. In this study, we focused on a set of variables that frame judgment, with the presumption that judgments precede choices [ 12 , 20 ]. Variables that frame judgment can be derived from tasks using operant key-pressing tasks that quantify “wanting” [ 24 - 33 ] or simple rating tasks that are analogous to “liking” [ 20 , 34 ]. Both operant keypress and rating tasks measure variables that quantify the average (mean) magnitude ( K ), variance ( σ ), and pattern (ie, Shannon entropy [ H ]) of reward and aversion judgments [ 35 ]. We refer to this methodology and the multiple relationships between these variables and features based on their graphical relationships as relative preference theory (RPT; Figure 1 ) [ 18 , 36 ]. RPT has been shown to produce discrete, recurrent, robust, and scalable relationships between judgment variables [ 37 ] that produce mechanistic models for prediction [ 33 ], and which have demonstrated relationships to brain circuitry [ 24 - 27 , 30 ] and psychiatric illness [ 28 ]. Of the graphs produced for RPT, 2 appear to resemble graphs derived with different variables in economics, namely, prospect theory [ 38 ] and the mean-variance function for portfolio theory described by Markowitz [ 39 ]. Given this graphical resemblance, it is important to note that RPT functions quantifying value are not the same as standard representations of preference in economics. Behavioral economic variables such as loss aversion and risk aversion [ 38 , 40 - 51 ] are not to be interpreted in the same context given that both reflect biases and bounds to human rationality. In psychology, they are grounded in judgments that precede decisions, whereas in economics, they are grounded in consumer decisions themselves. Going forward, we will focus on judgment-based loss aversion, representing the overweighting of negative judgments relative to positive ones, and judgment-based risk aversion, representing the preference for small but certain assessments over larger but less certain ones (ie, assessments that have more variance associated with them) [ 38 , 40 - 51 ]. Herein, loss aversion and risk aversion refer to ratings or judgments that precede decisions.

A number of studies have described how risk aversion and other judgment variables are important for individual vaccine choices and hesitancies [ 52 - 58 ]. Hudson and Montelpare [ 54 ] found that risk aversion may promote vaccine adherence when people perceive contracting a disease as more dangerous or likely. Trueblood et al [ 52 ] noticed that those who were more risk seeking (as measured via a gamble ladder task) were more likely to receive the vaccine even if the vaccine was described as expedited. Wagner et al [ 53 ] described how risk misperceptions (when the actual risk does not align with the perceived risk) may result from a combination of cognitive biases, including loss aversion. A complex theoretical model using historical vaccine attitudes grounded in decision-making has also been proposed to predict COVID-19 vaccination, but this model has not yet been tested [ 59 ]. To our knowledge, no study has assessed how well a model comprising variables that reflect reward and aversion judgments predicts vaccine uptake .

machine learning in healthcare research papers pdf

Goal of This Study

Given the many vaccine-related issues that occurred during the COVID-19 pandemic (eg, vaccine shortages, hospital overload, and vaccination resistance or hesitancy), it is critical to develop methods that might improve planning around such shortcomings. Because judgment variables are fundamental to vaccine choice, they provide a viable target for predicting vaccine uptake . In addition, the rating methodology used to quantify variables of judgment is independent of methods quantifying vaccine uptake or intent to vaccinate, limiting response biases within the study data.

In this study, we aimed to predict COVID-19 vaccine uptake using judgment, demographic, and COVID-19 precaution (ie, behaviors minimizing potential exposure to COVID-19) variables using multiple machine learning algorithms, including logistic regression, random forest, and balanced random forest (BRF). BRF was hypothesized to perform best given its potential benefits with handling class imbalances [ 60 ], balancing both recall and specificity, and producing Gini scores that provide relative variable importance to prediction. In this study, the need for data imbalance techniques was motivated by the importance of the specificity metric, which would reflect the proportion of participants who did not receive full vaccination; without balancing, the model might not achieve similar recall and specificity values. When there is a large difference between recall and specificity, specificity might instead reflect the size of the minority class (those who did not receive full vaccination). In general, random forest approaches have been reported to have benefits over other approaches such as principal component analysis and neural networks, in which the N-dimensional feature space or layers (in the case of neural networks) are complex nonlinear functions, making it difficult to interpret variable importance and relationships to the outcome variable. To provide greater certainty about these assumptions, we performed logistic regression in parallel with random forest and BRF. The 3 machine learning approaches used a small feature set (<20) with interpretable relationships to the predicted variable. Such interpretations may not be achievable in big data approaches that use hundreds to thousands of variables that seemingly add little significance to the prediction models. Interpretation was facilitated by (1) the Gini importance criterion associated with BRF and random forest, which provided a profile of the judgment variables most important for prediction; and (2) mediation and moderation analyses that offered insights into statistical mechanisms among judgment variables, demographic (contextual) variables, and vaccine uptake . Determining whether judgment variables are predictive of COVID-19 vaccine uptake and defining which demographic variables facilitate this prediction presents a number of behavioral targets for vaccine education and messaging—and potentially identifies actionable targets for increasing vaccine uptake .

More broadly, the prediction of vaccine uptake may aid (1) vaccine supply chain and administration logistics by indicating areas that may need more or fewer vaccines, (2) targeted governmental messaging to locations with low predicted uptake, and (3) preparation of areas that may experience high cases of infection that could ultimately impact health care preparedness and infrastructure. The proposed method could also be applied to other mandated or government-recommended vaccines (eg, influenza and human papillomavirus) to facilitate the aforementioned logistics. Locally, vaccine uptake prediction could facilitate local messaging and prepare health care institutions for vaccine rollout and potential hospital overload. Nationally, prediction might inform public health officials and government communication bodies that are responsible for messaging and vaccine rollout with the goal of improving vaccine uptake and limiting infection and hospital overload.

Recruitment

Similar recruitment procedures for a smaller population-based study have been described previously [ 61 - 63 ]. In this study, participants were randomly sampled from the general US population using an email survey database used by Gold Research, Inc. Gold Research administered questionnaires in December 2021 using recruitment formats such as (1) customer databases from large companies that participate in revenue-sharing agreements, (2) social media, and (3) direct mail. Recruited participants followed a double opt-in consent procedure that included primary participation in the study as well as secondary use of anonymized, deidentified data (ie, all identifying information was removed by Gold Research before retrieval by the research group) in secondary analyses (refer to the Ethical Considerations section for more detail). During consent procedures, participants provided demographic information (eg, age, race, and sex) to ensure that the sampled participants adequately represented the US census at the time of the survey (December 2021). Respondents were also presented with repeated test questions to screen out those providing random and illogical responses or showing flatline or speeder behavior. Participants who provided such data were flagged, and their data were removed.

Because other components of the survey required an adequate sample of participants with mental health conditions, Gold Research oversampled 15% (60,000/400,000) of the sample for mental health conditions, and >400,000 respondents were contacted to complete the questionnaire. Gold Research estimated that, of the 400,000 participants, >300,000 (>75%) either did not respond or declined to participate. Of the remaining 25% (100,000/400,000) who clicked on the survey link, >50% (52,000/100,000) did not fully complete the questionnaire. Of the ≥48,000 participants who completed the survey (ie, ≥48,000/400,000, ≥12% of the initial pool of queried persons), those who did not clear data integrity assessments were omitted. Participants who met quality assurance procedures (refer to the following section) were selected, with a limit of 4000 to 4050 total participants.

Eligible participants were required to be aged between 18 and 70 years at the time of the survey, comprehend the English language, and have access to an electronic device (eg, laptop or smartphone).

Ethical Considerations

All participants provided informed consent, which included their primary participation in the study as well as the secondary use of their anonymized, deidentified data (ie, all identifying information removed by Gold Research before retrieval by the research group) in secondary analyses. This study was approved by the Northwestern University institutional review board (approval STU00213665) for the initial project start and later by the University of Cincinnati institutional review board (approval 2023-0164) as some Northwestern University investigators moved to the University of Cincinnati. All study approvals were in accordance with the Declaration of Helsinki. All participants were compensated with US $10 for taking part. Detailed survey instructions have been published previously [ 61 - 63 ].

Quality Assurance and Data Exclusion

Three additional quality assurance measures were used to flag nonadhering participants: (1) participants who indicated that they had ≥10 clinician-diagnosed illnesses (refer to Figure S1 in Multimedia Appendix 1 [ 18 , 33 , 36 , 64 - 68 ] for a list), (2) participants who showed minimal variance in the picture-rating task (ie, all pictures were rated the same or the ratings varied only by 1 point; refer to the Picture-Rating Task section), and (3) inconsistencies between educational level and years of education and participants who completed the questionnaire in <800 seconds.

Data from 4019 participants who passed the initial data integrity assessments were anonymized and then sent to the research team. Data were further excluded if the quantitative feature set derived from the picture-rating task was incomplete or if there were extreme outliers (refer to the RPT Framework section). Using these exclusion criteria, of the 4019 participants, 3476 (86.49%) were cleared for statistical analysis, representing 0.87% (3476/400,000) of the initial recruitment pool. A flowchart of participant exclusion is shown in Figure 2 .

machine learning in healthcare research papers pdf

Questionnaire

Participants were asked to report their age, sex, ethnicity, annual household income, marital status, employment status, and educational level. Participants were asked to report whether they had received the full vaccination ( yes or no responses). At the time of the survey, participants were likely to have received either 2 doses of the Pfizer or Moderna vaccine or 1 dose of the Johnson & Johnson vaccine as per the Centers for Disease Control and Prevention guidelines. Participants were also asked to respond yes (they routinely followed the precaution) or no (they did not routinely follow the precaution) to 4 COVID-19 precaution behaviors: mask wearing, social distancing, washing or sanitizing hands, and not gathering in large groups (refer to Tables S1 and S2 in Multimedia Appendix 1 for the complete questions and sample sizes, respectively). In addition, participants completed a picture-rating task at 2 points during the survey (refer to the Picture-Rating Task section).

Picture-Rating Task

A picture-rating task was administered to quantify participants’ degree of liking and disliking a validated picture set using pictures calibrated over large samples for their emotional intensity and valence [ 69 , 70 ]. Ratings from this task have been mathematically modeled using RPT to define graphical features of reward and aversion judgments. Each feature quantifies a core aspect of judgment, including risk aversion and loss aversion. Judgment variables have been shown to meet the criteria for lawfulness [ 37 ] that produce mechanistic models for prediction [ 33 ], with published relationships to brain circuitry [ 24 - 27 , 30 ] and psychiatric illness [ 28 ]. A more complete description of these judgment variables and their computation can be found in the RPT Framework section and in Table 1 .

For this task, participants were shown 48 unique color images from the International Affective Picture System [ 69 , 70 ]. A total of 6 picture categories were used: sports, disasters, cute animals, aggressive animals, nature (beach vs mountains), and men and women dressed minimally, with 8 pictures per category (48 pictures in total; Figure 1 A). These images have been used and validated in research on human emotion, attention, and preferences [ 69 , 70 ]. The images were displayed on the participants’ digital devices with a maximum size of 1204 × 768 pixels. Below each picture was a rating scale from −3 ( dislike very much ) to +3 ( like very much ), where 0 indicated indifference ( Figure 1 A). While there was no time limit for selecting a picture rating, participants were asked to rate the images as quickly as possible and use their first impression. Once a rating was selected, the next image was displayed.

RPT Framework

Ratings from the picture-rating task were analyzed using an RPT framework. This framework fits approach and avoidance curves and derives mathematical features from graphical plots ( Figures 1 B-1D). These methods have been described at length in prior work and are briefly described in this section [ 11 , 18 , 33 , 36 ]. More complete descriptions and quality assurance procedures can be found in Multimedia Appendix 1 .

At least 15 judgment variables can be mathematically derived from this framework and are psychologically interpretable; they have been validated using both operant keypress [ 9 , 25 - 27 ] and picture-rating tasks [ 11 , 34 ]. The 15 judgment variables are loss aversion, risk aversion, loss resilience, ante, insurance, peak positive risk, peak negative risk, reward tipping point, aversion tipping point, total reward risk, total aversion risk, reward-aversion trade-off, trade-off range, reward-aversion consistency, and consistency range. Loss aversion, risk aversion, loss resilience, ante, and insurance are derived from the logarithmic or power-law fit of mean picture ratings ( K ) versus entropy of ratings ( H ); this is referred to as the value function ( Figure 1 B). Peak positive risk, peak negative risk, reward tipping point, aversion tipping point, total reward risk, and total aversion risk are derived from the quadratic fit of K versus the SD of picture ratings ( σ ); this is referred to as the limit function ( Figure 1 C). Risk aversion trade-off, trade-off range, risk aversion consistency, and consistency range are derived from the radial fit of the pattern of avoidance judgments ( H − ) versus the pattern of approach judgments ( H + ); this is referred to as the trade-off function ( Figure 1 D). Value (Figure S2A in Multimedia Appendix 1 ), limit (Figure S2B in Multimedia Appendix 1 ), and trade-off (Figure S2C in Multimedia Appendix 1 ) functions were plotted for 500 randomly sampled participants, and nonlinear curve fits were assessed for goodness of fit, yielding R 2 , adjusted R 2 , and the associated F statistic for all participants (Figure S2D in Multimedia Appendix 1 ). Only the logarithmic and quadratic fits are listed in Table S3 in Multimedia Appendix 1 . Each feature describes a quantitative component of a participant’s reward and aversion judgment (refer to Table 1 for abbreviated descriptions and Multimedia Appendix 1 for complete descriptions). Collectively, the 15 RPT features will be henceforth referred to as “judgment variables.” The summary statistics for these variables can be found in Table S3 in Multimedia Appendix 1 .

Statistical and Machine Learning Analyses

Wilcoxon rank sum tests, chi-square tests, and Gini importance plotting were performed in Stata (version 17; StataCorp) [ 72 ]. Machine learning algorithms were run in Python (version 3.9; Python Software Foundation) [ 73 ], where the scikit-learn (version 1.2.2) [ 74 ] and imbalanced-learn (version 0.10.1) [ 75 ] libraries were used. Post hoc mediation and moderation analyses were performed in R (version 4.2.0; R Foundation for Statistical Computing) [ 76 ].

Demographic and Judgment Variable Differences by Vaccination Uptake

Each of the 7 demographic variables (age, income, marital status, employment status, ethnicity, educational level, and sex) was assessed for differences using yes or no responses to receiving the full COVID-19 vaccination (2525/3476, 72.64% yes responses and 951/3476, 27.36% no responses), henceforth referred to as vaccine uptake . Ordinal (income and educational level) and continuous (age) demographic variables were analyzed using the Wilcoxon rank sum test ( α =.05). Expected and actual rank sums were reported using Wilcoxon rank sum tests. Nominal variables were analyzed using the chi-square test ( α =.05). For significant chi-square results, demographic response percentages were computed to compare the fully vaccinated and not fully vaccinated groups.

Each of the 15 judgment variables was assessed for differences across yes or no responses to vaccine uptake using the Wilcoxon rank sum test ( α =.05). The expected and actual rank sums were reported. Significant results ( α <.05) were corrected for multiple comparisons using the Benjamini-Hochberg correction, and Q values of <0.05 ( Q Hoch ) were reported.

Prediction Analyses

Logistic regression, random forest, and BRF were used to predict vaccine uptake using judgment, demographic, and COVID-19 precaution variables. Gini plots were produced for random forest and BRF to determine the importance of the judgment variables in predicting COVID-19 vaccination. The BRF algorithm balances the samples by randomly downsampling the majority class at each bootstrapped iteration to match the number of samples in the minority class. To provide greater certainty about the results, random forest and logistic regression were performed to compare with BRF results.

Two sets of BRF, random forest, and logistic regression analyses were run: (1) with the 7 demographic variables and 15 judgment variables included as predictors and (2) with the 7 demographic variables, 15 judgment variables, and 4 COVID-19 precaution behaviors included as predictors. COVID-19 precaution behaviors included yes or no responses to wearing a mask, social distancing, washing hands, and avoiding large gatherings (refer to Table S1 in Multimedia Appendix 1 for more details). The sample sizes for yes or no responses to the COVID-19 precaution behavior questions are provided in Table S2 in Multimedia Appendix 1 . For all 3 models, 10-fold cross-validation was repeated 100 times to obtain performance metrics, where data were split for training (90%) and testing (10%) for each of the 10 iterations in cross-validation. The averages of the performance metrics were reported across 100 repeats of 10-fold cross-validation for the test sets. The reported metrics included accuracy, recall, specificity, negative predictive value (NPV), precision, and area under the receiver operating characteristic curve (AUROC). For BRF, the Python toolbox imbalanced-learn was used to build the classifier, where the training set for each iteration of cross-validation was downsampled but the testing set was unchanged (ie, imbalanced). That is, downsampling only occurred with the bootstrapped samples for training the model, and balancing was not performed on the testing set. The default number of estimators was 100, and the default number of tree splits was 10; the splits were created using the Gini criterion. In separate analyses, estimators were increased to 300, and splits were increased to 15 to test model performance. Using the scikit-learn library, the same procedures used for BRF were followed for random forest without downsampling. Logistic regression without downsampling was implemented with a maximum of 100 iterations and optimization using a limited-memory Broyden-Fletcher-Goldfarb-Shanno solver. For logistic regression, model coefficients with respective SEs, z statistics, P values, and 95% CIs were reported.

Relative feature importance based on the Gini criterion (henceforth referred to as Gini importance ) was determined from BRF and random forest using the .feature_importances_ attribute from scikit-learn, and results were reported as the mean decrease in the Gini score and plotted in Stata. To test model performance using only the top predictors, two additional sets of BRF analyses were run: (1) with the top 3 features as predictors and (2) with the top 3 features and 15 judgment variables as predictors.

Post Hoc Mediation and Moderation

Given the importance of both judgment variables and demographic variables (refer to the Results section), we evaluated post hoc how age, income, and educational level (ie, the top 3 predictors) might statistically influence the relationship between the 15 judgment variables and COVID-19 vaccine uptake . To identify statistical mechanisms influencing our prediction results, we used mediation and moderation, which can (1) determine the directionality between variables and (2) assess variable influence in statistical relationships. Mediation is used to determine whether one variable, the mediator, statistically improves the relationship between 2 other variables (independent variables [IVs] and dependent variables [DVs]) [ 77 - 80 ]. When mediating variables improve a relationship, the mediator is said to sit in the statistical pathway between the IVs and DVs [ 77 , 80 , 81 ]. Moderation is used to test whether the interaction between an IV and a moderating variable predicts a DV [ 81 , 82 ].

For mediation, primary and secondary mediations were performed. Primary mediations included each of the 15 judgment behaviors as the IV, each of the 3 demographic variables (age, income, and educational level) as the mediator, and vaccine uptake as the DV. Secondary mediations held the 15 judgment behaviors as the mediator, the 3 demographic variables as the IV, and vaccine uptake as the DV. For moderation, the moderating variable was each of the 3 demographic variables (age, income, and educational level), the IV was each of the 15 judgment behaviors, and the DV was vaccine uptake . The mathematical procedures for mediation and moderation can be found in Multimedia Appendix 1 .

Demographic Assessment

Of the 400,000 persons queried by Gold Research, Inc, 48,000 (12%) completed the survey, and 3476 (0.87%) survived all quality assurance procedures. Participants were predominately female, married, and White individuals; employed full time with some college education; and middle-aged (mean age 51.40, SD 14.92 years; Table 2 ). Of the 3476 participants, 2525 (72.64%) reported receiving a full dose of a COVID-19 vaccine, and 951 (27.36%) reported not receiving a full dose. Participants who indicated full vaccination were predominately female, married, White individuals, and retired; had some college education; and were older on average (mean age 54.19, SD 14.13 years) when compared to the total cohort. Participants who indicated that they did not receive the full vaccine were also predominately female, married, and White individuals. In contrast to those who received the full vaccination, those not fully vaccinated were predominately employed full time, high school graduates, and of average age (mean age 43.98, SD 14.45 years; median age 45, IQR 32-56 years) when compared to the total cohort. Table 2 summarizes the demographic group sample size percentages for the total cohort, those fully vaccinated, and those not fully vaccinated.

When comparing percentages between vaccination groups, a higher percentage of male individuals were fully vaccinated, and a higher percentage of female individuals were not fully vaccinated ( Table 2 ). In addition, a higher percentage of married, White and Asian or Pacific Islander, and retired individuals indicated receiving the full vaccine when compared to the percentages of those who did not receive the vaccine ( Table 2 ). Conversely, a higher percentage of single, African American, and unemployed individuals indicated not receiving the full vaccine ( Table 2 ).

Analysis of Machine Learning Features

Demographic variable differences by vaccine uptake.

Age, income level, and educational level significantly differed between those who did and did not receive the vaccine (Wilcoxon rank sum test α <.05; Table 3 ). Those who indicated full vaccination were, on average, older (median age 59 y), had a higher annual household income (median reported income level US $50,000-$75,000), and had higher levels of education (the median reported educational level was a bachelor’s degree).

Chi-square tests revealed that marital status, employment status, sex, and ethnicity also varied by full vaccine uptake (chi-square α <.05; Table 3 ).

a N/A: not applicable.

Judgment Variable Differences by Vaccine Uptake

In total, 10 of the 15 judgment variables showed nominal rank differences ( α <.05), and 9 showed significant rank differences after correction for multiple comparisons ( Q Hoch <0.05) between those who indicated full vaccination and those who indicated that they did not receive the full vaccination ( Table 4 ). The 10 features included loss aversion, risk aversion, loss resilience, ante, insurance, peak positive risk, peak negative risk, total reward risk, total aversion risk, and trade-off range. Those who indicated full vaccination exhibited lower loss aversion, ante, peak positive risk, peak negative risk, total reward risk, and total aversion risk as well as higher risk aversion, loss resilience, insurance, and trade-off range when compared to the expected rank sum. Those who did not receive the full vaccination exhibited lower risk aversion, loss resilience, insurance, and trade-off range and higher loss aversion, ante, peak positive risk, peak negative risk, total reward risk, and total aversion risk when compared to the expected rank sum.

Machine Learning Results: Predicting Vaccination Uptake

Prediction results.

With the inclusion of demographic and judgment variables, the BRF classifier with the highest accuracy (68.9%) and precision (86.7%) in predicting vaccine uptake resulted when the number of estimators was set to 300 and the number of splits was set to 10 ( Table 5 ). With the addition of 4 COVID-19 precaution behaviors, the BRF classifier with the highest accuracy (70.8%) and precision (87.8%) to predict vaccine uptake occurred when the number of estimators was set to 300 and the number of splits was set to 10. It is notable that specificity was consistently >72%, precision was >86%, and the AUROC was >75% but the NPV was consistently <50%. For random forest and logistic regression, recall and accuracy values were higher than those for BRF, but specificity was always <39%, indicating a lower performance in predicting those who did not receive the vaccine. Precision was also lower, yet the AUROC was consistent with that of the BRF results.

a A total of 15 judgment variables ( Table 4 ), 7 demographic variables ( Table 3 ), and 4 COVID-19 precaution behavior (covid_beh) variables (Table S1 in Multimedia Appendix 1 ) were included in balanced random forest, random forest, and logistic regression models to predict COVID-19 vaccine uptake . We used 10-fold cross-validation, where the data were split 90-10 for each of the 10 iterations.

b NPV: negative predictive value.

c AUROC: area under the receiver operating characteristic curve.

d BRF: balanced random forest.

e N/A: not applicable.

Feature Importance for BRF and Random Forest

Regarding BRF, Gini importance was highest for age, educational level, and income in both BRF classifiers (both without [ Figures 3 A and 3B] and with [ Figures 3 C and 3D] inclusion of the COVID-19 precaution behaviors; refer to the clusters outlined in red in Figures 3 B and 3D). For both BRF classifiers, the top 3 predictors (age, income, and educational level) had a combined effect of 23.4% on the Gini importance for prediction. Following these predictors, the 15 judgment variables had similar importance scores for both BRF classifiers (range 0.037-0.049; refer to the clusters outlined in black in Figures 3 B and 3D). These 15 predictors had a combined effect of 62.9% to 68.7% on the Gini importance for prediction, indicating that judgment variables were collectively the most important for prediction outcomes. The least important features for predicting vaccination status were demographic variables regarding employment status, marital status, ethnicity, sex, and the 4 COVID-19 precaution behaviors. These predictors only contributed 7.3% to the Gini importance for prediction. As a follow-up analysis, BRF analyses were run using the top 3 features from both the Gini importance plots (age, educational level, and income; Table S4 in Multimedia Appendix 1 ) and the top 3 features plus 15 judgment variables (Table S5 in Multimedia Appendix 1 ). The results did not outperform those presented in Table 5 .

For random forest, the Gini importance was highest for age and educational level ( Figure 4 ). These top 2 predictors had a combined effect of 16.5% to 16.8% for the 2 models ( Figures 4 A and 4C). Following these predictors, the 15 judgment variables and the income variable had similar Gini importance, with a combined effect of 69.4% to 75.5% for Gini importance. The least important predictors mirrored those of the BRF results.

machine learning in healthcare research papers pdf

Logistic Regression Model Statistics

Both model 1 (demographic and judgment variables) and model 2 (demographic, judgment, and COVID-19 precaution behavior variables) were significant ( P <.001). The model statistics are provided in Tables 6 (model 1) and 7 (model 2). In model 1, age, income, marital status, employment status, sex, educational level, ante, aversion tipping point, reward-aversion consistency, and consistency range were significant ( α <.05). In model 2, age, income, marital status, employment status, sex, educational level, risk aversion, ante, peak negative risk, mask wearing, and not gathering in large groups were significant ( α <.05).

a Overall model: P <.001; pseudo- R 2 =0.149; log-likelihood=−1736.8; log-likelihood null=−2039.7.

a Overall model: P <.001; pseudo- R 2 =0.206; log-likelihood=−1620.0; log-likelihood null=−2039.7.

Because judgment variables and demographic variables (age, income, and educational level) were important predictors, we evaluated post hoc whether demographics statistically mediated or moderated the relationship between each of the 15 judgment variables and binary responses to COVID-19 vaccination.

For primary mediations, age significantly mediated the statistical relationship between 11 judgment variables and vaccine uptake ( α <.05; Table 8 ), income mediated 8 relationships α < <.05; Table 8 ), and educational level mediated 9 relationships ( α <.05; Table 8 ). In total, 7 judgment variables overlapped across the 3 models: loss resilience, ante, insurance, peak positive risk, peak negative risk, risk aversion trade-off, and consistency range. Of these, 5 significantly differed between vaccine uptake (those fully vaccinated and those not): loss resilience, ante, insurance, peak positive risk, and peak negative risk ( Table 3 ). Thus, 2 judgment features did not differ by vaccine uptake but were connected with uptake by significant mediation.

For the secondary mediation analyses, 5 judgment variables mediated the statistical relationship between age and vaccine uptake ; these variables overlapped with the 11 findings of the primary mediation analyses. Furthermore, 4 judgment variables mediated the statistical relationship between income and vaccine uptake ; these variables overlapped with the 8 findings of the primary mediation analyses. Finally, 4 judgment variables mediated the statistical relationship between educational level and vaccine uptake ; these variables overlapped with the 9 findings of the primary mediation analyses. In all secondary analyses, approximately half of the judgment variables were involved in mediation as compared to the doubling of judgment variable numbers observed in the primary mediation analyses. In the secondary mediation analyses, the same 4 judgment variables were found in both primary and secondary mediation results, indicating a mixed mediation framework.

From the moderation analyses, only 2 interactions out of a potential 45 were observed. Age interacted with risk aversion trade-off, and income interacted with loss resilience to statistically predict vaccine uptake ( α <.05; Table 8 ). The 2 moderation results overlapped with the mediation results, indicating mixed mediation-moderation relationships [ 78 , 80 , 81 ].

Principal Findings

Relatively few studies have sought to predict COVID-19 vaccine uptake using machine learning approaches [ 8 , 59 ]. Given that a small set of studies has assessed the psychological basis that may underlie vaccine uptake and choices [ 6 , 52 , 53 , 56 , 58 , 59 , 83 ], but none have used computational cognition variables based on reward and aversion judgment to predict vaccine uptake , we sought to assess whether variables quantifying human judgment predicted vaccine uptake . This study found that 7 demographic and 15 judgment variables predicted vaccine uptake with balanced and moderate recall and specificity, moderate accuracy, high AUROC, and high precision using a BRF framework. Other machine learning approaches (random forest and logistic regression) produced higher accuracies but lower specificities, indicating a lower prediction of those who did not receive the vaccine. The BRF also had challenges predicting the negative class, as demonstrated by the relatively low NPV despite having higher specificity than random forest and logistic regression. Feature importance analyses from both BRF and random forest showed that the judgment variables collectively dominated the Gini importance scores. Furthermore, demographic variables acted as statistical mediators in the relationship between judgment variables and vaccine uptake . These mediation findings support the interpretation of the machine learning results that demographic factors, together with judgment variables, predict COVID-19 vaccine uptake .

Interpretation of Judgment Differences Between Vaccinated and Nonvaccinated Individuals

Those who were fully vaccinated had lower values for loss aversion, ante, peak positive risk, peak negative risk, total reward risk, and total aversion risk, along with higher values for risk aversion, loss resilience, insurance, and trade-off range (refer to Table 1 for variable descriptions). Lower loss aversion corresponds to less overweighting of bad outcomes relative to good ones [ 84 ] and a potential willingness to obtain a vaccine with uncertain outcomes. A lower ante suggests that individuals are less willing to engage in risky behaviors surrounding potential infection, which is also consistent with the 4 other judgment variables that define relationships between risk and value (peak positive risk, peak negative risk, total reward risk, and total aversion risk). In participants who indicated full vaccination, lower peak positive risk and peak negative risk were related to individuals having a lower risk that they must overcome to make a choice to either approach or avoid, as per the decision utility equation by Markowitz [ 39 , 71 ]. The lower total reward risk and total aversion risk indicate that the interactions between reward, aversion, and the risks associated with them did not scale significantly; namely, higher reward was not associated with higher risk, and higher negative outcomes were not associated with the uncertainty of them. For these participants, the ability of the vaccine to increase the probability of health and reduce the probability of harm from illness did not have to overcome high obstacles in their vaccine choice. Higher risk aversion in vaccinated participants suggests that these participants viewed contracting COVID-19 as a larger risk and, therefore, were more likely to receive the full dose. These findings are consistent with those of a study by Lepinteur et al [ 58 ], who found that risk-averse individuals were more likely to accept the COVID-19 vaccination, indicating that the perceived risk of contracting COVID-19 was greater than any risk from the vaccine. Hudson and Montelpare [ 54 ] also found that risk aversion may promote vaccine adherence when people perceive contracting a disease as more dangerous or likely. Higher loss resilience in the vaccinated group was also consistent with the perspective that vaccination would improve their resilience and act as a form of insurance against negative consequences. The higher trade-off range suggests that vaccinated individuals have a broader portfolio of preferences and are more adaptive to bad things occurring, whereas a lower trade-off indicates a restriction in preferences and less adaptability in those who did not receive the vaccine.

Comparison of Prediction Algorithms

When testing these judgment variables (with demographic and COVID-19 precaution behavior variables) in a BRF framework to predict vaccine uptake , we observed a high AUROC of 0.79, where an AUROC of 0.8 is often the threshold for excellent model performance in machine learning [ 85 , 86 ]. The similarity of our reported recall and specificity values with the BRF suggests a balance between predicting true positives and true negatives. The high precision indicates a high certainty in predicting those who were fully vaccinated. The BRF model was successful in identifying those who received the full vaccine (positive cases; indicated by high precision and moderate recall) and those who did not (negative cases; indicated by the specificity). However, NPV was low, indicating a higher rate of false prediction of those who did not receive a full dose counterbalanced by a higher specificity that reflects a higher rate of predicting true negatives. These observations are reflected in the moderate accuracy, which measures the number of correct predictions. A comparison of random forest, logistic regression, and BRF revealed that random forest and logistic regression models produced less balance between recall (high) and specificity (low), which could be interpreted as a bias toward predicting the majority class (ie, those who received the vaccine). That being said, the NPV for BRF was lower than that for random forest and logistic regression, where a low NPV indicates a low probability that those predicted to have not received the vaccine truly did not receive the vaccine when taking both classes into account. Together, the results from all 3 machine learning approaches reveal challenges in predicting the negative class (ie, those who did not receive the vaccine). Overall, the 3 models achieved high accuracy, recall, precision, and AUROC. BRF produced a greater balance between recall and specificity, and the outcome of the worst-performing metric (ie, NPV) was still higher than the specificities for the random forest and logistic regression models.

Feature Importance

Of the 3 prediction algorithms, random forest and BRF had very similar Gini importance results, whereas logistic regression elevated most demographic variables and a minority of judgment variables. This observation could be due to the large variance in each of the judgment variables, which could present challenges for achieving a good fit with logistic regression. In contrast, the demographic and COVID-19 precaution variables had low variance and could be more easily fit in a linear model, hence their significance in the logistic regression results. In comparison to logistic regression, decision trees (eg, BRF and random forest) use variable variance as additional information to optimize classification, potentially leading to a higher importance of judgment variables over most demographic and all COVID-19 precaution variables.

Focusing on the model with balanced recall and specificity (ie, the BRF classifiers [with and without COVID-19 precaution behaviors]), the top predictors were 3 demographic variables (age, income, and educational level), with distributions that varied by vaccine uptake in manners consistent with those of other reports. Namely, older individuals, those identifying as male and White individuals, and those who indicated a higher income and educational level corresponded to those who were or intended to be vaccinated [ 2 , 5 , 87 ]. Despite their saliency, these 3 variables together only contributed 23% to the prediction, corresponding to approximately one-third of the contribution from the 15 judgment variables (63%-69%). The individual Gini importance scores for the 15 judgment variables only ranged from 0.039 to 0.049 but were the dominant set of features behind the moderate accuracy, high precision, and high AUROC. The 18% difference between the accuracy and precision measures suggests that variables other than those used in this study may improve prediction, including contextual variables that may influence vaccine choices. Variables may include political affiliation [ 7 ], longitude and latitude [ 8 ], access to the internet [ 8 ], health literacy [ 54 ], and presence of underlying conditions [ 9 ]. Future work should seek to include these types of variables.

In the second BRF classifier, the 4 COVID-19 precaution behaviors only contributed 6.6% to the prediction. This low contribution could be due to these variables being binary, unlike the other demographic variables, which included a range of categories. In addition, COVID-19 precaution behaviors are specific to the context of the COVID-19 pandemic and do not promote interpretation beyond their specific context. The 15 judgment variables represent a contrast to this as they are empirically computed from a set of functions across many picture categories. An individual with higher risk aversion will generally tolerate higher amounts of uncertainty regarding a potential upside or gain as opposed to settling for what they have. This does not depend on what stimulus category they observe or the stimulus-response condition. Instead, it is a general feature of the bounds to their judgment and is part of what behavioral economists such as Kahneman consider as bounds to human rationality [ 84 ].

Mechanistic Relationships Between Judgment and Demographic Variables

The Gini score plots were clear sigmoid-like graphs ( Figure 3 ), with only 3 of the 7 demographic variables ranking above the judgment variables. This observation was consistent in both BRF classifiers (with and without COVID-19 precaution behaviors), raising the possibility of a statistically mechanistic relationship among the top 3 demographic variables, the 15 judgment variables, and vaccine uptake . Indeed, we observed 28 primary mediation effects and 13 secondary mediation effects in contrast to 2 moderation relationships, which also happened to overlap with mediation findings, suggesting mixed mediation-moderation relationships [ 81 , 88 ]. The observation that most judgment variables were significant in mediation relationships but not in moderation relationships argues that prediction depended on the directional relationship between judgment and demographic variables to predict vaccine uptake . Furthermore, there were more significant primary mediations (when judgment variables were the IVs) compared to secondary mediations, suggesting the importance of judgment variables as IVs and demographic variables as mediators. Mathematically, judgment variables (IVs) influenced vaccine uptake (DV), and this relationship was stronger when demographic variables were added to the equation. The 13 secondary mediations all overlapped with the 28 primary mediations, where demographic variables were IVs and judgment variables were mediators, suggesting that demographic variables influenced vaccine uptake (DV) and that this relationship became stronger with the addition of judgment variables. This overlap of primary and secondary mediations for 4 of the judgment variables suggests that both judgment and demographic variables influenced the choice of being vaccinated within a mixed mediation framework because adding either one of them to the mediation model regressions made the relationships stronger [ 49 ]. The lack of moderation results and a considerable number of overlapping primary and secondary mediation results imply that the relationship between judgment variables and vaccine uptake did not depend purely on their interaction with age, income, or educational level (ie, moderation) but, instead, depended on the direct effects of these 3 demographic variables to strengthen the relationship between judgment variables and vaccine uptake . This type of analysis of statistical mechanisms is helpful for understanding contextual effects on our biases and might be important for considering how best to target or message those with higher loss aversion, ante, peak positive risk, peak negative risk, total reward risk, and total aversion risk (ie, in those who were not fully vaccinated).

Model Utility

The developed model is automatable and may have applications in public health. The picture-rating task can be deployed on any smart device or computer, making it accessible to much of the US population or regional populations. The ratings from this task can be automatically processed, and the results can be stored in local or national databases. This method of data collection is novel in that persons cannot bias their responses as the rating task has no perceivable relation to vaccination choices. Government and public health bodies can access these data to determine predicted vaccine uptake rates locally or nationally, which can be used to (1) prepare vaccine rollouts and supply chain demand, (2) prepare health care institutions in areas that may experience low vaccine adherence and potentially higher infection rates, and (3) determine which areas may need more targeted messaging to appeal to specific judgment profiles. For use case 3, messaging about infection risks or precaution behaviors could be framed to address those with lower risk aversion, who, in this study, tended to forgo vaccination. Given that such individualized data would not be available a priori, it would be more plausible to collect data from similarly sized cohorts in geographic regions of concern to obtain regional judgment behavior profiles and, thus, target use cases 1 to 3. Further development of this model with different population samples might also improve our understanding of how certain judgment variables may be targeted with different types of messaging, offering a means to potentially improve vaccine uptake . This model might also be applied to other mandated or recommended vaccines such as those for influenza or human papillomavirus, ultimately improving preparation and messaging efforts. However, future work would be needed to model these varying vaccine choices.

Given the use of demographic variables in the proposed model, specific demographic populations could be assessed or considered for messaging. If particular demographic groups are predicted to have a low vaccine uptake rate, messaging can be targeted to those specific groups. For example, we observed that a higher percentage of female individuals were not fully vaccinated when compared to male individuals. This could be related to concerns about the COVID-19 vaccine affecting fertility or pregnancy. To improve uptake in this population, scientifically backed messaging could be used to confirm the safety of the vaccine in this context. Lower rates of vaccination have been reported in Black communities, which was also observed in this study. Researchers have identified targetable issues related to this observation, which include engagement of Black faith leaders and accessibility of vaccination clinics in Black communities, to name a few [ 89 ].

In summary, this model could be used to predict vaccine uptake at the local and national levels and further assess the demographic and judgment features that may underlie these choices.

Limitations

This study has a number of limitations that should be considered. First, there are the inherent limitations of using an internet survey—namely, the uncontrolled environment in which participants provide responses. Gold Research, Inc, and the research team applied stringent exclusion criteria, including the evaluation of the judgment graphs given that random responses produce graphs with extremely low R 2 fits (eg, <0.1). This was not the case in our cohort of 3476 participants, but this cannot perfectly exclude random or erroneous responses to other questionnaire components. Second, participants with mental health conditions were oversampled to meet the criteria for other survey components not discussed in this paper. This oversampling could potentially bias the results, and future work should use a general population sample to verify these findings. Third, demographic variability and the resulting confounds are inherent in population surveys, and other demographic factors not collected in this study may be important for prediction (eg, religion and family size). Future work might consider collecting a broader array of demographic factors to investigate and include in predictive modeling. Fourth, we used a limited set of 7 demographic variables and 15 judgment variables; however, a larger set of judgment variables is potentially computable and could be considered for future studies. There is also little information on how post–COVID-19 effects, including socioeconomic effects, affect COVID-19 vaccination choices.

Conclusions

To our knowledge, there has been minimal research on how biases in human judgment might contribute to the psychology underlying individual vaccination preferences and what differentiates individuals who were fully vaccinated against COVID-19 from those who were not. This population study of several thousand participants demonstrated that a small set of demographic variables and 15 judgment variables predicted vaccine uptake with moderate to high accuracy and high precision and AUROC, although a large range of specificities was achieved depending on the classification method used. In an age of big data machine learning approaches, this study provides an option for using fewer but more interpretable variables. Age, income, and educational level were independently the most important predictors of vaccine uptake , but judgment variables collectively dominated the importance rankings and contributed almost two-thirds to the prediction of COVID-19 vaccination for the BRF and random forest models. Age, income, and educational level significantly mediated the statistical relationship between judgment variables and vaccine uptake , indicating a statistically mechanistic relationship grounding the prediction results. These findings support the hypothesis that small sets of judgment variables might provide a target for vaccine education and messaging to improve uptake. Such education and messaging might also need to consider contextual variables (ie, age, income, and educational level) that mediate the effect of judgment variables on vaccine uptake . Judgment and demographic variables can be readily collected using any digital device, including smartphones, which are accessible worldwide. Further development and use of this model could (1) improve vaccine uptake , (2) better prepare vaccine rollouts and health care institutions, (3) improve messaging efforts, and (4) have applications for other mandated or government-recommended vaccines.

Acknowledgments

The authors thank Carol Ross, Angela Braggs-Brown, Tom Talavage, Eric Nauman, and Marc Cahay at the University of Cincinnati (UC) College of Engineering and Applied Sciences, who significantly impacted the transfer of research funding to UC. Funding for this work was provided in part to HCB by the Office of Naval Research (awards N00014-21-1-2216 and N00014-23-1-2396) and to HCB from a Jim Goetz donation to the UC College of Engineering and Applied Sciences. Finally, the authors thank the anonymous reviewers for their constructive input, which substantially improved the manuscript. The opinions expressed in this paper are those of the authors and are not necessarily representative of those of their respective institutions.

Data Availability

The data set and corresponding key used in this study are available in Multimedia Appendix 2 .

Conflicts of Interest

A provisional patent has been submitted by the following authors (NLV, SB, HCB, SL, LS, and AKK): “Methods of predicting vaccine uptake,” provisional application # 63/449,460.

Supplementary material.

  • Ezati Rad R, Kahnouji K, Mohseni S, Shahabi N, Noruziyan F, Farshidi H, et al. Predicting the COVID-19 vaccine receive intention based on the theory of reasoned action in the south of Iran. BMC Public Health. Feb 04, 2022;22(1):229. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Mewhirter J, Sagir M, Sanders R. Towards a predictive model of COVID-19 vaccine hesitancy among American adults. Vaccine. Mar 15, 2022;40(12):1783-1789. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Romate J, Rajkumar E, Greeshma R. Using the integrative model of behavioural prediction to understand COVID-19 vaccine hesitancy behaviour. Sci Rep. Jun 04, 2022;12(1):9344. [ CrossRef ] [ Medline ]
  • Kalam MA, Davis TPJ, Shano S, Uddin MN, Islam MA, Kanwagi R, et al. Exploring the behavioral determinants of COVID-19 vaccine acceptance among an urban population in Bangladesh: implications for behavior change interventions. PLoS One. 2021;16(8):e0256496. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Koesnoe S, Siddiq TH, Pelupessy DC, Yunihastuti E, Awanis GS, Widhani A, et al. Using integrative behavior model to predict COVID-19 vaccination intention among health care workers in Indonesia: a nationwide survey. Vaccines (Basel). May 04, 2022;10(5):719. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Hudson A, Hall PA, Hitchman SC, Meng G, Fong GT. Cognitive predictors of COVID-19 mitigation behaviors in vaccinated and unvaccinated general population members. Vaccine. Jun 19, 2023;41(27):4019-4026. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Dhanani LY, Franz B. A meta-analysis of COVID-19 vaccine attitudes and demographic characteristics in the United States. Public Health. Jun 2022;207:31-38. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Cheong Q, Au-Yeung M, Quon S, Concepcion K, Kong JD. Predictive modeling of vaccination uptake in US counties: a machine learning-based approach. J Med Internet Res. Nov 25, 2021;23(11):e33231. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bulusu A, Segarra C, Khayat L. Analysis of COVID-19 vaccine uptake among people with underlying chronic conditions in 2022: a cross-sectional study. SSM Popul Health. Jun 2023;22:101422. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Painter EM, Ussery EN, Patel A, Hughes MM, Zell ER, Moulia DL, et al. Demographic characteristics of persons vaccinated during the first month of the COVID-19 vaccination program - United States, December 14, 2020-January 14, 2021. MMWR Morb Mortal Wkly Rep. Feb 05, 2021;70(5):174-177. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Azcona EA, Kim BW, Vike NL, Bari S, Lalvani S, Stefanopoulos L, et al. Discrete, recurrent, and scalable patterns in human judgement underlie affective picture ratings. arXiv. Preprint posted online March 12, 2022. 2022.:2203.06448. [ FREE Full text ] [ CrossRef ]
  • Breiter HC, Block M, Blood AJ, Calder B, Chamberlain L, Lee N, et al. Redefining neuromarketing as an integrated science of influence. Front Hum Neurosci. Feb 12, 2014;8:1073. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Schneirla TC. An evolutionary and developmental theory of biphasic processes underlying approach and withdrawal. In: Jones MR, editor. Nebraska Symposium on Motivation. Lincoln, Nebraska. University of Nebraska Press; 1959;1-42.
  • Schneirla TC. Aspects of stimulation and organization in approach/withdrawal processes underlying vertebrate behavioral development. In: Advances in the Study of Behavior. Cambridge, MA. Academic Press; 1965;1-74.
  • Lewin K. Psycho-sociological problems of a minority group. J Personality. Mar 1935;3(3):175-187. [ CrossRef ]
  • Herrnstein RJ. Secondary reinforcement and rate of primary reinforcement. J Exp Anal Behav. Jan 1964;7(1):27-36. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Baum WM. On two types of deviation from the matching law: bias and undermatching. J Exp Anal Behav. Jul 1974;22(1):231-242. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kim BW, Kennedy DN, Lehár J, Lee MJ, Blood AJ, Lee S, et al. Recurrent, robust and scalable patterns underlie human approach and avoidance. PLoS One. May 26, 2010;5(5):e10613. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Berridge KC, Robinson TE. The mind of an addicted brain: neural sensitization of wanting versus liking. Curr Dir Psychol Sci. 1995;4(3):71-76. [ FREE Full text ] [ CrossRef ]
  • Montague PR. Free will. Curr Biol. Jul 22, 2008;18(14):R584-R585. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Dai X, Brendl CM, Ariely D. Wanting, liking, and preference construction. Emotion. Jun 2010;10(3):324-334. [ CrossRef ] [ Medline ]
  • Mas-Colell A, Whinston MD, Green JR. Microeconomic Theory. Oxford, UK. Oxford University Press; 1995.
  • Marshall A. Principles of Economics. Buffalo, NY. Prometheus Books; 1997.
  • Aharon I, Etcoff N, Ariely D, Chabris CF, O'Connor E, Breiter HC. Beautiful faces have variable reward value: fMRI and behavioral evidence. Neuron. Nov 08, 2001;32(3):537-551. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Viswanathan V, Lee S, Gilman JM, Kim BW, Lee N, Chamberlain L, et al. Age-related striatal BOLD changes without changes in behavioral loss aversion. Front Hum Neurosci. Apr 30, 2015;9:176. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Perlis RH, Holt DJ, Smoller JW, Blood AJ, Lee S, Kim BW, et al. Association of a polymorphism near CREB1 with differential aversion processing in the insula of healthy participants. Arch Gen Psychiatry. Aug 2008;65(8):882-892. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Gasic GP, Smoller JW, Perlis RH, Sun M, Lee S, Kim BW, et al. BDNF, relative preference, and reward circuitry responses to emotional communication. Am J Med Genet B Neuropsychiatr Genet. Sep 05, 2009;150B(6):762-781. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Makris N, Oscar-Berman M, Jaffin SK, Hodge SM, Kennedy DN, Caviness VS, et al. Decreased volume of the brain reward system in alcoholism. Biol Psychiatry. Aug 01, 2008;64(3):192-202. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Elman I, Ariely D, Mazar N, Aharon I, Lasko NB, Macklin ML, et al. Probing reward function in post-traumatic stress disorder with beautiful facial images. Psychiatry Res. Jun 30, 2005;135(3):179-183. [ CrossRef ] [ Medline ]
  • Strauss MM, Makris N, Aharon I, Vangel MG, Goodman J, Kennedy DN, et al. fMRI of sensitization to angry faces. Neuroimage. Jun 18, 2005;26(2):389-413. [ CrossRef ] [ Medline ]
  • Levy B, Ariely D, Mazar N, Chi W, Lukas S, Elman I. Gender differences in the motivational processing of facial beauty. Learn Motiv. May 2008;39(2):10.1016/j.lmot.2007.09.002. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Yamamoto R, Ariely D, Chi W, Langleben DD, Elman I. Gender differences in the motivational processing of babies are determined by their facial attractiveness. PLoS One. Jun 24, 2009;4(6):e6042. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Viswanathan V, Sheppard JP, Kim BW, Plantz CL, Ying H, Lee MJ, et al. A quantitative relationship between signal detection in attention and approach/avoidance behavior. Front Psychol. Feb 21, 2017;8:122. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Stefanopoulos L, Lalvani S, Kim BW, Vike NL, Bari S, Emanuel A, et al. Predicting depression history from a short reward/aversion task with behavioral economic features. In: Proceedings of the International Conference on Biomedical and Health Informatics 2022. 2022. Presented at: ICBHI 2022; November 24-26, 2022; Concepcion, Chile.
  • Shannon CE, Weaver W. The Mathematical Theory of Communication. Urbana, IL. The University of Illinois Press; 1949.
  • Livengood SL, Sheppard JP, Kim BW, Malthouse EC, Bourne JE, Barlow AE, et al. Keypress-based musical preference is both individual and lawful. Front Neurosci. May 02, 2017;11:136. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Feynman R, Wilczek F. The Character of Physical Law. New York, NY. Modern Library; 1965.
  • Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica. Mar 1979;47(2):263-292. [ CrossRef ]
  • Markowitz H. Portfolio selection*. J Finance. Apr 30, 2012;7(1):77-91. [ CrossRef ]
  • Nagaya K. Why and under what conditions does loss aversion emerge? Jpn Psychol Res. Oct 15, 2021;65(4):379-398. [ CrossRef ]
  • Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science. Sep 27, 1974;185(4157):1124-1131. [ CrossRef ] [ Medline ]
  • Macoveanu J, Rowe JB, Hornboll B, Elliott R, Paulson OB, Knudsen GM, et al. Serotonin 2A receptors contribute to the regulation of risk-averse decisions. Neuroimage. Dec 2013;83:35-44. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Macoveanu J, Rowe JB, Hornboll B, Elliott R, Paulson OB, Knudsen GM, et al. Playing it safe but losing anyway--serotonergic signaling of negative outcomes in dorsomedial prefrontal cortex in the context of risk-aversion. Eur Neuropsychopharmacol. Aug 2013;23(8):919-930. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Schultz W. Dopamine signals for reward value and risk: basic and recent data. Behav Brain Funct. Apr 23, 2010;6:24. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Manssuer L, Ding Q, Zhang Y, Gong H, Liu W, Yang R, et al. Risk and aversion coding in human habenula high gamma activity. Brain. Jun 01, 2023;146(6):2642-2653. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Thrailkill EA, DeSarno M, Higgins ST. Intersections between environmental reward availability, loss aversion, and delay discounting as potential risk factors for cigarette smoking and other substance use. Prev Med. Dec 2022;165(Pt B):107270. [ CrossRef ] [ Medline ]
  • Vrijen C, Hartman CA, Oldehinkel AJ. Reward-related attentional bias at age 16 predicts onset of depression during 9 years of follow-up. J Am Acad Child Adolesc Psychiatry. Mar 2019;58(3):329-338. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Hansen A, Turpyn CC, Mauro K, Thompson JC, Chaplin TM. Adolescent brain response to reward is associated with a bias toward immediate reward. Dev Neuropsychol. Aug 2019;44(5):417-428. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Morales S, Miller NV, Troller-Renfree SV, White LK, Degnan KA, Henderson HA, et al. Attention bias to reward predicts behavioral problems and moderates early risk to externalizing and attention problems. Dev Psychopathol. May 2020;32(2):397-409. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Wang S, Krajbich I, Adolphs R, Tsuchiya N. The role of risk aversion in non-conscious decision making. Front Psychol. 2012;3:50. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chen X, Voets S, Jenkinson N, Galea JM. Dopamine-dependent loss aversion during effort-based decision-making. J Neurosci. Jan 15, 2020;40(3):661-670. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Trueblood JS, Sussman AB, O’Leary D. The role of risk preferences in responses to messaging about COVID-19 vaccine take-up. Soc Psychol Personal Sci. Mar 11, 2021;13(1):311-319. [ CrossRef ]
  • Wagner CE, Prentice JA, Saad-Roy CM, Yang L, Grenfell BT, Levin SA, et al. Economic and behavioral influencers of vaccination and antimicrobial use. Front Public Health. Dec 21, 2020;8:614113. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Hudson A, Montelpare WJ. Predictors of vaccine hesitancy: implications for COVID-19 public health messaging. Int J Environ Res Public Health. Jul 29, 2021;18(15):8054. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Horne Z, Powell D, Hummel JE, Holyoak KJ. Countering antivaccination attitudes. Proc Natl Acad Sci U S A. Aug 18, 2015;112(33):10321-10324. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Benin AL, Wisler-Scher DJ, Colson E, Shapiro ED, Holmboe ES. Qualitative analysis of mothers' decision-making about vaccines for infants: the importance of trust. Pediatrics. May 2006;117(5):1532-1541. [ CrossRef ] [ Medline ]
  • Noyman-Veksler G, Greenberg D, Grotto I, Shahar G. Parents' malevolent personification of mass vaccination solidifies vaccine hesitancy. J Health Psychol. Oct 2021;26(12):2164-2172. [ CrossRef ] [ Medline ]
  • Lepinteur A, Borga LG, Clark AE, Vögele C, D'Ambrosio C. Risk aversion and COVID-19 vaccine hesitancy. Health Econ. Aug 2023;32(8):1659-1669. [ CrossRef ] [ Medline ]
  • Becchetti L, Candio P, Salustri F. Vaccine uptake and constrained decision making: the case of Covid-19. Soc Sci Med. Nov 2021;289:114410. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F. Learning from Imbalanced Data Sets. Cham, Switzerland. Springer International Publishing; 2018.
  • Bari S, Vike NL, Stetsiv K, Woodward S, Lalvani S, Stefanopoulos L, et al. The prevalence of psychotic symptoms, violent ideation, and disruptive behavior in a population with SARS-CoV-2 infection: preliminary study. JMIR Form Res. Aug 16, 2022;6(8):e36444. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Woodward SF, Bari S, Vike N, Lalvani S, Stetsiv K, Kim BW, et al. Anxiety, post-COVID-19 syndrome-related depression, and suicidal thoughts and behaviors in COVID-19 survivors: cross-sectional study. JMIR Form Res. Oct 25, 2022;6(10):e36656. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Vike NL, Bari S, Stetsiv K, Woodward S, Lalvani S, Stefanopoulos L, et al. The Relationship Between a History of High-risk and Destructive Behaviors and COVID-19 Infection: Preliminary Study. JMIR Form Res. 2023;7:e40821. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lee S, Lee MJ, Kim BW, Gilman JM, Kuster JK, Blood AJ, et al. The commonality of loss aversion across procedures and stimuli. PLoS One. Sep 22, 2015;10(9):e0135216. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Tversky A, Kahneman D. Advances in prospect theory: cumulative representation of uncertainty. J Risk Uncertainty. Oct 1992;5(4):297-323. [ CrossRef ]
  • Sheppard JP, Livengood SL, Kim BW, Lee MJ, Blood AJ. Connecting prospect and portfolio theories through relative preference behavior. In: Proceedings of the 14th Annual Meeting on Society for NeuroEconomics. 2016. Presented at: SNE '16; August 28-30, 2016;102-103; Berlin, Germany. URL: https://staging.neuroeconomics.org/wp-content/uploads/2016/08/AbstractBookSNE2016.pdf
  • Zhang R, Brennan TJ, Lo AW. The origin of risk aversion. Proc Natl Acad Sci U S A. Dec 16, 2014;111(50):17777-17782. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Iacobucci D. Mediation analysis and categorical variables: the final frontier. J Consum Psychol. Apr 12, 2012;22(4):582-594. [ CrossRef ]
  • Lang PJ, Bradley MM, Cuthbert MM. Motivated attention: affect, activation and action. In: Lang PJ, Simons RF, Balaban MT, editors. Attention and Orienting: Sensory and Motivational Processes. Hillsdale, NJ. Lawrence Erlbaum Associates; 1997;97-136.
  • Lang PJ, Bradley MM, Cuthbert BN. International affective picture system (IAPS): affective ratings of pictures and instruction manual. Technical Report A-8. University of Florida. 2008. URL: https://www.scirp.org/reference/referencespapers?referenceid=755311 [accessed 2021-06-01]
  • Markowitz H. The Utility of Wealth. J Political Econ. Apr 1952;60(2):151-158. [ CrossRef ]
  • StataCorp. Stata statistical software: release 17. StataCorp LLC. College Station, TX. StataCorp LLC; 2021. URL: https://www.stata.com/company/ [accessed 2024-02-23]
  • Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA. CreateSpace Independent Publishing Platform; 2009.
  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(85):2825-2830. [ FREE Full text ]
  • Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(17):1-5. [ FREE Full text ]
  • R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing. 2019. URL: https://www.r-project.org/ [accessed 2024-02-23]
  • David A. Kenny's homepage. David A. Kenny. URL: http://davidakenny.net/cm/mediate.htm [accessed 2024-02-22]
  • MacKinnon DP, Fairchild AJ, Fritz MS. Mediation analysis. Annu Rev Psychol. 2007;58:593-614. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • MacKinnon DP, Cheong J, Pirlott AG. Statistical mediation analysis. In: Cooper H, Camic PM, Long DL, Panter AT, Rindskopf D, Sher KJ, editors. APA Handbook of Research Methods in Psychology. Washington, DC. American Psychological Association; 2012;313-331.
  • Mackinnon DP. Introduction to Statistical Mediation Analysis. New York, NY. Lawrence Erlbaum Associates; 2008.
  • Baron RM, Kenny DA. The moderator–mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173-1182. [ CrossRef ]
  • Hayes AF. Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression-Based Approach. New York, NY. Guilford Press; 2018.
  • Charpentier CJ, Aylward J, Roiser JP, Robinson OJ. Enhanced risk aversion, but not loss aversion, in unmedicated pathological anxiety. Biol Psychiatry. Jun 15, 2017;81(12):1014-1022. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kahneman D, Tversky A. On the interpretation of intuitive probability: a reply to Jonathan Cohen. Cognition. Jan 1979;7(4):409-411. [ CrossRef ]
  • Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. Sep 2010;5(9):1315-1316. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Watt J, Borhani R, Katsaggelos AK. Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge, UK. Cambridge University Press; 2016.
  • Viswanath K, Bekalu M, Dhawan D, Pinnamaneni R, Lang J, McLoud R. Individual and social determinants of COVID-19 vaccine uptake. BMC Public Health. Apr 28, 2021;21(1):818. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Judd CM, Kenny DA, McClelland GH. Estimating and testing mediation and moderation in within-subject designs. Psychol Methods. Jun 2001;6(2):115-134. [ CrossRef ] [ Medline ]
  • Abdul-Mutakabbir JC, Casey S, Jews V, King A, Simmons K, Hogue MD, et al. A three-tiered approach to address barriers to COVID-19 vaccine delivery in the Black community. Lancet Glob Health. Jun 2021;9(6):e749-e750. [ FREE Full text ] [ CrossRef ] [ Medline ]

Abbreviations

Edited by A Mavragani; submitted 11.04.23; peer-reviewed by ME Visier Alfonso, L Lapp; comments to author 18.05.23; revised version received 08.08.23; accepted 10.01.24; published 18.03.24.

©Nicole L Vike, Sumra Bari, Leandros Stefanopoulos, Shamal Lalvani, Byoung Woo Kim, Nicos Maglaveras, Martin Block, Hans C Breiter, Aggelos K Katsaggelos. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 18.03.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.

A generative AI reset: Rewiring to turn potential into value in 2024

It’s time for a generative AI (gen AI) reset. The initial enthusiasm and flurry of activity in 2023 is giving way to second thoughts and recalibrations as companies realize that capturing gen AI’s enormous potential value is harder than expected .

With 2024 shaping up to be the year for gen AI to prove its value, companies should keep in mind the hard lessons learned with digital and AI transformations: competitive advantage comes from building organizational and technological capabilities to broadly innovate, deploy, and improve solutions at scale—in effect, rewiring the business  for distributed digital and AI innovation.

About QuantumBlack, AI by McKinsey

QuantumBlack, McKinsey’s AI arm, helps companies transform using the power of technology, technical expertise, and industry experts. With thousands of practitioners at QuantumBlack (data engineers, data scientists, product managers, designers, and software engineers) and McKinsey (industry and domain experts), we are working to solve the world’s most important AI challenges. QuantumBlack Labs is our center of technology development and client innovation, which has been driving cutting-edge advancements and developments in AI through locations across the globe.

Companies looking to score early wins with gen AI should move quickly. But those hoping that gen AI offers a shortcut past the tough—and necessary—organizational surgery are likely to meet with disappointing results. Launching pilots is (relatively) easy; getting pilots to scale and create meaningful value is hard because they require a broad set of changes to the way work actually gets done.

Let’s briefly look at what this has meant for one Pacific region telecommunications company. The company hired a chief data and AI officer with a mandate to “enable the organization to create value with data and AI.” The chief data and AI officer worked with the business to develop the strategic vision and implement the road map for the use cases. After a scan of domains (that is, customer journeys or functions) and use case opportunities across the enterprise, leadership prioritized the home-servicing/maintenance domain to pilot and then scale as part of a larger sequencing of initiatives. They targeted, in particular, the development of a gen AI tool to help dispatchers and service operators better predict the types of calls and parts needed when servicing homes.

Leadership put in place cross-functional product teams with shared objectives and incentives to build the gen AI tool. As part of an effort to upskill the entire enterprise to better work with data and gen AI tools, they also set up a data and AI academy, which the dispatchers and service operators enrolled in as part of their training. To provide the technology and data underpinnings for gen AI, the chief data and AI officer also selected a large language model (LLM) and cloud provider that could meet the needs of the domain as well as serve other parts of the enterprise. The chief data and AI officer also oversaw the implementation of a data architecture so that the clean and reliable data (including service histories and inventory databases) needed to build the gen AI tool could be delivered quickly and responsibly.

Never just tech

Creating value beyond the hype

Let’s deliver on the promise of technology from strategy to scale.

Our book Rewired: The McKinsey Guide to Outcompeting in the Age of Digital and AI (Wiley, June 2023) provides a detailed manual on the six capabilities needed to deliver the kind of broad change that harnesses digital and AI technology. In this article, we will explore how to extend each of those capabilities to implement a successful gen AI program at scale. While recognizing that these are still early days and that there is much more to learn, our experience has shown that breaking open the gen AI opportunity requires companies to rewire how they work in the following ways.

Figure out where gen AI copilots can give you a real competitive advantage

The broad excitement around gen AI and its relative ease of use has led to a burst of experimentation across organizations. Most of these initiatives, however, won’t generate a competitive advantage. One bank, for example, bought tens of thousands of GitHub Copilot licenses, but since it didn’t have a clear sense of how to work with the technology, progress was slow. Another unfocused effort we often see is when companies move to incorporate gen AI into their customer service capabilities. Customer service is a commodity capability, not part of the core business, for most companies. While gen AI might help with productivity in such cases, it won’t create a competitive advantage.

To create competitive advantage, companies should first understand the difference between being a “taker” (a user of available tools, often via APIs and subscription services), a “shaper” (an integrator of available models with proprietary data), and a “maker” (a builder of LLMs). For now, the maker approach is too expensive for most companies, so the sweet spot for businesses is implementing a taker model for productivity improvements while building shaper applications for competitive advantage.

Much of gen AI’s near-term value is closely tied to its ability to help people do their current jobs better. In this way, gen AI tools act as copilots that work side by side with an employee, creating an initial block of code that a developer can adapt, for example, or drafting a requisition order for a new part that a maintenance worker in the field can review and submit (see sidebar “Copilot examples across three generative AI archetypes”). This means companies should be focusing on where copilot technology can have the biggest impact on their priority programs.

Copilot examples across three generative AI archetypes

  • “Taker” copilots help real estate customers sift through property options and find the most promising one, write code for a developer, and summarize investor transcripts.
  • “Shaper” copilots provide recommendations to sales reps for upselling customers by connecting generative AI tools to customer relationship management systems, financial systems, and customer behavior histories; create virtual assistants to personalize treatments for patients; and recommend solutions for maintenance workers based on historical data.
  • “Maker” copilots are foundation models that lab scientists at pharmaceutical companies can use to find and test new and better drugs more quickly.

Some industrial companies, for example, have identified maintenance as a critical domain for their business. Reviewing maintenance reports and spending time with workers on the front lines can help determine where a gen AI copilot could make a big difference, such as in identifying issues with equipment failures quickly and early on. A gen AI copilot can also help identify root causes of truck breakdowns and recommend resolutions much more quickly than usual, as well as act as an ongoing source for best practices or standard operating procedures.

The challenge with copilots is figuring out how to generate revenue from increased productivity. In the case of customer service centers, for example, companies can stop recruiting new agents and use attrition to potentially achieve real financial gains. Defining the plans for how to generate revenue from the increased productivity up front, therefore, is crucial to capturing the value.

Upskill the talent you have but be clear about the gen-AI-specific skills you need

By now, most companies have a decent understanding of the technical gen AI skills they need, such as model fine-tuning, vector database administration, prompt engineering, and context engineering. In many cases, these are skills that you can train your existing workforce to develop. Those with existing AI and machine learning (ML) capabilities have a strong head start. Data engineers, for example, can learn multimodal processing and vector database management, MLOps (ML operations) engineers can extend their skills to LLMOps (LLM operations), and data scientists can develop prompt engineering, bias detection, and fine-tuning skills.

A sample of new generative AI skills needed

The following are examples of new skills needed for the successful deployment of generative AI tools:

  • data scientist:
  • prompt engineering
  • in-context learning
  • bias detection
  • pattern identification
  • reinforcement learning from human feedback
  • hyperparameter/large language model fine-tuning; transfer learning
  • data engineer:
  • data wrangling and data warehousing
  • data pipeline construction
  • multimodal processing
  • vector database management

The learning process can take two to three months to get to a decent level of competence because of the complexities in learning what various LLMs can and can’t do and how best to use them. The coders need to gain experience building software, testing, and validating answers, for example. It took one financial-services company three months to train its best data scientists to a high level of competence. While courses and documentation are available—many LLM providers have boot camps for developers—we have found that the most effective way to build capabilities at scale is through apprenticeship, training people to then train others, and building communities of practitioners. Rotating experts through teams to train others, scheduling regular sessions for people to share learnings, and hosting biweekly documentation review sessions are practices that have proven successful in building communities of practitioners (see sidebar “A sample of new generative AI skills needed”).

It’s important to bear in mind that successful gen AI skills are about more than coding proficiency. Our experience in developing our own gen AI platform, Lilli , showed us that the best gen AI technical talent has design skills to uncover where to focus solutions, contextual understanding to ensure the most relevant and high-quality answers are generated, collaboration skills to work well with knowledge experts (to test and validate answers and develop an appropriate curation approach), strong forensic skills to figure out causes of breakdowns (is the issue the data, the interpretation of the user’s intent, the quality of metadata on embeddings, or something else?), and anticipation skills to conceive of and plan for possible outcomes and to put the right kind of tracking into their code. A pure coder who doesn’t intrinsically have these skills may not be as useful a team member.

While current upskilling is largely based on a “learn on the job” approach, we see a rapid market emerging for people who have learned these skills over the past year. That skill growth is moving quickly. GitHub reported that developers were working on gen AI projects “in big numbers,” and that 65,000 public gen AI projects were created on its platform in 2023—a jump of almost 250 percent over the previous year. If your company is just starting its gen AI journey, you could consider hiring two or three senior engineers who have built a gen AI shaper product for their companies. This could greatly accelerate your efforts.

Form a centralized team to establish standards that enable responsible scaling

To ensure that all parts of the business can scale gen AI capabilities, centralizing competencies is a natural first move. The critical focus for this central team will be to develop and put in place protocols and standards to support scale, ensuring that teams can access models while also minimizing risk and containing costs. The team’s work could include, for example, procuring models and prescribing ways to access them, developing standards for data readiness, setting up approved prompt libraries, and allocating resources.

While developing Lilli, our team had its mind on scale when it created an open plug-in architecture and setting standards for how APIs should function and be built.  They developed standardized tooling and infrastructure where teams could securely experiment and access a GPT LLM , a gateway with preapproved APIs that teams could access, and a self-serve developer portal. Our goal is that this approach, over time, can help shift “Lilli as a product” (that a handful of teams use to build specific solutions) to “Lilli as a platform” (that teams across the enterprise can access to build other products).

For teams developing gen AI solutions, squad composition will be similar to AI teams but with data engineers and data scientists with gen AI experience and more contributors from risk management, compliance, and legal functions. The general idea of staffing squads with resources that are federated from the different expertise areas will not change, but the skill composition of a gen-AI-intensive squad will.

Set up the technology architecture to scale

Building a gen AI model is often relatively straightforward, but making it fully operational at scale is a different matter entirely. We’ve seen engineers build a basic chatbot in a week, but releasing a stable, accurate, and compliant version that scales can take four months. That’s why, our experience shows, the actual model costs may be less than 10 to 15 percent of the total costs of the solution.

Building for scale doesn’t mean building a new technology architecture. But it does mean focusing on a few core decisions that simplify and speed up processes without breaking the bank. Three such decisions stand out:

  • Focus on reusing your technology. Reusing code can increase the development speed of gen AI use cases by 30 to 50 percent. One good approach is simply creating a source for approved tools, code, and components. A financial-services company, for example, created a library of production-grade tools, which had been approved by both the security and legal teams, and made them available in a library for teams to use. More important is taking the time to identify and build those capabilities that are common across the most priority use cases. The same financial-services company, for example, identified three components that could be reused for more than 100 identified use cases. By building those first, they were able to generate a significant portion of the code base for all the identified use cases—essentially giving every application a big head start.
  • Focus the architecture on enabling efficient connections between gen AI models and internal systems. For gen AI models to work effectively in the shaper archetype, they need access to a business’s data and applications. Advances in integration and orchestration frameworks have significantly reduced the effort required to make those connections. But laying out what those integrations are and how to enable them is critical to ensure these models work efficiently and to avoid the complexity that creates technical debt  (the “tax” a company pays in terms of time and resources needed to redress existing technology issues). Chief information officers and chief technology officers can define reference architectures and integration standards for their organizations. Key elements should include a model hub, which contains trained and approved models that can be provisioned on demand; standard APIs that act as bridges connecting gen AI models to applications or data; and context management and caching, which speed up processing by providing models with relevant information from enterprise data sources.
  • Build up your testing and quality assurance capabilities. Our own experience building Lilli taught us to prioritize testing over development. Our team invested in not only developing testing protocols for each stage of development but also aligning the entire team so that, for example, it was clear who specifically needed to sign off on each stage of the process. This slowed down initial development but sped up the overall delivery pace and quality by cutting back on errors and the time needed to fix mistakes.

Ensure data quality and focus on unstructured data to fuel your models

The ability of a business to generate and scale value from gen AI models will depend on how well it takes advantage of its own data. As with technology, targeted upgrades to existing data architecture  are needed to maximize the future strategic benefits of gen AI:

  • Be targeted in ramping up your data quality and data augmentation efforts. While data quality has always been an important issue, the scale and scope of data that gen AI models can use—especially unstructured data—has made this issue much more consequential. For this reason, it’s critical to get the data foundations right, from clarifying decision rights to defining clear data processes to establishing taxonomies so models can access the data they need. The companies that do this well tie their data quality and augmentation efforts to the specific AI/gen AI application and use case—you don’t need this data foundation to extend to every corner of the enterprise. This could mean, for example, developing a new data repository for all equipment specifications and reported issues to better support maintenance copilot applications.
  • Understand what value is locked into your unstructured data. Most organizations have traditionally focused their data efforts on structured data (values that can be organized in tables, such as prices and features). But the real value from LLMs comes from their ability to work with unstructured data (for example, PowerPoint slides, videos, and text). Companies can map out which unstructured data sources are most valuable and establish metadata tagging standards so models can process the data and teams can find what they need (tagging is particularly important to help companies remove data from models as well, if necessary). Be creative in thinking about data opportunities. Some companies, for example, are interviewing senior employees as they retire and feeding that captured institutional knowledge into an LLM to help improve their copilot performance.
  • Optimize to lower costs at scale. There is often as much as a tenfold difference between what companies pay for data and what they could be paying if they optimized their data infrastructure and underlying costs. This issue often stems from companies scaling their proofs of concept without optimizing their data approach. Two costs generally stand out. One is storage costs arising from companies uploading terabytes of data into the cloud and wanting that data available 24/7. In practice, companies rarely need more than 10 percent of their data to have that level of availability, and accessing the rest over a 24- or 48-hour period is a much cheaper option. The other costs relate to computation with models that require on-call access to thousands of processors to run. This is especially the case when companies are building their own models (the maker archetype) but also when they are using pretrained models and running them with their own data and use cases (the shaper archetype). Companies could take a close look at how they can optimize computation costs on cloud platforms—for instance, putting some models in a queue to run when processors aren’t being used (such as when Americans go to bed and consumption of computing services like Netflix decreases) is a much cheaper option.

Build trust and reusability to drive adoption and scale

Because many people have concerns about gen AI, the bar on explaining how these tools work is much higher than for most solutions. People who use the tools want to know how they work, not just what they do. So it’s important to invest extra time and money to build trust by ensuring model accuracy and making it easy to check answers.

One insurance company, for example, created a gen AI tool to help manage claims. As part of the tool, it listed all the guardrails that had been put in place, and for each answer provided a link to the sentence or page of the relevant policy documents. The company also used an LLM to generate many variations of the same question to ensure answer consistency. These steps, among others, were critical to helping end users build trust in the tool.

Part of the training for maintenance teams using a gen AI tool should be to help them understand the limitations of models and how best to get the right answers. That includes teaching workers strategies to get to the best answer as fast as possible by starting with broad questions then narrowing them down. This provides the model with more context, and it also helps remove any bias of the people who might think they know the answer already. Having model interfaces that look and feel the same as existing tools also helps users feel less pressured to learn something new each time a new application is introduced.

Getting to scale means that businesses will need to stop building one-off solutions that are hard to use for other similar use cases. One global energy and materials company, for example, has established ease of reuse as a key requirement for all gen AI models, and has found in early iterations that 50 to 60 percent of its components can be reused. This means setting standards for developing gen AI assets (for example, prompts and context) that can be easily reused for other cases.

While many of the risk issues relating to gen AI are evolutions of discussions that were already brewing—for instance, data privacy, security, bias risk, job displacement, and intellectual property protection—gen AI has greatly expanded that risk landscape. Just 21 percent of companies reporting AI adoption say they have established policies governing employees’ use of gen AI technologies.

Similarly, a set of tests for AI/gen AI solutions should be established to demonstrate that data privacy, debiasing, and intellectual property protection are respected. Some organizations, in fact, are proposing to release models accompanied with documentation that details their performance characteristics. Documenting your decisions and rationales can be particularly helpful in conversations with regulators.

In some ways, this article is premature—so much is changing that we’ll likely have a profoundly different understanding of gen AI and its capabilities in a year’s time. But the core truths of finding value and driving change will still apply. How well companies have learned those lessons may largely determine how successful they’ll be in capturing that value.

Eric Lamarre

The authors wish to thank Michael Chui, Juan Couto, Ben Ellencweig, Josh Gartner, Bryce Hall, Holger Harreis, Phil Hudelson, Suzana Iacob, Sid Kamath, Neerav Kingsland, Kitti Lakner, Robert Levin, Matej Macak, Lapo Mori, Alex Peluffo, Aldo Rosales, Erik Roth, Abdul Wahab Shaikh, and Stephen Xu for their contributions to this article.

This article was edited by Barr Seitz, an editorial director in the New York office.

Explore a career with us

Related articles.

Light dots and lines evolve into a pattern of a human face and continue to stream off the the side in a moving grid pattern.

The economic potential of generative AI: The next productivity frontier

A yellow wire shaped into a butterfly

Rewired to outcompete

A digital construction of a human face consisting of blocks

Meet Lilli, our generative AI tool that’s a researcher, a time saver, and an inspiration

NeurIPS 2024, the Thirty-eighth Annual Conference on Neural Information Processing Systems, will be held at the Vancouver Convention Center

Monday Dec 9 through Sunday Dec 15. Monday is an industry expo.

machine learning in healthcare research papers pdf

Registration

Registration details will be posted soon. 

Our Hotel Reservation page is currently under construction and will be released shortly. NeurIPS has contracted Hotel guest rooms for the Conference at group pricing, requiring reservations only through this page. Please do not make room reservations through any other channel, as it only impedes us from putting on the best Conference for you. We thank you for your assistance in helping us protect the NeurIPS conference.

Announcements

Latest neurips blog entries [ all entries ], important dates.

If you have questions about supporting the conference, please contact us .

Become an 2024 Exhibitor Exhibitor Info »

Organizing Committee

Workflow manager, logistics and it, mission statement.

The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.

About the Conference

The conference was founded in 1987 and is now a multi-track interdisciplinary annual meeting that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Along with the conference is a professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas.

More about the Neural Information Processing Systems foundation »

IMAGES

  1. (PDF) Machine learning in medicine: a practical introduction

    machine learning in healthcare research papers pdf

  2. Machine Learning In Healthcare Informatics Pdf

    machine learning in healthcare research papers pdf

  3. (PDF) Machine Learning for Health Informatics

    machine learning in healthcare research papers pdf

  4. (PDF) Stress Detection with Machine Learning and Deep Learning using

    machine learning in healthcare research papers pdf

  5. role of artificial intelligence in healthcare ppt

    machine learning in healthcare research papers pdf

  6. (PDF) Comparative Study of Machine Learning Techniques in the Medical

    machine learning in healthcare research papers pdf

VIDEO

  1. Lec07_part1_introduction to machine learning

  2. Machine Learning Research Explained to a 5 Year Old #AI #musicgeneration

  3. 3 Federated Learning Frameworks and Collaborative Learning

  4. Basics of Machine Learning

  5. The Usage of Machine Learning and Deep Neural Network in the Medical Field

  6. AI in Personal Healthcare A Practical Guide

COMMENTS

  1. (PDF) Applications of Machine Learning in Healthcare

    Abstract and Figures. Machine learning techniques in healthcare use the increasing amount of health data provided by the Internet of Things to improve patient outcomes. These techniques provide ...

  2. A Comprehensive Review on Machine Learning in Healthcare Industry

    2. Overview of Machine-Learning in Healthcare. Machine learning is a type of artificial intelligence that involves training algorithms on data so that they can make predictions or take actions without being explicitly programmed. In healthcare, machine learning has the potential to revolutionize how we diagnose, treat, and prevent diseases, as ...

  3. Machine Learning in Healthcare

    The compilation of articles and papers focused on the use of machine learning and artificial intelligence in healthcare as well as current and potential applications. ... In a recent research study, Liu, Zhang, and Razavian developed a deep learning algorithm using LSTM networks (reinforcement learning) and CNNs (supervised learning) to predict ...

  4. Shifting machine learning for healthcare from development to ...

    In the past decade, the application of machine learning (ML) to healthcare has helped drive the automation of physician tasks as well as enhancements in clinical capabilities and access to care.

  5. Significance of machine learning in healthcare: Features, pillars and

    Machine Learning (ML) applications are making a considerable impact on healthcare. ML is a subtype of Artificial Intelligence (AI) technology that aims to improve the speed and accuracy of physicians' work. Countries are currently dealing with an overburdened healthcare system with a shortage of skilled physicians, where AI provides a big hope.

  6. Machine Learning in Healthcare: A Review

    Machine Learning is modern and highly sophisticated technological applications became a huge trend in the industry. Machine Learning is Omni present and is widely used in various applications. It is playing a vital role in many fields like finance, Medical science and in security. Machine learning is used to discover patterns from medical data sources and provide excellent capabilities to ...

  7. PDF Machine Learning (ML) in Medicine: Review, Applications, and Challenges

    Smiti [30] examined the main concepts of machine learning in healthcare. In this paper, in the first step, the healthcare process and its various phases are described in summary. According to [30], the healthcare process has four parts: prevention, detection, diagnosis, and treatment. Then, the machine learning process is briefly explained ...

  8. Full article: Systematic reviews of machine learning in healthcare: a

    Results. In total 220 SLRs covering 10,462 ML algorithms were reviewed. The main application of AI in medicine related to the clinical prediction and disease prognosis in oncology and neurology with the use of imaging data. Accuracy, specificity, and sensitivity were provided in 56%, 28%, and 25% SLRs respectively.

  9. Machine learning in healthcare: review, opportunities and challenges

    The integration of healthcare and technology indicates some characteristics such as; 1. Adoption of smart systems to handle health issues of human beings. 2. Improvement in quality of life of human beings using intelligent and smart systems. 3. Distribution of human workload to machine. 4.

  10. (PDF) Machine Learning with Health Care Perspective Machine Learning

    The first step, as presented in this paper, is to build a system based on the fusion of decisions from multiple neural networks, such as DarkNet-53, DenseNet-201, GoogLeNet, Inception-V3 ...

  11. PDF Machine learning in healthcare

    Machine learning in healthcare - a system's perspective [Position paper] Awais Ashfaq Center of Applied Intelligent Systems Research, Halmstad University. Sweden Halland Hospital, Region Halland. Sweden [email protected] Slawomir Nowaczyk Center of Applied Intelligent Systems Research, Halmstad University. Sweden [email protected] ABSTRACT

  12. Machine Learning in Healthcare

    Abstract. Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML) technology have brought on substantial strides in predicting and identifying health emergencies, disease populations, and disease state and immune response, amongst a few. Although, skepticism remains regarding the practical application and interpretation of ...

  13. [PDF] Machine Learning in Healthcare

    An overview of machine learning-based approaches and learning algorithms including supervised, unsupervised, and reinforcement learning along with examples are provided and the application of ML in several healthcare fields are discussed, including radiology, genetics, electronic health records, and neuroimaging. Recent advancements in Artificial Intelligence (AI) and Machine Learning (ML ...

  14. (PDF) The Future of Health care: Machine Learning

    Abstract. Machine learning (ML) is a rising field. Machine learning is to find patterns automatically and reason about data.ML enables personalized care called precision medicine. Machine learning ...

  15. Using machine learning for healthcare challenges and opportunities

    Machine learning (ML) and its applications in healthcare have gained a lot of attention. When enhanced computational power is combined with big data, there is an opportunity to use ML algorithms to improve health care. Supervised learning is the type of ML that can be implemented to predict labeled data based on algorithms such as linear or ...

  16. The role of machine learning in clinical research: transforming the

    Background Interest in the application of machine learning (ML) to the design, conduct, and analysis of clinical trials has grown, but the evidence base for such applications has not been surveyed. This manuscript reviews the proceedings of a multi-stakeholder conference to discuss the current and future state of ML for clinical research. Key areas of clinical trial methodology in which ML ...

  17. Machine Learning in Healthcare: A Review

    Abstract. This study attempts to introduce artificial intelligence and its significant subfields in machine learning algorithms and reviews the role of these subfields in various areas in healthcare such as bioinformatics, gene detection for cancer diagnosis, epileptic seizure, brain-computer interface. It also reviews the medical image ...

  18. Application of Machine Learning in Healthcare: An Analysis

    Health care field is facing a lot of challenges due to the huge volume of people need medical support. The pandemic situation has created a lot of challenges to the healthcare field. This paper analyses how advancement in machine learning can be best utilized in improving health care services. Machine learning techniques are based on the idea of how systems can learn from the already existing ...

  19. Systematic Mapping Study of AI/Machine Learning in Healthcare and

    This study attempts to categorise research conducted in the area of: use of machine learning in healthcare, using a systematic mapping study methodology. In our attempt, we reviewed literature from top journals, articles, and conference papers by using the keywords use of machine learning in healthcare. We queried Google Scholar, resulted in 1400 papers, and then categorised the results on the ...

  20. PDF Chapter Applications of Machine Learning in Healthcare

    1. Introduction. The advent of digital technologies in the healthcare field is characterized by continual challenges in both application and practicality. Unification of disparate health systems have been slow and the adoption of a fully integrated healthcare system in most parts of the world has not been accomplished.

  21. (PDF) Machine learning applications in healthcare sector: An overview

    a. Machine Learning: Machine learning is a branch of AI that enables computers to learn and make predictions or decisions without being explicitly programmed. In healthcare, machine learning ...

  22. Identifying Psychosis Episodes in Psychiatric Admission Notes via Rule

    Early and accurate diagnosis is crucial for effective treatment and improved outcomes, yet identifying psychotic episodes presents significant challenges due to its complex nature and the varied presentation of symptoms among individuals. One of the primary difficulties lies in the underreporting and underdiagnosis of psychosis, compounded by the stigma surrounding mental health and the ...

  23. Artificial intelligence and illusions of understanding in scientific

    The proliferation of artificial intelligence tools in scientific research risks creating illusions of understanding, where scientists believe they understand more about the world than they ...

  24. JMIR Public Health and Surveillance

    Background: Despite COVID-19 vaccine mandates, many chose to forgo vaccination, raising questions about the psychology underlying how judgment affects these choices. Research shows that reward and aversion judgments are important for vaccination choice; however, no studies have integrated such cognitive science with machine learning to predict COVID-19 vaccine uptake.

  25. (PDF) Machine Learning in HealthCare

    Abstract. Machine learning algorithms make a significant contribution to disease prediction. Disease Prediction system is based on predictive modelling which predicts the disease of the user on ...

  26. A generative AI reset: Rewiring to turn potential into value in 2024

    Those with existing AI and machine learning (ML) capabilities have a strong head start. Data engineers, for example, can learn multimodal processing and vector database management, MLOps (ML operations) engineers can extend their skills to LLMOps (LLM operations), and data scientists can develop prompt engineering, bias detection, and fine ...

  27. Machine Learning Algorithms in Healthcare: A Literature Survey

    PDF | On Jul 1, 2020, Munira Ferdous and others published Machine Learning Algorithms in Healthcare: A Literature Survey | Find, read and cite all the research you need on ResearchGate

  28. NeurIPS 2024

    The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research advances in Artificial Intelligence and Machine Learning, principally by hosting an annual interdisciplinary academic conference with the highest ethical standards for a diverse and inclusive community.