Electronic health records to facilitate clinical research

  • Open access
  • Published: 24 August 2016
  • Volume 106, pages 1–9 (2017)

  • Martin R. Cowie 1 ,
  • Juuso I. Blomster 2 , 3 ,
  • Lesley H. Curtis 4 ,
  • Sylvie Duclaux 5 ,
  • Ian Ford 6 ,
  • Fleur Fritz 7 ,
  • Samantha Goldman 8 ,
  • Salim Janmohamed 9 ,
  • Jörg Kreuzer 10 ,
  • Mark Leenay 11 ,
  • Alexander Michel 12 ,
  • Seleen Ong 13 ,
  • Jill P. Pell 14 ,
  • Mary Ross Southworth 15 ,
  • Wendy Gattis Stough 16 ,
  • Martin Thoenes 17 ,
  • Faiez Zannad 18 , 19 &
  • Andrew Zalewski 20  

Electronic health records (EHRs) provide opportunities to enhance patient care, embed performance measures in clinical practice, and facilitate clinical research. Concerns have been raised about the increasing recruitment challenges in trials, burdensome and obtrusive data collection, and uncertain generalizability of the results. Leveraging electronic health records to counterbalance these trends is an area of intense interest. The initial application of electronic health records as the primary data source is envisioned for observational studies, embedded pragmatic or post-marketing registry-based randomized studies, and comparative effectiveness studies. When this approach is extended to randomized clinical trials, electronic health records may potentially be used to assess study feasibility, to facilitate patient recruitment, and to streamline data collection at baseline and follow-up. Ensuring data security and privacy, linking diverse systems, and maintaining infrastructure for repeat use of high-quality data are among the challenges associated with using electronic health records in clinical research. Collaboration between academia, industry, regulatory bodies, policy makers, patients, and electronic health record vendors is critical for the greater use of electronic health records in clinical research. This manuscript identifies the key steps required to advance the role of electronic health records in cardiovascular clinical research.

Introduction

Electronic health records (EHRs) provide opportunities to enhance patient care, to embed performance measures in clinical practice, and to improve the identification and recruitment of eligible patients and healthcare providers in clinical research. On a macroeconomic scale, EHRs (by enabling pragmatic clinical trials) may assist in the assessment of whether new treatments or innovation in healthcare delivery result in improved outcomes or healthcare savings.

Concerns have been raised about the current state of cardiovascular clinical research: the increasing recruitment challenges; burdensome data collection; and uncertain generalizability to clinical practice [ 1 ]. These factors add to the increasing costs of clinical research [ 2 ] and are thought to contribute to declining investment in the field [ 1 ].

The Cardiovascular Round Table (CRT) of the European Society of Cardiology (ESC) convened a two-day workshop among international experts in cardiovascular clinical research and health informatics to explore how EHRs could advance cardiovascular clinical research. This paper summarizes the key insights and discussions from the workshop, acknowledges the barriers to EHR implementation in clinical research, and identifies practical solutions for engaging stakeholders (i.e., academia, industry, regulatory bodies, policy makers, patients, and EHR vendors) in the implementation of EHRs in clinical research.

Overview of electronic health records

Broadly defined, EHRs represent longitudinal data (in electronic format) that are collected during routine delivery of health care [ 3 ]. EHRs generally contain demographic, vital statistics, administrative, claims (medical and pharmacy), clinical, and patient-centered (e.g., originating from health-related quality-of-life instruments, home-monitoring devices, and frailty or caregiver assessments) data. The scope of an EHR varies widely across the world. Systems originating primarily as billing systems were not designed to support clinical workflow. Moving forward, EHRs should be designed to optimize diagnosis and clinical care, which will enhance their relevance for clinical research. The EHR may reflect single components of care (e.g., primary care, emergency department, and intensive care unit) or data from an integrated hospital-wide or inter-hospital linked system [ 4 ]. EHRs may also change over time, reflecting evolving technology capabilities or external influences (e.g., changes in the type of data collected related to coding or reimbursement practices).

EHRs emerged largely as a means to improve healthcare quality [ 5 – 7 ] and to capture billing data. EHRs may potentially be used to assess study feasibility, facilitate patient recruitment, streamline data collection, or conduct entirely EHR-based observational, embedded pragmatic, or post-marketing randomized registry studies, or comparative effectiveness studies. The various applications of EHRs for observational studies, safety surveillance, clinical research, and regulatory purposes are shown in Table  1 [ 3 , 8 – 10 ].

Electronic health records for research applications

Epidemiologic and observational research

EHR data have been used to support observational studies, either as stand-alone data or following linkage to primary research data or other administrative data sets [ 3 , 11 – 14 ]. For example, the initial Euro Heart Survey [ 15 ] and subsequent Eurobservational Research Program (EORP) [ 16 ], the American College of Cardiology National Cardiovascular Data Registry (ACC-NCDR) [ 14 ], National Registry of Myocardial Infarction (NRMI), and American Heart Association Get With the Guidelines (AHA GWTG) [ 17 ] represent clinical data (collected from health records into an electronic case report form [eCRF] designed for the specific registry) on the management of patients across a spectrum of different cardiovascular diseases. However, modern EHR systems can minimize or eliminate the need for duplicate data collection (i.e., in a separate registry-specific eCRF) and are capable of integrating large amounts of medical information accumulated throughout the patient’s life, enabling longitudinal study of diseases using the existing informatics infrastructure [ 18 ]. For example, EHR systems increasingly house imaging data, which provide more detailed disease characterization than previously available in most observational data sets. In some countries (e.g., Farr Institute in Scotland [ 19 ]), the EHR can be linked, at an individual level, to other data sets, including general population health and lifestyle surveys, disease registries, and data collected by other sectors (e.g., education, housing, social care, and criminal justice). EHR data support a wide range of epidemiological research on the natural history of disease, drug utilization, and safety, as well as health services research.

Safety surveillance and regulatory uses

Active post-marketing safety surveillance and signal detection are important, emerging applications for EHRs, because they can provide realistic rates of events (unlike spontaneous event reports) and information on real-world use of drugs [ 20 ]. The EU-ADR project linked 8 databases in four European countries (Denmark, Italy, The Netherlands, United Kingdom) to enable analysis of select target adverse drug events [ 21 ]. The European Medicines Agency (EMA) coordinates the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP) which aims to conduct post-marketing risk assessment using various EHR sources [ 22 , 23 ]. In the United States, the Food and Drug Administration (FDA) uses EHR data from several different sources (e.g., Sentinel and Mini-Sentinel System [ 24 ], Centers for Medicare and Medicaid Services [CMS], Veterans Affairs, Department of Defense, Substance Abuse and Mental Health Services Administration) to support post-marketing safety investigations [ 25 ].

Prospective clinical research

National patient registries that contain data extracted from the EHR are an accepted modality to assess guideline adherence and the effectiveness of performance improvement initiatives [ 26 – 33 ]. However, the use of EHRs for prospective clinical research is still limited, despite the fact that data collected for routine medical care overlap considerably with data collected for research. The most straightforward and generally accepted application for EHRs is assessing trial feasibility and facilitating patient recruitment, and EHRs are currently used for this purpose in some centers. Using EHR technology to generate lists of patients who might be eligible for research is recognized as an option to meet meaningful use standards for EHRs in the United States [ 6 ]. Incomplete data may prohibit screening for the complete list of eligibility criteria [ 34 ], but EHRs may facilitate pre-screening of patients by age, gender, and diagnosis, particularly for exclusion of ineligible patients, and reduce the overall screening burden in clinical trials [ 35 ]. A second, and more complex, step involves the reuse of information collected in EHRs for routine clinical care as source data for research. Using EHRs as the source for demographic information, co-morbidities, and concomitant medications has several advantages over separately recording these data into an eCRF. Transcription errors may be reduced, since EHR data are entered by providers directly involved in a patient’s care as opposed to secondary eCRF entry by study personnel. The eCRF may be a redundant and costly step in a clinical trial, since local health records (electronic or paper) are used to verify source data entered into the eCRF. Finally, EHRs might enhance patient safety and reduce timelines if real-time EHR systems are used in clinical trials, in contrast to delays encountered with manual data entry into an eCRF. The EHR may facilitate implementation of remote data monitoring, which has the potential to greatly reduce clinical trial costs. The Innovative Medicine Initiative (IMI) Electronic Health Records for Clinical Research (EHR4CR, http://www.ehr4cr.eu ) project is one example, where tools and processes are being developed to facilitate reuse of EHR data for clinical research purposes. Systems to assess protocol feasibility and identify eligible patients for recruitment have been implemented, and efforts to link EHRs with clinical research electronic data collection are ongoing [ 36 ].
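The pre-screening step described above (coarse filtering by age and diagnosis, mainly to rule out ineligible patients) can be illustrated with a minimal sketch. The record layout, the class and function names, and the example ICD-10 categories (I50 for heart failure, N18 for chronic kidney disease) are assumptions chosen for illustration and are not taken from any specific EHR system or from the EHR4CR tools. Patients who pass this filter would still need full eligibility assessment by study staff.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of the structured EHR fields used for pre-screening."""
    patient_id: str
    age: int
    sex: str
    diagnosis_codes: Set[str] = field(default_factory=set)


@dataclass
class PreScreenCriteria:
    """Coarse criteria only; detailed eligibility is assumed to be checked manually later."""
    min_age: int
    max_age: int
    required_diagnoses: Set[str]   # at least one of these must be recorded
    excluded_diagnoses: Set[str]   # any one of these rules the patient out


def pre_screen(records: List[PatientRecord], criteria: PreScreenCriteria) -> List[str]:
    """Return the IDs of patients who are not ruled out by the coarse criteria."""
    candidates = []
    for record in records:
        if not (criteria.min_age <= record.age <= criteria.max_age):
            continue  # outside the age range
        if record.diagnosis_codes & criteria.excluded_diagnoses:
            continue  # has a recorded exclusion diagnosis
        if not (record.diagnosis_codes & criteria.required_diagnoses):
            continue  # qualifying diagnosis not recorded
        candidates.append(record.patient_id)
    return candidates


records = [
    PatientRecord("p1", 67, "F", {"I50"}),
    PatientRecord("p2", 45, "M", {"I50"}),
    PatientRecord("p3", 70, "M", {"I50", "N18"}),
    PatientRecord("p4", 72, "F", {"E11"}),
]
criteria = PreScreenCriteria(min_age=60, max_age=85,
                             required_diagnoses={"I50"},
                             excluded_diagnoses={"N18"})
print(pre_screen(records, criteria))
```

Because missing codes are treated here as "diagnosis not recorded", a filter of this kind can miss eligible patients when EHR data are incomplete, which is the limitation noted in the text.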

A shift towards pragmatic trials has been proposed as a mechanism to improve clinical trial efficiency [ 37 ]. Most of the data in a pragmatic trial are collected in the context of routine clinical care, which reduces trial-specific clinic visits and assessments, and should also reduce costs [ 38 ]. This concept is being applied in the National Institutes of Health (NIH) Health Care Systems Research Collaboratory. Trials conducted within the NIH Collaboratory aim to answer questions related to care delivery, and the EHR contains relevant data for this purpose. Studies may have additional data collection modules if variables not routinely captured in the EHR are needed for a specific study. Similarly, the Patient-Centered Outcomes Research Institute (PCORI) has launched PCORnet, a research network that uses a common data platform alongside the existing EHR to conduct observational and interventional comparative effectiveness research [ 9 , 39 , 40 ].

Integrating EHRs into conventional randomized controlled trials intended to support a new indication is more complex. EHRs may be an alternative to eCRFs when data collection is focused and limited to critical variables that are consistently collected in routine clinical care. Regulatory feedback indicates that while a new indication for a marketed drug might be achieved through EHRs, first marketing authorization using data entirely from EHRs would most likely not be possible with current systems until validation studies are performed and reviewed by regulatory agencies. The EHR could also be used to collect serious adverse events (SAEs) that result in hospitalization, or to collect endpoints that do not necessarily require blinded adjudication (e.g., death), although the utility of EHRs for this purpose is dependent on the type of endpoint, whether it can reliably be identified in the EHR, and the timeliness of EHR data availability. Events that are coded for reimbursement (e.g., hospitalizations, MI) or new diagnoses for which disease-specific therapy is initiated (e.g., initiation of glucose-lowering drugs to define new-onset diabetes) tend to be more reliable. The reliability of endpoint collection varies by region and depends on the extent of linkage between different databases.
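As an illustration of an endpoint defined by initiation of disease-specific therapy, the following sketch derives a new-onset diabetes date from prescription records. It assumes that prescriptions are available as (date, ATC code) pairs and that any code in ATC group A10 (drugs used in diabetes) counts as glucose-lowering therapy; the function name and the rule that treatment on or before baseline indicates prevalent disease are illustrative assumptions, not a validated endpoint definition.

```python
from datetime import date
from typing import List, Optional, Tuple


def new_onset_diabetes_date(
    prescriptions: List[Tuple[date, str]],
    baseline_date: date,
    atc_prefix: str = "A10",
) -> Optional[date]:
    """Return the date of the first glucose-lowering prescription after baseline.

    Returns None if no such drug was ever prescribed, or if the patient was
    already treated on or before baseline (prevalent rather than new-onset).
    """
    treatment_dates = sorted(
        issued for issued, atc_code in prescriptions if atc_code.startswith(atc_prefix)
    )
    if not treatment_dates:
        return None
    if treatment_dates[0] <= baseline_date:
        return None
    return treatment_dates[0]


# Example: a statin (C10AA05) followed later by metformin (A10BA02).
history = [
    (date(2015, 3, 1), "C10AA05"),
    (date(2016, 6, 10), "A10BA02"),
    (date(2016, 9, 1), "A10BA02"),
]
print(new_onset_diabetes_date(history, baseline_date=date(2015, 1, 1)))
```

A definition of this kind depends on prescribing data being complete and linked to the trial population, which, as noted above, varies by region.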

Challenges to using electronic health records in clinical trials and steps toward solutions

Challenges to using EHRs in clinical trials have been identified, related to data quality and validation, complete data capture, heterogeneity between systems, and developing a working knowledge across systems (Table  2 ). Ongoing projects, such as those conducted within the NIH Collaboratory and PCORnet [ 39 , 41 ] in the United States or the Farr Institute of Health Informatics Research in Scotland, have demonstrated the feasibility of using EHRs for aspects of clinical research, particularly comparative effectiveness. The success of these endeavors is connected to careful planning by a multi-stakeholder group committed to patient privacy, data security, fair governance, robust data infrastructure, and quality science from the outset. The next hurdle is to adapt the accrued knowledge for application to a broader base of clinical trials.

Data quality and validation

Data quality and validation are key factors in determining whether EHRs might be suitable data sources in clinical trials. Concerns about coding inaccuracies or bias introduced by selection of codes driven by billing incentives rather than clinical care may be diminished when healthcare providers enter data directly into the EHRs or when EHRs are used throughout all areas of the health-system, but such systems have not yet been widely implemented [ 42 ]. Excessive or busy workloads may also contribute to errors in clinician data entry [ 43 ]. Indeed, errors in EHRs have been reported [ 43 – 45 ].

Complete data capture is also a critical aspect of using EHRs for clinical research, particularly if EHRs are used for endpoint ascertainment or SAE collection. Complete data capture can be a major barrier in regions where patients receive care from different providers or hospitals operating in different EHR systems that are not linked.

Consistent, validated methods for assessing data quality and completeness have not yet been adopted [ 46 ], but validation is a critical factor for the regulatory acceptance of EHR data. Proposed validation approaches include using both an eCRF and EHRs in a study in parallel and comparing results using the two data collection methods. This approach will require collaborative efforts to embed EHR substudies in large cardiovascular studies conducted by several sponsors. Assessing selected outcomes of interest from several EHR-based trials to compare different methodologies with an agreed statistical framework will be required to gauge precision of data collection via EHRs. A hybrid approach has also been proposed, where the EHR is used to identify study endpoints (e.g., death, hospitalization, myocardial infarction, and cancer), followed by adjudication and validation of EHR findings using clinical data (e.g., electrocardiogram and laboratory data).
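The parallel eCRF and EHR comparison proposed above can be summarized with simple agreement statistics. The sketch below treats the eCRF (or adjudicated) endpoint as the reference and reports the sensitivity and positive predictive value of the EHR-derived endpoint together with Cohen's kappa. The function name and input layout are assumptions, and the choice of statistics is illustrative; the text calls for an agreed statistical framework but does not specify one.

```python
import math
from typing import Dict, Iterable


def agreement_metrics(
    ehr_events: Iterable[str],
    ecrf_events: Iterable[str],
    all_patients: Iterable[str],
) -> Dict[str, float]:
    """Compare a binary endpoint captured via the EHR with the same endpoint in the eCRF.

    ehr_events / ecrf_events: IDs of patients with the endpoint according to each source.
    all_patients: IDs of every patient followed in both sources.
    """
    everyone = set(all_patients)
    if not everyone:
        raise ValueError("all_patients must not be empty")
    ehr = set(ehr_events) & everyone
    ecrf = set(ecrf_events) & everyone

    both = len(ehr & ecrf)            # endpoint found by both sources
    ehr_only = len(ehr - ecrf)        # found only in the EHR
    ecrf_only = len(ecrf - ehr)       # found only in the eCRF
    neither = len(everyone - ehr - ecrf)
    n = len(everyone)

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else math.nan

    observed = (both + neither) / n
    expected = (
        (both + ehr_only) * (both + ecrf_only)
        + (neither + ecrf_only) * (neither + ehr_only)
    ) / n ** 2
    return {
        "sensitivity": ratio(both, both + ecrf_only),  # eCRF events also found in the EHR
        "ppv": ratio(both, both + ehr_only),           # EHR events confirmed by the eCRF
        "kappa": ratio(observed - expected, 1 - expected),
    }


patients = [f"p{i}" for i in range(1, 11)]
metrics = agreement_metrics(
    ehr_events={"p1", "p2", "p3", "p5"},
    ecrf_events={"p1", "p2", "p3", "p4"},
    all_patients=patients,
)
print(metrics)
```

Agreement statistics of this kind address consistency between the two sources only; whether the observed differences change a study's conclusions would need a separate sensitivity analysis.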

Validity should be defined a priori and should be specific to the endpoints of interest as well as relevant to the country or healthcare system. Validation studies should aim to assess both the consistency between EHR data and standard data collection methods, and also how identified differences influence a study’s results. Proposed uses of EHRs for registration trials and methods for their validation will likely be considered by regulatory agencies on a case-by-case basis, because of the limited experience with EHRs for this purpose at the current time. Collaboration among industry sponsors to share cumulative experiences with EHR validation studies might lead to faster acceptance by regulatory authorities.

The ESC-CRT recommends that initial efforts to integrate EHRs in clinical trials focus on a few efficacy endpoints of interest, preferably objective endpoints (e.g., all-cause or cause-specific mortality) that are less susceptible to bias or subjective interpretation. As noted above, mortality may be incompletely captured in EHRs, particularly if patients die outside of the hospital, or at another institution using a non-integrated EHR. Thus, methods to supplement endpoint ascertainment in the EHR may be necessary if data completeness is uncertain. Standardized endpoint definitions based on the EHR should be included in the study protocol and analysis plan. A narrow set of data elements for auditing should be prospectively defined to ensure that the required variables are contained in the EHR.

Early interaction between sponsors, clinical investigators, and regulators is recommended to enable robust designs for clinical trials aiming to use EHRs for endpoint ascertainment. Plans to translate Good Clinical Practice into an EHR-facilitated research environment should be described. Gaps in personnel training and education should be identified, and specific actions to address training deficiencies should be communicated to regulators and be in place prior to the start of the trial.

Timely access to electronic health record data

The potential for delays in data access is an important consideration when EHRs are used in clinical trials. EHRs may contain data originally collected as free text that was later coded for the EHR. Thus, coded information may not be available for patient identification/recruitment during the admission. Similarly, coding may occur weeks or months after discharge. In nationally integrated systems, data availability may also be delayed. These delays may be critical depending on the purpose of data extracted from the EHR (e.g., SAE reporting, source data, or endpoints in a time-sensitive study).

Heterogeneity between systems

Patients may be treated by multiple healthcare providers who operate independently of one another. Such patients may have more than one EHR, and these EHRs may not be linked. This heterogeneity adds to the complexity of using EHRs for clinical trials, since data coordinating centres have to develop processes for interacting with or extracting data from any number of different systems. Differences in quality [ 47 ], non-standardized terminology, incomplete data capture, issues related to data sharing and data privacy, lack of common data fields, and the inability of systems to be configured to communicate with each other may also be problematic. Achieving agreement on a minimum set of common data fields to enable cross communication between systems would be a major step forward towards enabling EHRs to be used in clinical trials across centers and regions [ 48 , 49 ].
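The idea of a minimum set of common data fields can be sketched as a mapping from each system's local field names to a shared vocabulary. The four common fields, the two system names, and their local field names below are hypothetical and serve only to show the mechanism; an agreed common data set would be defined by the participating stakeholders.

```python
from typing import Any, Dict, List

# Hypothetical minimum common data set.
COMMON_FIELDS = ("patient_id", "birth_year", "sex", "diagnosis_code")

# Hypothetical local-to-common field maps, one per source system.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "system_a": {"pid": "patient_id", "yob": "birth_year",
                 "gender": "sex", "icd10": "diagnosis_code"},
    "system_b": {"PatientNumber": "patient_id", "Sex": "sex",
                 "DxCode": "diagnosis_code"},
}


def to_common(record: Dict[str, Any], system: str) -> Dict[str, Any]:
    """Translate one local record into the common fields; unmapped fields are dropped."""
    mapping = FIELD_MAPS[system]
    common: Dict[str, Any] = {name: None for name in COMMON_FIELDS}
    for local_name, value in record.items():
        common_name = mapping.get(local_name)
        if common_name is not None:
            common[common_name] = value
    return common


def missing_fields(common_record: Dict[str, Any]) -> List[str]:
    """List the common fields that the source system could not supply."""
    return [name for name in COMMON_FIELDS if common_record[name] is None]


local = {"PatientNumber": "b-17", "Sex": "F", "DxCode": "I21", "Ward": "CCU"}
harmonized = to_common(local, "system_b")
print(harmonized, missing_fields(harmonized))
```

Reporting the missing fields per system makes gaps in data capture explicit before a trial starts, rather than during data extraction.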

Data security and privacy

Privacy issues and information governance are among the most complex aspects of implementing EHRs for clinical research, in part because attitudes and regulations related to data privacy vary markedly around the world. Data security and appropriate use are high priorities, but access should not be restricted to the extent that the data are of limited usefulness. Access to EHR data by regulatory agencies will be necessary for auditing purposes in registration trials. Distributed analyses have the advantage of allowing data to remain with the individual site and under its control [ 39 , 41 ].
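The distributed analysis approach mentioned above can be sketched as follows: each site reduces its own patient-level data to aggregate counts, and only those counts are sent to the coordinating centre for pooling. The row layout, the arm labels, and the function names are assumptions for illustration; the governance steps that a real network would apply to outgoing results are not shown.

```python
from typing import Dict, Iterable, List

ArmCounts = Dict[str, Dict[str, int]]


def site_summary(local_rows: Iterable[dict]) -> ArmCounts:
    """Runs at the site: reduce patient-level rows to counts per treatment arm.

    Each row is assumed to look like {"arm": "intervention", "event": True}.
    Only the returned counts leave the site.
    """
    summary: ArmCounts = {}
    for row in local_rows:
        counts = summary.setdefault(row["arm"], {"n": 0, "events": 0})
        counts["n"] += 1
        counts["events"] += 1 if row["event"] else 0
    return summary


def pool(summaries: List[ArmCounts]) -> ArmCounts:
    """Runs at the coordinating centre: add up the per-site counts."""
    pooled: ArmCounts = {}
    for summary in summaries:
        for arm, counts in summary.items():
            total = pooled.setdefault(arm, {"n": 0, "events": 0})
            total["n"] += counts["n"]
            total["events"] += counts["events"]
    return pooled


site_a_rows = [
    {"arm": "intervention", "event": True},
    {"arm": "intervention", "event": False},
    {"arm": "usual_care", "event": False},
]
site_b_rows = [
    {"arm": "intervention", "event": False},
    {"arm": "usual_care", "event": True},
    {"arm": "usual_care", "event": True},
]
pooled_counts = pool([site_summary(site_a_rows), site_summary(site_b_rows)])
print(pooled_counts)
```

In this arrangement the coordinating centre never holds patient-level records, which is the property that allows each site to retain control of its data.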

Pre-trial planning is critical to anticipate data security issues and to develop optimal standards and infrastructure. For pivotal registration trials, patients should be informed during the consent process about how their EHRs will be used and by whom. Modified approaches to obtaining informed consent for comparative effectiveness research studies of commonly used clinical practices or interventions may be possible [ 50 ]. A general upfront consent stating that EHR data may be used for research is a proactive step that may minimize later barriers to data access, although revision of existing legislation or ethics board rules may be needed to allow this approach. Patients and the public should be recognized as important stakeholders. If they are educated and engaged in the process, and if the purpose and procedures for EHR use are transparent, they can be advocates for clinical research using EHRs and can improve the quality of EHR-based research. Developing optimal procedures for ensuring that patients are informed and protected, balanced against minimizing barriers to research, is a major consideration as EHR-based research advances.

System capabilities

EHRs for use in clinical research need a flexible architecture to accommodate studies of different interventions or disease states. EHR systems may be capable of matching eligibility criteria to relevant data fields and flagging potential trial subjects to investigators. Patient questionnaires and surveys can be linked to EHRs to provide additional context to clinical data. Pre-population of eCRFs has been proposed as a potential role for EHRs, but the proportion of fields in an EHR that can be mapped to an eCRF varies substantially across systems.

EHRs may be more suitable for pragmatic trials where data collection mirrors those variables collected in routine clinical care. Whether regulators would require collection of additional elements to support a new drug or new indication depends on the drug, intended indication, patient population, and potential safety concerns.

Sustainability

The sustainability of EHRs in clinical research will largely depend on the materialization of their promised efficiencies. Programs like the NIH Collaboratory [ 41 ] and PCORnet [ 39 , 41 ], and randomized registry trials [ 51 , 52 ] are demonstrating the feasibility of these more efficient approaches to clinical research. The sustainability of using EHRs for pivotal registration clinical trials will depend on regulatory acceptance of the approach and whether the efficiencies support a business case for their use.

Role of stakeholders

To make the vision of EHRs in clinical trials a reality, stakeholders should collaborate and contribute to the advancement of EHRs for research. Professional bodies, such as the ESC, can play a major role in the training and education of researchers and the public about the potential value of EHRs. Clinical trialists and industry must be committed to advancing validation methodology [ 53 ]. Investigators should develop, conduct, and promote institutional EHR trials that change clinical practice; such experience may encourage EHR trial adoption by industry and the agencies. Development of core or minimal data sets could streamline the process, reduce redundancy and heterogeneity, and decrease start-up time for future EHR-based clinical trials. These and other stakeholder contributions are outlined in Table  3 .

Electronic health records are a promising resource to improve the efficiency of clinical trials and to capitalize on novel research approaches. EHRs are useful data sources to support comparative effectiveness research and new trial designs that may answer relevant clinical questions as well as improve efficiency and reduce the cost of cardiovascular clinical research. Initial experience with EHRs has been encouraging, and accruing knowledge will continue to transform the application of EHRs for clinical research. The pace of technology has produced unprecedented analytic capabilities, but these must be pursued with appropriate measures in place to manage security and privacy and to ensure adequacy of informed consent. Ongoing programs have implemented creative solutions for these issues by using distributed analyses to allow organizations to retain data control and by engaging patient stakeholders. Whether EHRs can be successfully applied to conventional drug development in pivotal registration trials remains to be seen and will depend on demonstration of data quality and validity, as well as realization of expected efficiencies.

Jackson N, Atar D, Borentain M, Breithardt G, van Eickels M, Endres M, Fraass U, Friede T, Hannachi H, Janmohamed S, Kreuzer J, Landray M, Lautsch D, Le Floch C, Mol P, Naci H, Samani N, Svensson A, Thorstensen C, Tijssen J, Vandzhura V, Zalewski A, Kirchhof P (2016) Improving clinical trials for cardiovascular diseases: a position paper from the Cardiovascular Roundtable of the European Society of Cardiology. Eur Heart J 37:747–754

Eisenstein EL, Collins R, Cracknell BS, Podesta O, Reid ED, Sandercock P, Shakhov Y, Terrin ML, Sellers MA, Califf RM, Granger CB, Diaz R (2008) Sensible approaches for reducing clinical trial costs. Clin Trials 5:75–84

Denaxas SC, Morley KI (2015) Big biomedical data and cardiovascular disease research: opportunities and challenges. European Heart Journal - Quality of Care and Clinical Outcomes 1:9–16

Hayrinen K, Saranto K, Nykanen P (2008) Definition, structure, content, use and impacts of electronic health records: a review of the research literature. Int J Med Inform 77:291–304

Appari A, Eric JM, Anthony DL (2013) Meaningful use of electronic health record systems and process quality of care: evidence from a panel data analysis of U.S. acute-care hospitals. Health Serv Res 48:354–375

Blumenthal D, Tavenner M (2010) The “meaningful use” regulation for electronic health records. N Engl J Med 363:501–504

Roumia M, Steinhubl S (2014) Improving cardiovascular outcomes using electronic health records. Curr Cardiol Rep 16:451

Doods J, Botteri F, Dugas M, Fritz F (2014) A European inventory of common electronic health record data elements for clinical trial feasibility. Trials 15:18

Collins FS, Hudson KL, Briggs JP, Lauer MS (2014) PCORnet: turning a dream into reality. J Am Med Inform Assoc 21:576–577

James S, Rao SV, Granger CB (2015) Registry-based randomized clinical trials–a new clinical trial paradigm. Nat Rev Cardiol 12:312–316

Krumholz HM, Normand SL, Wang Y (2014) Trends in hospitalizations and outcomes for acute cardiovascular disease and stroke, 1999-2011. Circulation 130:966–975

Hlatky MA, Ray RM, Burwen DR, Margolis KL, Johnson KC, Kucharska-Newton A, Manson JE, Robinson JG, Safford MM, Allison M, Assimes TL, Bavry AA, Berger J, Cooper-DeHoff RM, Heckbert SR, Li W, Liu S, Martin LW, Perez MV, Tindle HA, Winkelmayer WC, Stefanick ML (2014) Use of Medicare data to identify coronary heart disease outcomes in the Women’s Health Initiative. Circ Cardiovasc Qual Outcomes 7:157–162

Chung SC, Gedeborg R, Nicholas O, James S, Jeppsson A, Wolfe C, Heuschmann P, Wallentin L, Deanfield J, Timmis A, Jernberg T, Hemingway H (2014) Acute myocardial infarction: a comparison of short-term survival in national outcome registries in Sweden and the UK. Lancet 383:1305–1312

Brindis RG, Fitzgerald S, Anderson HV, Shaw RE, Weintraub WS, Williams JF (2001) The American College of Cardiology-National Cardiovascular Data Registry (ACC-NCDR): building a national clinical data repository. J Am Coll Cardiol 37:2240–2245

Scholte op Reimer W, Gitt A, Boersma E, Simoons Me (2006) Cardiovascular diseases in Europe. Euro Heart Survey 2006. European Society of Cardiology, Sophia Antipolis

Ferrari R (2010) EURObservational research programme. Eur Heart J 31:1023–1031

Smaha LA (2004) The American Heart Association Get With The Guidelines program. Am Heart J 148:S46–S48

Krumholz HM (2014) Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff (Millwood) 33:1163–1170

Wood R, Clark D, King A, Mackay D, Pell J (2013) Novel cross-sectoral linkage of routine health and education data at an all-Scotland level: a feasibility study. Lancet 382(Supplement 3):S10

Cederholm S, Hill G, Asiimwe A, Bate A, Bhayat F, Persson BG, Bergvall T, Ansell D, Star K, Noren GN (2015) Structured assessment for prospective identification of safety signals in electronic medical records: evaluation in the health improvement network. Drug Saf 38:87–100

Trifiro G, Fourrier-Reglat A, Sturkenboom MC, Diaz AC, Van Der Lei J (2009) The EU-ADR project: preliminary results and perspective. Stud Health Technol Inform 148:43–49

Eichler HG, Pignatti F, Flamion B, Leufkens H, Breckenridge A (2008) Balancing early market access to new drugs with the need for benefit/risk data: a mounting dilemma. Nat Rev Drug Discov 7:818–826

Goedecke T, Arlett P (2014) A Description of the European Network of Centres for pharmacoepidemiology and pharmacovigilance as a global resource for pharmacovigilance and pharmacoepidemiology. Mann’s pharmacovigilance. Wiley, New York, pp 403–408

Ball R, Robb M, Anderson SA, Dal Pan G (2016) The FDA’s sentinel initiative-A comprehensive approach to medical product surveillance. Clin Pharmacol Ther 99:265–268

Staffa JA, Dal Pan GJ (2012) Regulatory innovation in postmarketing risk assessment and management. Clin Pharmacol Ther 91:555–557

Peterson ED, Shah BR, Parsons L, Pollack CV Jr, French WJ, Canto JG, Gibson CM, Rogers WJ (2008) Trends in quality of care for patients with acute myocardial infarction in the National Registry of Myocardial Infarction from 1990 to 2006. Am Heart J 156:1045–1055

Chan PS, Maddox TM, Tang F, Spinler S, Spertus JA (2011) Practice-level variation in warfarin use among outpatients with atrial fibrillation (from the NCDR PINNACLE program). Am J Cardiol 108:1136–1140

Maddox TM, Chan PS, Spertus JA, Tang F, Jones P, Ho PM, Bradley SM, Tsai TT, Bhatt DL, Peterson PN (2014) Variations in coronary artery disease secondary prevention prescriptions among outpatient cardiology practices: insights from the NCDR (National Cardiovascular Data Registry). J Am Coll Cardiol 63:539–546

Jernberg T, Attebring MF, Hambraeus K, Ivert T, James S, Jeppsson A, Lagerqvist B, Lindahl B, Stenestrand U, Wallentin L (2010) The Swedish Web-system for enhancement and development of evidence-based care in heart disease evaluated according to recommended therapies (SWEDEHEART). Heart 96:1617–1621

Cleland JG, Swedberg K, Follath F, Komajda M, Cohen-Solal A, Aguilar JC, Dietz R, Gavazzi A, Hobbs R, Korewicki J, Madeira HC, Moiseyev VS, Preda I, van Gilst WH, Widimsky J, Freemantle N, Eastaugh J, Mason J (2003) The EuroHeart Failure survey programme: a survey on the quality of care among patients with heart failure in Europe. Part 1: patient characteristics and diagnosis. Eur Heart J 24:442–463

Nieminen MS, Brutsaert D, Dickstein K, Drexler H, Follath F, Harjola VP, Hochadel M, Komajda M, Lassus J, Lopez-Sendon JL, Ponikowski P, Tavazzi L (2006) EuroHeart Failure Survey II (EHFS II): a survey on hospitalized acute heart failure patients: description of population. Eur Heart J 27:2725–2736

Tofield A (2010) EURObservational research programme. Eur Heart J 31:1023–1031

McNamara RL, Herrin J, Bradley EH, Portnay EL, Curtis JP, Wang Y, Magid DJ, Blaney M, Krumholz HM (2006) Hospital improvement in time to reperfusion in patients with acute myocardial infarction, 1999 to 2002. J Am Coll Cardiol 47:45–51

Kopcke F, Trinczek B, Majeed RW, Schreiweis B, Wenk J, Leusch T, Ganslandt T, Ohmann C, Bergh B, Rohrig R, Dugas M, Prokosch HU (2013) Evaluation of data completeness in the electronic health record for the purpose of patient recruitment into clinical trials: a retrospective analysis of element presence. BMC Med Inform Decis Mak 13:37

Thadani SR, Weng C, Bigger JT, Ennever JF, Wajngurt D (2009) Electronic screening improves efficiency in clinical trial recruitment. J Am Med Inform Assoc 16:869–873

De Moor G, Sundgren M, Kalra D, Schmidt A, Dugas M, Claerhout B, Karakoyun T, Ohmann C, Lastic PY, Ammour N, Kush R, Dupont D, Cuggia M, Daniel C, Thienpont G, Coorevits P (2015) Using electronic health records for clinical research: the case of the EHR4CR project. J Biomed Inform 53:162–173

Fordyce CB, Roe MT, Ahmad T, Libby P, Borer JS, Hiatt WR, Bristow MR, Packer M, Wasserman SM, Braunstein N, Pitt B, DeMets DL, Cooper-Arnold K, Armstrong PW, Berkowitz SD, Scott R, Prats J, Galis ZS, Stockbridge N, Peterson ED, Califf RM (2015) Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 65:1567–1582

New JP, Bakerly ND, Leather D, Woodcock A (2014) Obtaining real-world evidence: the Salford Lung Study. Thorax 69:1152–1154

Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS (2014) Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc 21:578–582

Hernandez AF, Fleurence RL, Rothman RL (2015) The ADAPTABLE Trial and PCORnet: shining light on a new research paradigm. Ann Intern Med 163:635–636

Curtis LH, Brown J, Platt R (2014) Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff (Millwood) 33:1178–1186

Jha AK, DesRoches CM, Campbell EG, Donelan K, Rao SR, Ferris TG, Shields A, Rosenbaum S, Blumenthal D (2009) Use of electronic health records in U.S. hospitals. N Engl J Med 360:1628–1638

Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PR, Bernstam EV, Lehmann HP, Hripcsak G, Hartzog TH, Cimino JJ, Saltz JH (2013) Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 51:S30–S37

Brennan L, Watson M, Klaber R, Charles T (2012) The importance of knowing context of hospital episode statistics when reconfiguring the NHS. BMJ 344:e2432

Green SM (2013) Congruence of disposition after emergency department intubation in the National Hospital Ambulatory Medical Care Survey. Ann Emerg Med 61:423–426

Weiskopf NG, Weng C (2013) Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 20:144–151

Elnahal SM, Joynt KE, Bristol SJ, Jha AK (2011) Electronic health record functions differ between best and worst hospitals. Am J Manag Care 17:e121–e147

PubMed Central   PubMed   Google Scholar  

Flynn MR, Barrett C, Cosio FG, Gitt AK, Wallentin L, Kearney P, Lonergan M, Shelley E, Simoons ML (2005) The Cardiology Audit and Registration Data Standards (CARDS), European data standards for clinical cardiology practice. Eur Heart J 26:308–313

Simoons ML, van der Putten N, Wood D, Boersma E, Bassand JP (2002) The Cardiology Information System: the need for data standards for integration of systems for patient care, registries and guidelines for clinical practice. Eur Heart J 23:1148–1152

Sugarman J, Califf RM (2014) Ethics and regulatory complexities for pragmatic clinical trials. JAMA 311:2381–2382

Frobert O, Lagerqvist B, Olivecrona GK, Omerovic E, Gudnason T, Maeng M, Aasa M, Angeras O, Calais F, Danielewicz M, Erlinge D, Hellsten L, Jensen U, Johansson AC, Karegren A, Nilsson J, Robertson L, Sandhall L, Sjogren I, Ostlund O, Harnek J, James SK (2013) Thrombus aspiration during ST-segment elevation myocardial infarction. N Engl J Med 369:1587–1597

Hess CN, Rao SV, Kong DF, Aberle LH, Anstrom KJ, Gibson CM, Gilchrist IC, Jacobs AK, Jolly SS, Mehran R, Messenger JC, Newby LK, Waksman R, Krucoff MW (2013) Embedding a randomized clinical trial into an ongoing registry infrastructure: unique opportunities for efficiency in design of the Study of Access site For Enhancement of Percutaneous Coronary Intervention for Women (SAFE-PCI for Women). Am Heart J 166:421–428

Barry SJ, Dinnett E, Kean S, Gaw A, Ford I (2013) Are routinely collected NHS administrative records suitable for endpoint identification in clinical trials? Evidence from the West of Scotland Coronary Prevention Study. PLoS One 8:e75379

Download references

Acknowledgments

This paper was generated from discussions during a cardiovascular round table (CRT) Workshop organized on 23–24 April 2015 by the European Society of Cardiology (ESC). The CRT is a strategic forum for high-level dialogues between academia, regulators, industry, and ESC leadership to identify and discuss key strategic issues for the future of cardiovascular health in Europe and other parts of the world. We acknowledge Colin Freer for his participation in the meeting. This article reflects the views of the authors and should not be construed to represent FDA’s views or policies. The opinions expressed in this paper are those of the authors and cannot be interpreted as the opinion of any of the organizations that employ the authors. MRC’s salary is supported by the National Institute for Health Research (NIHR) Cardiovascular Biomedical Research Unit at the Royal Brompton Hospital, London, UK.

Conflict of interest

Martin R. Cowie: Research grants from ResMed, Boston Scientific, and Bayer; personal fees from ResMed, Boston Scientific, Bayer, Servier, Novartis, St. Jude Medical, and Pfizer. Juuso Blomster: Astra Zeneca employee. Lesley Curtis: Funding from FDA for work with the Mini-Sentinel program and from PCORI for work with the PCORnet program. Sylvie Duclaux: None. Ian Ford: None. Fleur Fritz: None. Samantha Goldman: None. Salim Janmohamed: GSK employee and shareholder. Jörg Kreuzer: Employee of Boehringer-Ingelheim. Mark Leenay: Employee of Optum. Alexander Michel: Bayer employee and shareholder. Seleen Ong: Employee of Pfizer. Jill Pell: None. Mary Ross Southworth: None. Wendy Gattis Stough: Consultant to European Society of Cardiology, Heart Failure Association of the European Society of Cardiology, European Drug Development Hub, Relypsa, CHU Nancy, Heart Failure Society of America, Overcome, Stealth BioTherapeutics, Covis Pharmaceuticals, University of Gottingen, and University of North Carolina. Martin Thoenes: Employee of Edwards Lifesciences. Faiez Zannad: Personal fees from Boston Scientific, Servier, Pfizer, Novartis, Takeda, Janssen, Resmed, Eli Lilly, CVRx, AstraZeneca, Merck, Stealth Peptides, Relypsa, ZS Pharma, Air Liquide, Quantum Genomics, Bayer for Steering Committee, Advisory Board, or DSMB member. Andrew Zalewski: Employee of GSK.

Author information

Authors and affiliations.

National Heart and Lung Institute, Imperial College London, Royal Brompton Hospital, Sydney Street, London, SW3 6HP, UK

Martin R. Cowie

Astra Zeneca R&D, Molndal, Sweden

Juuso I. Blomster

University of Turku, Turku, Finland

Duke Clinical Research Institute, Durham, NC, USA

Lesley H. Curtis

Servier, Paris, France

Sylvie Duclaux

Robertson Centre for Biostatistics, University of Glasgow, Glasgow, UK

University of Münster, Münster, Germany

Fleur Fritz

Daiichi-Sankyo, London, UK

Samantha Goldman

GlaxoSmithKline, Stockley Park, UK

Salim Janmohamed

Boehringer-Ingelheim, Pharma GmbH & Co KG, Ingelheim, Germany

Jörg Kreuzer

Optum International, London, UK

Mark Leenay

Bayer Pharma, Berlin, Germany

Alexander Michel

Pfizer Ltd., Surrey, UK

Institute of Health and Wellbeing, University of Glasgow, Glasgow, UK

Jill P. Pell

Food and Drug Administration, Silver Spring, MD, USA

Mary Ross Southworth

Campbell University College of Pharmacy and Health Sciences, Campbell, NC, USA

Wendy Gattis Stough

Edwards LifeSciences, Nyon, Switzerland

Martin Thoenes

INSERM, Centre d’Investigation Clinique 9501 and Unité 961, Centre Hospitalier Universitaire, Nancy, France

Faiez Zannad

Department of Cardiology, Nancy University, Université de Lorraine, Nancy, France

Glaxo Smith Kline, King of Prussia, Pennsylvania, USA

Andrew Zalewski


Corresponding author

Correspondence to Martin R. Cowie.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cowie, M.R., Blomster, J.I., Curtis, L.H. et al. Electronic health records to facilitate clinical research. Clin Res Cardiol 106, 1–9 (2017). https://doi.org/10.1007/s00392-016-1025-6


Received: 04 May 2016

Accepted: 05 August 2016

Published: 24 August 2016

Issue Date: January 2017

DOI: https://doi.org/10.1007/s00392-016-1025-6


  • Electronic health records
  • Clinical trials as topic
  • Pragmatic clinical trials as topic
  • Cardiovascular diseases
  • Open access
  • Published: 10 May 2021

Accessing routinely collected health data to improve clinical trials: recent experience of access

  • Archie Macnair   ORCID: orcid.org/0000-0001-5429-9114 1 , 2 ,
  • Sharon B. Love   ORCID: orcid.org/0000-0002-6695-5390 1 , 2 ,
  • Macey L. Murray   ORCID: orcid.org/0000-0001-6418-0854 1 , 2 ,
  • Duncan C. Gilbert   ORCID: orcid.org/0000-0003-1859-7012 1 ,
  • Mahesh K. B. Parmar   ORCID: orcid.org/0000-0003-0166-1700 1 ,
  • Tom Denwood 3 ,
  • James Carpenter   ORCID: orcid.org/0000-0003-3890-6206 1 , 2 , 4 ,
  • Matthew R. Sydes   ORCID: orcid.org/0000-0002-9323-1371 1 , 2 ,
  • Ruth E. Langley   ORCID: orcid.org/0000-0002-9706-016X 1   na1 &
  • Fay H. Cafferty   ORCID: orcid.org/0000-0002-0973-660X 1   na1  

Trials volume 22, Article number: 340 (2021)


Background

Routinely collected electronic health records (EHRs) have the potential to enhance randomised controlled trials (RCTs) by facilitating recruitment and follow-up. Despite this, current EHR use is minimal in UK RCTs, in part due to ongoing concerns about the utility (reliability, completeness, accuracy) and accessibility of the data. The aim of this manuscript is to document the steps, timelines and challenges of the data access application process to help improve the service both for the applicants and data holders.

Methods

This is a qualitative paper providing a descriptive narrative from one UK clinical trials unit (MRC CTU at UCL) of two trial teams’ experience of applying for data from three large English national data holders, the National Cancer Registration and Analysis Service (NCRAS), the National Institute for Cardiovascular Outcomes Research (NICOR) and NHS Digital, in order to establish themes for discussion. The underpinning reason for applying for the data was to compare EHRs with data collected through case report forms in two RCTs, Add-Aspirin (ISRCTN 74358648) and PATCH (ISRCTN 70406718).

Results

The Add-Aspirin trial, which had a pre-planned embedded sub-study to assess EHR, received data from NCRAS 13 months after the first application. In the PATCH trial, the decision to request data was made whilst the trial was recruiting. The study received data from NICOR 8 months, and from NHS Digital 15 months, after final application submission; the process concluded in May 2020. Prior to application submission, significant time and effort were needed, particularly for the PATCH trial, where negotiations over consent and data linkage took many years.

Conclusions

Our experience demonstrates that data access can be a prolonged and complex process. This is compounded if multiple data sources are required for the same project. This needs to be factored in when planning to use EHR within RCTs and is best considered prior to conception of the trial. Data holders and researchers are endeavouring to simplify and streamline the application process so that the potential of EHR can be realised for clinical trials.


Routinely collected electronic health records (EHRs) have been identified as an important innovation in the conduct of randomised clinical trials (RCTs) [ 1 ]. EHRs could improve the efficiency and reduce the cost of trials by enhancing recruitment, providing more complete data sets and minimising loss to follow-up [ 2 , 3 ]. For example, the TASTE trial (ISRCTN16716833), using the Swedish angiography and angioplasty registry, is one of several trials demonstrating the utility of registry-held EHRs to recruit and follow up participants. This study was able to recruit 82% of eligible patients from the registry and obtained complete follow-up data in a trial of 7244 patients [ 4 ]. The investigators also demonstrated meaningfully lower costs for managing the study, with a cost per participant in the order of ~$50, compared with a conventional RCT, where costs may exceed $1000 per participant [ 4 , 5 ].

EHRs are often collected by centralised registries and audits (national or regional) for purposes other than clinical research to gather detailed information on specific diseases, treatments or populations. However, there are concerns, depending on the source, that data collected in this way may not be of appropriate detail or quality for use in clinical trials [ 6 ]. Access to EHRs by researchers usually requires a formal application to the data holder where specific criteria must be evidenced including compliance with information governance (IG) regulations and a clear purpose and legal basis for the data access.

One potential concern for clinical trialists is that the application process will be complex and lengthy and that the data will not be obtained in a timely manner [ 7 ]. There have been reports that RCTs were unable to publish trial results because of problems with data access [ 8 ]. One example is the EPOCH trial (ISRCTN80682973), where the research team were unable to obtain data from Wales on mortality following hospital admission. As a result, the researchers had to change their planned primary analysis to make sure their publication was not delayed significantly [ 9 ].

The aim of this article is to share and reflect upon our experience at the MRC Clinical Trials Unit at UCL (hereafter ‘MRC CTU’) in applying to three national holders of EHR datasets in the UK for data relating to two ongoing RCTs. The intention is to highlight some of the hurdles in obtaining data and discuss possible solutions. The overarching aim is to assist future applicants and help data providers, who are commonly trying to improve their processes and address these issues in a way that is mutually beneficial.

This is a qualitative study based on recent experience of the teams at an accredited clinical trials unit (MRC CTU) in applying for and accessing routine datasets in England (for two separate trials). The data access applications were linked by one main applicant as part of their clinical methodology research. We use a descriptive narrative drawn from documented exchanges between the data holder and applicant to establish themes for discussion. These applications were chosen as they cover recent access to some of the main datasets likely to be used by clinical trialists, with a range of common clinical outcomes. The MRC CTU sought English EHR data for the Add-Aspirin (ISRCTN 74358648) and PATCH (ISRCTN 70406718) trials.

Add-Aspirin aims to assess whether daily aspirin use after treatment for an early-stage cancer can prevent recurrence and improve survival [ 10 ]. It will recruit 11,000 participants in the UK, Republic of Ireland and India; recruitment began in October 2015 and is ongoing. The Add-Aspirin protocol includes a methodological sub-study designed to assess the feasibility of applying for and using EHRs from the National Cancer Registration and Analysis Service (NCRAS) [ 11 ] to assist in the long-term follow-up of participants after completion of trial treatment.

PATCH is a randomised trial of approximately 2500 participants with prostate cancer in the UK. It is assessing the efficacy and safety of a novel therapy, transdermal oestradiol patches, against standard hormone therapy [ 12 ]. Transdermal patches may have a better side-effect profile compared with standard treatment but there was a prior concern about increased cardiovascular toxicity based on trials of oral oestrogens in the 1970s. PATCH therefore had enhanced monitoring of cardiovascular outcomes, gathering all available information about each event with an additional clinical review [ 12 ]. After the trial started, a methodology sub-study was initiated to compare serious adverse cardiovascular events reported by research staff at participating sites through trial-specific data collection forms with those routinely collected from, and reported in, audits held by the National Institute for Cardiovascular Outcomes Research (NICOR) and Hospital Episode Statistics (HES) held by NHS Digital. Concordance between the three datasets would support the premise that routinely collected data could supplement or replace long-term cardiotoxicity data in this trial and other future RCTs.
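As an illustration only, the kind of comparison such a sub-study has in mind can be sketched as a cross-tabulation of participants by whether an event appears in the trial forms, in the routine data, or in both. The Python sketch below is not the authors' analysis: the class and function names, the participant identifiers and the choice of Cohen's kappa as a summary measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concordance:
    both: int        # event recorded in trial forms and in routine data
    trial_only: int  # event recorded in trial forms only
    ehr_only: int    # event recorded in routine data only
    neither: int     # no event in either source

    @property
    def total(self) -> int:
        return self.both + self.trial_only + self.ehr_only + self.neither

    @property
    def observed_agreement(self) -> float:
        return (self.both + self.neither) / self.total

    @property
    def kappa(self) -> float:
        """Cohen's kappa: observed agreement corrected for chance agreement."""
        n = self.total
        p_trial = (self.both + self.trial_only) / n
        p_ehr = (self.both + self.ehr_only) / n
        expected = p_trial * p_ehr + (1 - p_trial) * (1 - p_ehr)
        if expected == 1.0:
            return 1.0  # both sources are constant; treat as full agreement
        return (self.observed_agreement - expected) / (1 - expected)


def compare_sources(participants, trial_events, ehr_events) -> Concordance:
    """Cross-tabulate per-participant event flags from two sources."""
    participants = set(participants)
    trial = set(trial_events) & participants
    ehr = set(ehr_events) & participants
    both = len(trial & ehr)
    trial_only = len(trial - ehr)
    ehr_only = len(ehr - trial)
    neither = len(participants) - both - trial_only - ehr_only
    return Concordance(both, trial_only, ehr_only, neither)


# Synthetic identifiers, invented for illustration only.
everyone = [f"P{i:03d}" for i in range(1, 11)]
result = compare_sources(
    everyone,
    trial_events={"P001", "P002", "P003"},
    ehr_events={"P002", "P003", "P004"},
)
print(result, round(result.observed_agreement, 2), round(result.kappa, 2))
```

With these invented inputs, two participants have an event in both sources, one in each source alone and six in neither. A real comparison would also need to match on event type and allow for differences in recorded dates, which this sketch ignores.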

The routine data to be accessed for these two projects are held and collated by three different organisations with their own individual processes to allow data access. Although the organisations are all within the auspices of the English National Health Service, each has evolved in recent years. This, along with revisions to the legal framework for IG, means that the process of data access has also evolved.

National Cancer Registration and Analysis Service (NCRAS)

In 2016, NCRAS was formed from the merger of the National Cancer Intelligence Network (NCIN) and National Disease Registration (NDR) within Public Health England [ 13 ]. In England, NCRAS manages the collection of data relating to cancer. The aim is to monitor cancer incidence, improve care and clinical outcomes, aid research and support genetic counselling [ 11 ]. NCRAS holds several different datasets covering cancer registration and cancer treatments (systemic therapy and radiotherapy). It can also link these datasets to others held by NHS Digital or the Office for National Statistics (ONS), such as mortality data and HES, via NHS number or other personal identifiers.
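To make the idea of linkage via a shared identifier concrete, the following minimal Python sketch joins registry rows to mortality rows on a common key. It is a hypothetical illustration, not a description of how NCRAS, NHS Digital or ONS perform linkage: the field names, the `link_records` helper and the identifiers (deliberately not real NHS numbers) are assumptions.

```python
def link_records(registry_rows, mortality_rows, key="person_id"):
    """Left-join registry rows to mortality rows on a shared identifier.

    Every registry row is kept; date_of_death is None when no match exists.
    """
    deaths_by_id = {row[key]: row["date_of_death"] for row in mortality_rows}
    return [{**row, "date_of_death": deaths_by_id.get(row[key])} for row in registry_rows]


# Synthetic rows; the identifiers are placeholders, not real NHS numbers.
registry = [
    {"person_id": "ID-0001", "diagnosis": "C18"},
    {"person_id": "ID-0002", "diagnosis": "C50"},
]
mortality = [{"person_id": "ID-0002", "date_of_death": "2019-03-14"}]

linked = link_records(registry, mortality)
for row in linked:
    print(row)
```

Exact matching on a single key is the simplest case. Linkage on other personal identifiers would typically need additional matching rules that are outside the scope of this sketch.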

To gain access to this data for research, an application must be submitted to the Office for Data Release (ODR) [ 14 ]. The ODR application process is outlined in Fig. 1 [ 14 ].

Fig. 1 Flow diagram of data access via the Office for Data Release (ODR) for National Cancer Registration and Analysis Service (NCRAS) data, adapted from Public Health England (PHE) [ 14 ]

NHS Digital

NHS Digital has been the custodian of HES since 2016. Prior to this, it operated as the Health and Social Care Information Centre (HSC-IC) from 2005 [ 15 ]. NHS Digital collects, processes and provides access to many EHR datasets and is continually seeking to supplement this data with other datasets from various care settings. HES is primarily a resource for reimbursement of hospital activity and holds patient-level information on more than 500 variables, including diagnoses, procedures, admission dates, patient demographics and healthcare provider [ 16 ]. NHS Digital receives data access requests from a large number of organisations, most of them local authorities and Clinical Commissioning Groups [ 8 ]; access is provided by application to the Data Access Request Service (DARS) [ 17 ]. The Independent Group Advising on the Release of Data (IGARD) gives an independent final review that aims to improve transparency, accountability, quality and consistency of the application process. IGARD currently meets weekly to make sure that applications are reviewed in a timely fashion. The application process continues to change as NHS Digital attempts to improve its service; the current process is outlined in Fig. 2 [ 17 ].

Fig. 2 Flow diagram of the Data Access Request Service (DARS) for NHS Digital data, adapted from NHS Digital [ 17 ]. IGARD, Independent Group Advising on the Release of Data

National Institute for Cardiovascular Outcomes Research (NICOR)

NICOR collects routine EHR data and produces analyses to enable hospitals and healthcare improvement bodies to monitor and improve the care and outcomes of patients with cardiovascular disease. It manages six national clinical audits and a number of new health technology registries [ 18 ]. NICOR is regulated and contracted by the Health Quality Improvement Partnership (HQIP). NICOR was originally hosted by UCL but moved to Barts Health NHS Trust in 2017. The two audits that were identified as potentially relevant to the PATCH trial were the National Heart Failure Audit (NHFA) and the Myocardial Ischaemia National Audit Project (MINAP). The application process to obtain data from NICOR is shown in Fig. 3 [ 18 ]. Historically, far fewer researchers have used this source compared to NHS Digital and NCRAS [ 8 ].

Fig. 3 Data access request process for National Institute for Cardiovascular Outcomes Research (NICOR) data, adapted from [ 18 ]. HQIP, Health Quality Improvement Partnership; NCAP, National Cardiac Audit Programme

Add-Aspirin

The Add-Aspirin trial was conceived with the recognition that participants will require follow-up for at least 10 years [ 10 ]. This length of follow-up is required to assess the overall risk:benefit balance of regular aspirin use for the trial participants’ health. As for many trials [ 19 ], the intention from the design stage was to obtain data from routinely collected EHRs. When the trial was initially conceived in 2012, the Add-Aspirin trial team met with individuals from NCIN, the predecessor of NCRAS, to assess the feasibility of accessing data and also to ensure that an appropriate budget for this activity was incorporated into funding applications (Fig. 4 ). The protocol, patient information sheets and consent forms were designed to reflect the potential use of routinely collected healthcare data.

Fig. 4 Flow diagram of the Add-Aspirin National Cancer Registration and Analysis Service (NCRAS) application (please note that the timeline is not proportional). REC approval for Add-Aspirin was granted in March 2014. Recruitment opened in October 2015 and is ongoing. CTU, clinical trials unit; DSA, data sharing agreement; NCIN, National Cancer Intelligence Network; ODR, Office for Data Release; REC, Research Ethics Committee

In 2017, after 2 years of recruitment and follow-up, there was a conversation with ODR to confirm the cost and current application process. In 2018, there was sufficient data to initiate the pre-defined methodology sub-study. A pre-application meeting with an ODR senior manager established the documentation that was needed going forward.

Following the implementation of the General Data Protection Regulation (GDPR) in the UK (2018), transparency of how exactly participant data would be used became a legal requirement. The previously agreed consent forms and patient information sheets did not meet the 2018 requirements of GDPR. The solution was for a privacy notice to be drafted and made publicly accessible on the trial’s website. The trial’s IG documentation also needed updating to ensure information security assurances (via the Data Security and Protection Toolkit) were in place within UCL.

Following submission of the data application (December 2018), ODR sent back revisions (January 2019) and confirmed the transparency statement (February 2019). For the application to proceed, an analyst needed to be allocated to check the defined data requirements. In April 2019, NCRAS unfortunately reassigned the analyst allocated to Add-Aspirin to a project considered more critical. There was a meeting in May 2019, once further analytical support had been deployed, to discuss the data field requests. The new analysts suggested that a number of data fields should be expanded to give the best chance of capturing cancer recurrence, as this is not, at present, collected sufficiently well within any single EHR dataset. They acknowledged at that time that algorithms were needed to identify data patterns indicative of tumour recurrence. ODR wanted to ensure that no unnecessary data from HES was provided for each participant. The MRC CTU therefore provided surgical/procedure codes (using Office of Population Censuses and Surveys (OPCS) definitions) and diagnosis codes (ICD-10 codes) to NCRAS to focus and limit the data extraction. In June 2019, it was agreed with ODR and NCRAS that, as this was a methodological project reviewing ways to gather trial outcomes in registry data, all HES data for these patients could be given to the MRC CTU.
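The code-based restriction described above can be pictured as a simple filter over hospital episodes. The Python sketch below is a hypothetical illustration and not the extraction logic used by NCRAS or the MRC CTU: the episode layout, the helper names and the code prefixes are placeholders chosen for the example, not a validated code list.

```python
def normalise(code):
    """Strip the dot and upper-case a code so 'c18.7' compares equal to 'C187'."""
    return code.replace(".", "").strip().upper()


def has_prefix(code, prefixes):
    """True if the normalised code starts with any of the requested prefixes."""
    return normalise(code).startswith(tuple(normalise(p) for p in prefixes))


def select_episodes(episodes, diagnosis_prefixes, procedure_prefixes):
    """Keep episodes with at least one matching diagnosis or procedure code."""
    kept = []
    for episode in episodes:
        diagnosis_hit = any(has_prefix(c, diagnosis_prefixes) for c in episode["diagnoses"])
        procedure_hit = any(has_prefix(c, procedure_prefixes) for c in episode["procedures"])
        if diagnosis_hit or procedure_hit:
            kept.append(episode)
    return kept


# Placeholder code lists and synthetic episodes, for illustration only.
DIAGNOSIS_PREFIXES = ["C18", "C44"]  # ICD-10 style prefixes
PROCEDURE_PREFIXES = ["H07"]         # OPCS style prefix

episodes = [
    {"id": 1, "diagnoses": ["C18.7"], "procedures": []},
    {"id": 2, "diagnoses": ["I21.9"], "procedures": ["K40.1"]},
    {"id": 3, "diagnoses": ["J18.1"], "procedures": ["H07.1"]},
]
selected_ids = [
    e["id"] for e in select_episodes(episodes, DIAGNOSIS_PREFIXES, PROCEDURE_PREFIXES)
]
print(selected_ids)
```

Prefix matching keeps the request narrow while still capturing sub-codes. The trade-off, as the paragraph above suggests for cancer recurrence, is that a narrow list may miss events that are recorded under codes nobody anticipated.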

The application then underwent an ODR internal moderation review, and a month later, a data sharing agreement (DSA) was sent from ODR to MRC CTU. Between August and October 2019, there were ongoing discussions between the MRC CTU contracts department and the ODR. The final DSA was signed on behalf of MRC CTU on 16 October 2019 and fully executed by ODR on 15 November 2019. A further new analyst was then assigned to the project and re-reviewed the data request. This new analyst advised an update to the data censor dates, since more up-to-date data was now available from NCRAS. The updated data request was sent back to ODR for re-signing. The DSA was re-signed and the MRC CTU checked the current consent status of patients before sending participants’ identifiable data to NCRAS on 23 December 2019. The one-off data extracts were successfully received at the MRC CTU on 06 February 2020. This 6-week interval before data receipt was due to NCRAS rewriting their standard filters to provide C44 (non-melanoma skin cancer), a code that is not usually supplied but was needed for this trial. In total, this application, excluding the planning and preparatory work, took approximately 13 months from submission of the application to receiving the data.
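The fully dated milestones in this paragraph can be turned into intervals with a few lines of Python. The sketch uses only dates stated above; the milestone labels and the `intervals` helper are illustrative names rather than anything defined by the data holders.

```python
from datetime import date

# Milestones with exact dates, taken from the narrative above.
milestones = [
    ("DSA signed on behalf of MRC CTU", date(2019, 10, 16)),
    ("DSA fully executed by ODR", date(2019, 11, 15)),
    ("Identifiable data sent to NCRAS", date(2019, 12, 23)),
    ("Data extracts received at MRC CTU", date(2020, 2, 6)),
]


def intervals(events):
    """Return (from_label, to_label, days) for each consecutive pair of events."""
    return [
        (prev_label, label, (day - prev_day).days)
        for (prev_label, prev_day), (label, day) in zip(events, events[1:])
    ]


for start, end, days in intervals(milestones):
    print(f"{start} -> {end}: {days} days ({days / 7:.1f} weeks)")
```

The final interval, 45 days, corresponds to the roughly 6-week wait described above. Milestones given only to the month are left out because an exact interval cannot be computed for them.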

PATCH

The PATCH trial opened to recruitment in 2006 as a phase II feasibility trial, developing into a phase III RCT in 2013. The trial was not initiated with the use of EHR in mind but there was a statement included in the consent form to potentially allow information to be sought from the national registries in the future:

I agree that my details including my full name can be given to the MRC such that long-term follow-up information from the NHS Information Centre and the NHS Central Register or any applicable NHS information system.

With the assumption of valid consent for the use of EHR data, a methodological sub-study was devised to triangulate cardiovascular event data between HES, NICOR and trial data. There was an initial scoping of the project in 2014, with NICOR and HSC-IC advising data linkage before comparison at the MRC CTU (Fig. 5 ). During the initial conversations with NICOR and HSC-IC, the organisations stated that the consent statement was insufficient to acquire linked data from these two sources without first gaining approval from the Confidentiality Advisory Group (CAG). In 2016, the process to submit a CAG application was started. Several months of delays followed due to difficulty in acquiring the appropriate IG documentation for PATCH. CAG requires detailed IG documentation not only for the trial but also, in this case, from NICOR and NHS Digital (formerly HSC-IC until 2016). There were difficulties in identifying the appropriate person for this information within NHS Digital, which took most of 2016 to resolve (note: at this time, case officers were not assigned until after the application was formally submitted). During 2016, an alternative method of data access was explored via NCRAS, but as no cancer data was being sought, this option was deemed unviable. Consequently, in 2017, the project was put on hold.

Fig. 5 Flow diagram of the PATCH joint application to NHS Digital and National Institute for Cardiovascular Outcomes Research (NICOR), subsequently handled as separate applications in 2018 (please note that the timeline is not proportional). REC approval for PATCH was granted in November 2005. Recruitment opened in April 2006 and is ongoing. CAG, Confidentiality Advisory Group; DAO, data approvals officer; DARS, Data Access Request Service; HQIP, Health Quality Improvement Partnership; HSC-IC, Health and Social Care Information Centre; IGARD, Independent Group Advising on the Release of Data; REC, Research Ethics Committee

In October 2018, the MRC CTU re-engaged with NICOR (which had moved to Barts Health NHS Trust following a European Union tender process) and NHS Digital. There were additional complexities for obtaining CAG approval as the PATCH trial at the time was in the process of changing sponsor and therefore the CAG application could not be approved.

As the explicit wording on the consent form was the main issue preventing access to the data, the MRC CTU asked the MRC Regulatory Support Centre for further guidance. They felt that the consent wording was sufficient. NICOR subsequently agreed that, if their data was not sent to NHS Digital for linkage, then CAG approval was not necessary. Therefore a further application was submitted and sent to NICOR for review (Fig. 3 ). NICOR’s review was completed in May 2019. The application was then submitted to HQIP by NICOR. The application was reviewed in June and amendments were returned to MRC CTU. HQIP issued a signed DSA on 19 July 2019, and a NICOR analyst was assigned. The analyst continued discussions with the MRC CTU on data extraction, and a one-off data extract was received at the MRC CTU on 17 October 2019.

As with NICOR, NHS Digital was re-engaged in October 2018, and it took several weeks to gain access to the DARS online system because of technical difficulties with the system (Fig. 5 ). A new DARS application was submitted in February 2019, but this was initially rejected due to issues around consent and sponsorship and because it did not meet the DARS checklist criteria. After a phone call to DARS and changes to the application by the MRC CTU, it was accepted and a case officer allocated. The case officer reviewed the application and made extensive comments with required changes. A privacy notice was created for the project and circulated to participants once it was ethically approved. NHS Digital then advised that the application could not proceed until the NICOR DSA was signed, sponsorship clarified and the new protocol for the sub-study had been ethically approved.

Sponsorship was not resolved until September 2019, and at that point, the MRC CTU re-engaged with NHS Digital. On receipt of the revised application, NHS Digital returned it to the DARS triage service and a new case officer was allocated. Over the next few months, the case officer made amendments to the application and sent it internally to the data approvals officer (DAO). The DAO asked for further changes to clarify certain points, and the application was submitted to IGARD in December 2019 for final review. IGARD approved the application in January 2020 subject to one last data specification amendment. The DSA was signed on behalf of the MRC CTU in February 2020, and the MRC CTU uploaded identifiable data to NHS Digital in March. The NHS Digital production team made data available in May and data was received at the MRC CTU on 21 May 2020. When all efforts are taken into consideration, it has taken several years to obtain data from both of these providers. However, from the most recent effort, data was received approximately 8 and 15 months after submission of formal applications to NICOR and NHS Digital respectively.

This article describes the MRC CTU’s experience of attempting to access EHR data from three English national data holders (NCRAS, NICOR and NHS Digital) for two large trials with a view to identifying shareable lessons. These data access applications were chosen as they were both for methodological studies embedded within RCTs looking at the appropriateness of EHR data to be used in trial follow-up with the important juxtaposition of where data access is planned versus being a later addition. The aim was to improve the knowledge and experience of gaining access to these datasets and to assess the accuracy of nationally held EHR data compared to data manually collected as part of conventional trial-specific follow-up. Our experience was challenging and took many person hours over 8 to 15 months from formally submitting an application to receiving the data.

This paper has limitations. It is specific to English national data holders, and other countries may not have the same application issues or comparable registry data quality. It also describes the experience of a single clinical trials unit, and the difficulties we had in acquiring the data may be unique to us. The nature of the trials, the infrastructure within this specific trials unit, the introduction of significant data protection legislation (GDPR; May 2018) during the period, which brought new requirements, and the relative infrequency of our applications could all be factors in the delays encountered. The process of applying for data for the PATCH trial started more than 5 years ago, but the most recent iteration of applications for data started in October 2018. However, this is not a story in isolation, and other publications have described similar problems [ 7 , 9 , 20 , 21 ]. At present, the application process for each of these datasets is too complicated and discourages researchers from using this invaluable data. A recent survey of the cancer research community, conducted by the National Cancer Research Institute, found that fewer than half of respondents were successful in accessing data from the national datasets and, when asked what would help most, the majority answered ‘support through data access process’ and ‘improving timelines for the application approval’ [ 22 ]. The difficulty of accessing this data may be why so few clinical trials have used national datasets to enrich or replace data collected via conventional case report forms [ 8 ].

From a clinical trialist's perspective, several lessons have been learnt about the process of applying for and obtaining EHR data. Firstly, it is extremely challenging to acquire data for an actively recruiting trial that had not planned this acquisition in advance. The main issue for the PATCH trial application was that the wording in the trial protocol, consent forms and patient information sheets had not been designed with the sub-study in mind when the application process started. Although the wording followed current recommendations when first written, information governance procedures and regulations have since evolved. In contrast, the Add-Aspirin trial had a good foundation because of preparatory work done before the application process began, which meant that fewer amendments were needed in response to new data laws. Clinical trials units need to work closely with registries and data holders to establish the most efficient methods to obtain and access EHR data; this could include clear guidance on the optimal timing of data requests (such as at trial initiation) and accessible, transparent cost structures to allow trialists to obtain sufficient funding for repeated data access through the lifetime of a trial.

Secondly, all clinical trials units need appropriate infrastructure to provide the high level of data security needed for storing EHR data, evidenced through a completed and endorsed Data Security and Protection Toolkit. One example is the formation of ‘Trusted Research Environments’: cyber-secure virtual locations from which identifiable data cannot be removed and which only verified researchers can access, depending on IG training and specified parameters. Such infrastructure is complicated and costly, taking considerable time to set up and to manage going forward. Once the required infrastructure is established, the data security and IG controls should be valid for any national dataset. The connectivity of these datasets is also an issue, with separate applications having to be completed for several organisations/countries within the UK, which takes a considerable amount of time and money. One solution would be a ‘passport’ system for data access, allowing an institution that has demonstrated appropriate data security and IG controls to fast-track the process. Another solution would be to link more datasets and allow a single application to cover them. New collaborative initiatives are under way, such as VICORI, which links NICOR and NCRAS data [ 23 ].

Lastly, the applicant also needs experience in how to answer the questions in the forms so that the application stands up to the scrutiny of the data controllers’ checks. These assessments are appropriate but, without prior knowledge, applications are often rejected because of their wording rather than the nature of the request. This can make the process difficult for clinical trials units that apply only occasionally, since key knowledge may be lost between applications, leading to repeated errors, or the team may be unaware of how the process has changed. This lack of experience can only be addressed through resources provided by the dataset organisations and more guidance through the application process from experienced case officers within those organisations.

NHS Digital and NCRAS are continuing to improve their accessibility through guidelines for the application process, seminars and videos. NHS Digital has established a clinical trials service in collaboration with Health Data Research UK, the University of Oxford, IBM and Microsoft [ 24 ]. This service, ‘NHS DigiTrials’, is in its infancy and is initially concentrating on helping new trials with the identification of potential participants and the follow-up of participants during and after the trial. As part of this, it is directing its attention to helping with access to EHR data for clinical trials by increasing the speed of access and widening the range of data types available. NICOR is also striving to streamline its application process, both internally and with HQIP, to avoid unnecessary delays for appropriate research applications. During the COVID pandemic, there has also been data sharing and routine linkage for the first time between NICOR and NHS Digital, which has been used in a number of publications [ 25 ].

For routinely collected EHR data to be a viable source of data for clinical trials, data access must take no longer than a few months; otherwise, delays cause difficulty with funding and with the timely reporting of key outcomes. The records within the databases also need to be up to date. Some may have a reporting lag of up to a year, which limits their utility. In addition, better coordination and linkage between the datasets held by separate data controllers would reduce the burden on applicants. Health Data Research UK (HDR UK) is working with key stakeholders to improve data ‘inclusivity and transparency’ and to push the agenda of using data for science, not only with relevant organisations but also with the public. This includes improving navigation across datasets from different data controllers, via the Health Data Research Innovation Gateway, and bringing together different data controllers under the UK Health Data Alliance [ 26 ]. This is consistent with their bold statement of ‘Our Data, Our Society, Our Health’ [ 20 ]. It will hopefully allow the right data to be given to the right people in an efficient but transparent way and provide reassurance to the general public. Accessibility is the first challenge in the use of this data, but there is still concern about how appropriate the data is, given that it was not designed for clinical trials. Evaluation of the reliability, completeness and accuracy of the data is needed. The analysis of the EHR data in the two methodology projects described above is ongoing and will be the subject of separate publications, which will further inform the discussion around the utility of EHR in trials.

EHRs contain a wealth of information about individual patients’ health outcomes, which can be useful for clinical trials. Our experience demonstrates that data access can be a prolonged and complex process. This is compounded by the fact that multiple data sources, sometimes from different data holders, will often be required for the same project. Improving data access would be the first step towards realising the potential of these datasets. Based on our experience of successfully accessing datasets from NHS Digital, NCRAS and NICOR, we recommend that researchers considering the use of EHR data in their clinical trials plan data acquisition before trial set-up, so that appropriate consent, legal purpose and infrastructure are in place to comply with data security requirements and the law. Data holders and researchers are endeavouring to simplify and streamline the application process so that the potential of EHR can be realised for clinical trials.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Abbreviations

CAG: Confidentiality Advisory Group of the Health Research Authority (HRA)
CTU: Clinical trial unit
DAO: Data approvals officer
DARS: Data Access Request Service
DSA: Data sharing agreement
EHR: Electronic health record
GDPR: General Data Protection Regulation (EU) 2016/679; implemented 25 May 2018
HDR UK: Health Data Research UK
HES: Hospital Episodes Statistics
HQIP: Health Quality Improvement Partnership
HSCIC: Health and Social Care Information Centre
IG: Information governance
IGARD: Independent Group Advising on the Release of Data
MINAP: Myocardial Ischaemia National Audit Project
MRC CTU: Medical Research Council Clinical Trials Unit at UCL
NCAP: National Cardiac Audit Programme
NCIN: National Cancer Intelligence Network
NCRAS: National Cancer Registration and Analysis Service
NDR: National Disease Registration
NHFA: National Heart Failure Audit
NICOR: National Institute for Cardiovascular Outcomes Research
ODR: Public Health England (PHE) Office for Data Release
ONS: Office for National Statistics
OPCS: Office of Population Censuses and Surveys
Privacy notice: A statement made to a data subject that describes how the organisation collects, uses, retains and discloses personal information; also known as a transparency notice
REC: Research Ethics Committee

1. Lauer MS, D’Agostino RB. The randomized registry trial — the next disruptive technology in clinical research? New Engl J Med. 2013;369(17):1579–81. https://doi.org/10.1056/NEJMp1310102.
2. Mc Cord KA, Al-Shahi Salman R, Treweek S, Gardner H, Strech D, Whiteley W, et al. Routinely collected data for randomized trials: promises, barriers, and implications. Trials. 2018;19(1):29. https://doi.org/10.1186/s13063-017-2394-5.
3. Appleyard SE, Gilbert DC. Innovative solutions for clinical trial follow-up: adding value from nationally held UK data. Clin Oncol. 2017;29(12):789–95. https://doi.org/10.1016/j.clon.2017.10.003.
4. Lagerqvist B, Fröbert O, Olivecrona GK, Gudnason T, Maeng M, Alström P, et al. Outcomes 1 year after thrombus aspiration for myocardial infarction. New Engl J Med. 2014;371(12):1111–20. https://doi.org/10.1056/NEJMoa1405707.
5. Shore BJ, Nasreddine AY, Kocher MS. Overcoming the funding challenge: the cost of randomized controlled trials in the next decade. JBJS. 2012;94(Supplement_1):101–6.
6. McCord K, Hemkens L. Using electronic health records for clinical trials: where do we stand and where can we go? Can Med Assoc J. 2019;191(5):E128–E33. https://doi.org/10.1503/cmaj.180841.
7. Lugg-Widger F, Angel L, Cannings-John R, Hood K, Hughes K, Moody G, et al. Challenges in accessing routinely collected data from multiple providers in the UK for primary studies: managing the morass. Int J Popul Data Sci. 2018;3(3):1–14.
8. Lensen S, Macnair A, Love SB, Yorke-Edwards V, Noor NM, Martyn M, et al. Access to routinely collected health data for clinical trials – review of successful data requests to UK registries. Trials. 2020;21(1):398. https://doi.org/10.1186/s13063-020-04329-8.
9. Peden CJ, Stephens T, Martin G, Kahan BC, Thomson A, Rivett K, et al. Effectiveness of a national quality improvement programme to improve survival after emergency abdominal surgery (EPOCH): a stepped-wedge cluster-randomised trial. Lancet. 2019;393(10187):2213–21. https://doi.org/10.1016/S0140-6736(18)32521-2.
10. Coyle C, Cafferty FH, Rowley S, MacKenzie M, Berkman L, Gupta S, et al. ADD-ASPIRIN: a phase III, double-blind, placebo controlled, randomised trial assessing the effects of aspirin on disease recurrence and survival after primary therapy in common non-metastatic solid tumours. Contemp Clin Trials. 2016;51:56–64. https://doi.org/10.1016/j.cct.2016.10.004.
11. Public Health England. Guidance: National Cancer Registration and Analysis Service. 2020. Available from: https://www.gov.uk/guidance/national-cancer-registration-and-analysis-service-ncras. Accessed 19/02/2020.
12. Langley RE, Cafferty FH, Alhasso AA, Rosen SD, Sundaram SK, Freeman SC, et al. Cardiovascular outcomes in patients with locally advanced and metastatic prostate cancer treated with luteinising-hormone-releasing-hormone agonists or transdermal oestrogen: the randomised, phase 2 MRC PATCH trial (PR09). Lancet Oncol. 2013;14(4):306–16. https://doi.org/10.1016/S1470-2045(13)70025-1.
13. Public Health England. National Cancer Intelligence Network (NCIN): 30 + years of cancer intelligence - challenges of technologies of the time. Available from: http://www.ncin.org.uk/home. Accessed 09/08/2019.
14. Public Health England. Guidance: accessing PHE data through the Office for Data Release. 2020. Available from: https://www.gov.uk/government/publications/accessing-public-health-england-data/about-the-phe-odr-and-accessing-data. Accessed 19/02/2020.
15. Gov.uk. HSCIC changing its name to NHS Digital. 2016. Available from: https://www.gov.uk/government/news/hscic-changing-its-name-to-nhs-digital. Accessed 19/02/2020.
16. Boyd A. Understanding Hospital Episode Statistics (HES). London, UK: CLOSER; 2017.
17. NHS Digital. Data Access Request Service (DARS): process. 2019. Available from: https://digital.nhs.uk/services/data-access-request-service-dars/data-access-request-service-dars-process. Accessed 20/02/2020.
18. NICOR. NICOR. 2020. Available from: https://www.nicor.org.uk/. Accessed 20/02/2020.
19. McKay AJ, Jones AP, Gamble CL, et al. Use of routinely collected data in a UK cohort of publicly funded randomised clinical trials. F1000Research. 2020;9:323.
20. Ford E, Boyd A, Bowles JKF, Havard A, Aldridge RW, Curcin V, et al. Our data, our society, our health: a vision for inclusive and transparent health data science in the United Kingdom and beyond. Learning Health Systems. 2019;3(3):e10191. https://doi.org/10.1002/lrh2.10191.
21. Dattani N, Hardelid P, Davey J, Gilbert R. Accessing electronic administrative health data for research takes time. Arch Dis Childhood. 2013;98(5):391–2. https://doi.org/10.1136/archdischild-2013-303730.
22. National Cancer Research Institute. The researchers’ experience when attempting to access health data for research. 2020. Available from: https://www.ncri.org.uk/ncri-blog/accessing-health-data-for-research/. Accessed 28/02/2020.
23. Public Health England. Current analytical partnerships involving the National Cancer Registration and Analysis Service. 2019. Available from: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/787750/Current_analytical_partnerships_involving_NCRAS.pdf. Accessed 01/05/2020.
24. NHS Digital. NHS DigiTrials. 2020. Available from: https://digital.nhs.uk/services/nhs-digitrials. Accessed 01/06/2020.
25. Mohamed MO, Gale CP, Kontopantelis E, Doran T, de Belder M, Asaria M, et al. Sex-differences in mortality rates and underlying conditions for COVID-19 deaths in England and Wales. Mayo Clinic Proceedings. 2020;95(10):2110–24. https://doi.org/10.1016/j.mayocp.2020.07.009.
26. Health Data Research UK. Health Data Research Innovation Gateway. 2020. Available from: https://www.healthdatagateway.org/. Accessed 01/06/2020.

Acknowledgements

We acknowledge and are grateful to NHS Digital Research and clinical trials team and to Mark De Belder, NICOR Operational and Methodology Group Chair, and Luke Hounsome, PhD, Analytical Programme Manager at NCRAS for their comments received on the draft manuscript.

This work was supported by Health Data Research UK and the Medical Research Council (MC_UU_12023/24). The Add-Aspirin trial is being jointly funded by Cancer Research UK (grant number C471/A15015), the National Institute for Health Research Health Technology Assessment Programme (project reference 12/01/38) and the MRC Clinical Trials Unit at UCL (MC_UU_12023/28). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The PATCH study is funded by Cancer Research UK (grant number C471/A12443; trial CRUK/06/001) and University College London (UCL); it is now sponsored by UCL and was previously sponsored by Imperial College London.

Author information

Ruth E. Langley and Fay H. Cafferty contributed equally to this work.

Authors and Affiliations

MRC Clinical Trials Unit at UCL, UCL, London, WC1V 6LJ, UK

Archie Macnair, Sharon B. Love, Macey L. Murray, Duncan C. Gilbert, Mahesh K. B. Parmar, James Carpenter, Matthew R. Sydes, Ruth E. Langley & Fay H. Cafferty

Health Data Research UK, London, UK

Archie Macnair, Sharon B. Love, Macey L. Murray, James Carpenter & Matthew R. Sydes

NHS Digital, 1 Trevelyan Square, Leeds, LS1 6AE, UK

Tom Denwood

Medical Statistics, London School of Hygiene and Tropical Medicine, London, WC1E 7HT, UK

James Carpenter

Contributions

AM conceived the manuscript and led the writing. REL, SL, JC, TD, MM and MS wrote critical sections and reviewed and agreed the final version. REL, DG, FC and MP are clinical and statistical leads for Add-Aspirin and PATCH trials and reviewed and agreed the final version of the document. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Archie Macnair .

Ethics declarations

Ethics approval and consent to participate.

Add-Aspirin was approved by the South Central – Oxford C research ethics committee and is part of the UK National Cancer Research Network (NCRN) portfolio. PATCH was approved by the Leeds (East) Research Ethics Committee.

Consent for publication

Not applicable

Competing interests

All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare no support from any organisation for the submitted work; however, MS reports grants from Health Data Research UK, during the conduct of the study; personal fees from Lilly Oncology; personal fees from Janssen; grants and non-financial support from Astellas; grants and non-financial support from Clovis Oncology; grants and non-financial support from Janssen; grants and non-financial support from Novartis; grants and non-financial support from Pfizer; and grants and non-financial support from Sanofi-Aventis, outside the submitted work; FC reports receipt of research grants for the Add-Aspirin trial from Cancer Research UK and the National Institute of Health Research, as well as study drug provision from Bayer Pharmaceuticals. REL reports grants from Cancer Research UK; grants from UK Medical Research Council, during the conduct of the study; and personal fees from Aspirin Foundation, outside the submitted work.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Macnair, A., Love, S.B., Murray, M.L. et al. Accessing routinely collected health data to improve clinical trials: recent experience of access. Trials 22 , 340 (2021). https://doi.org/10.1186/s13063-021-05295-5

Received : 06 October 2020

Accepted : 24 April 2021

Published : 10 May 2021

DOI : https://doi.org/10.1186/s13063-021-05295-5

  • Routinely collected data
  • Electronic health records
  • Data accessibility
  • Clinical trials

ISSN: 1745-6215

  • Year in Review
  • Published: 19 December 2022

Digital rheumatology in 2022

New developments in electronic health record analysis

  • Jutta G. Richter 1 &
  • Christian Thielscher   ORCID: orcid.org/0000-0002-9987-7325 2  

Nature Reviews Rheumatology volume  19 ,  pages 74–75 ( 2023 ) Cite this article

  • Health care
  • Rheumatology

Electronic health records (EHRs) contain enormous amounts of real-world data that could inform researchers, doctors and patients about many aspects of rheumatology. However, EHRs are not yet fully utilized, mainly because automatic data extraction is difficult. Several studies in 2022 highlight the feasibility and clinical utility of computer-assisted EHR analysis.

Key advances

Consolidation of electronic health record (EHR) databases can improve our understanding of real-world patient journeys [2].

A newly developed natural language processing (NLP) pipeline can automatically extract outcome measures from EHR databases and has potential for EHR data analysis for both clinical and research purposes [4].

Although an automated phenotyping algorithm has potential for diagnosing patients with certain subtypes of rheumatoid arthritis using EHR data, design choices (such as the definition of what key elements are used for phenotyping) and missing data currently limit its clinical utility [3].

The growth of digitalization, including artificial intelligence, machine learning, big data, telemedicine and other new information and communication technologies (ICTs), provides the potential to improve the diagnosis and treatment of patients with rheumatic diseases. Many ICTs are now entering clinical practice or are already part of standard care. For example, electronic health records (EHRs) and/or other patient documentation systems (such as in hospitals, practices and laboratories) offer a rich resource of data to advance our understanding of rheumatic conditions, and can complement traditional study designs because they capture almost the complete variety of patient journeys with real-world data, leading to more generalizable results [1]. In addition, these systems contain an increasing amount of data that might be used to analyse the epidemiological trends of inflammatory rheumatic diseases. However, difficulties remain in utilizing these data, as EHR databases are typically partitioned into small entities and extracting the data is challenging. Three studies in 2022 highlight promising approaches for addressing these issues, creating new epidemiological insights from big data and improving the feasibility and utility of EHR analysis [2, 3, 4].

Consolidation of EHR databases might help optimize EHR analysis to better capture epidemiological trends, as shown by Scott et al. [2]. To study the epidemiology of rheumatoid arthritis (RA), psoriatic arthritis (PsA) and axial spondyloarthritis (SpA) in England, the researchers analysed the Clinical Practice Research Datalink (CPRD) Aurum database, which contains longitudinal routinely collected EHRs from UK primary care practices. The database captures information ranging from demographic characteristics, diagnoses and symptoms to drug exposures and lab tests, and currently covers around 20% of the population in England, with a median follow-up time of ~9 years.

Scott and colleagues used algorithms and updated diagnostic codes, as well as synthetic DMARD code lists, to ascertain patients with a diagnosis of RA, PsA or axial SpA. This approach enabled the researchers to calculate the annual incidence and point-prevalence of RA, PsA and axial SpA diagnoses from 2004 to 2020, stratified by age and sex. For example, the point-prevalence of RA and PsA diagnoses increased annually from 2004 onwards, peaking in 2019, before falling slightly. The point-prevalence of axial SpA diagnoses increased annually (except in 2018 and 2019), peaking in 2020. Finally, the annual incidence of RA, PsA and axial SpA diagnoses fell by 40.1%, 67.4%, and 38.1%, respectively between 2019 and 2020, probably reflecting the impact of the COVID-19 pandemic. This type of insight is especially useful for planning and shaping health services (in this case, NHS services) particularly for the elderly population. Similar approaches could be used in other health-care systems to plan accordingly.
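The incidence figures above are relative changes between annual rates. The sketch below shows that arithmetic under stated assumptions: the function names are ours, and the diagnosis counts and person-years are invented for illustration, not taken from Scott and colleagues' study.

```python
def incidence_per_100k(new_diagnoses: int, person_years: float) -> float:
    """Annual incidence expressed per 100,000 person-years at risk."""
    return new_diagnoses * 100_000 / person_years


def percent_fall(before: float, after: float) -> float:
    """Relative fall from `before` to `after`, as a percentage of `before`."""
    return (before - after) * 100 / before


# Hypothetical counts for one diagnosis in two consecutive years.
rate_2019 = incidence_per_100k(new_diagnoses=400, person_years=1_000_000)
rate_2020 = incidence_per_100k(new_diagnoses=240, person_years=1_000_000)

fall = percent_fall(rate_2019, rate_2020)
print(f"2019: {rate_2019:.1f}, 2020: {rate_2020:.1f}, fall: {fall:.1f}%")
```

Stratifying by age and sex, as in the study, would amount to applying the same two functions within each stratum.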

In many situations, automatically extracting data on patients with a certain diagnosis from a database and/or defining subgroups of patients using these data can be useful for researchers. Zheng et al. [3] studied the ability of the Phenotype KnowledgeBase (PheKB) algorithm to automatically identify patients with RA from an EHR database. They found that the specificity of this algorithm was quite good (95%), but the sensitivity was poor (~72%). Notably, the sensitivity of this algorithm was especially low in patients with seronegative RA. The phenotyping algorithm used an automated calculation (based on penalized logistic regression) to select clinically relevant features. Various useful features were captured by the algorithm (such as International Classification of Diseases (ICD) codes and rheumatoid factor laboratory test results), but others were missed, including anti-citrullinated protein antibody (ACPA) laboratory test results and text-based indications of joint involvement. In addition, the phenotyping algorithms were unusable for a notable number of patients owing to a lack of data in the necessary structured format. Hence, the results indicate that the ability of this platform to identify the key data elements needed to define phenotypes is limited and that expert input is still required. These findings highlight the need for careful design choices when developing phenotyping algorithms. Before phenotyping algorithms can be implemented in routine care, approaches for handling missing data are needed.
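For readers less familiar with these measures, sensitivity and specificity follow directly from a confusion matrix. The sketch below is a minimal illustration: the function names are ours, and the counts are hypothetical values chosen only so that the results match the reported rates (~72% and 95%); they are not data from Zheng et al.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of patients who truly have RA that the algorithm flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of patients without RA that the algorithm correctly leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical chart-review results: 100 patients with RA, 100 without.
sens = sensitivity(true_pos=72, false_neg=28)
spec = specificity(true_neg=95, false_pos=5)

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

With these illustrative counts, 28 of 100 true RA cases would be missed, which is why low sensitivity matters for subgroups such as seronegative RA.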

Although most EHR systems include some structured data fields for capturing particular information (such as ICD codes), the included fields and their usability can vary across systems. Moreover, the majority of EHR data are often documented in an unstructured format (such as text) and are thus difficult to analyse. The study by Humbert-Droz et al. [4] highlights one method for navigating this issue. Using data from 2015–2018, including 34 million notes from 854,628 patients, 158 practices and 24 EHRs, the researchers developed and evaluated a natural language processing (NLP) pipeline for extracting mentions of rheumatoid arthritis outcome measures and scores from free-text outpatient rheumatology notes within the Rheumatology Informatics System for Effectiveness (RISE) registry. The RISE registry combines data from different EHRs and consolidates them. The NLP pipeline had good internal and external validity, with a sensitivity, positive predictive value and F1 score of 95%, 87% and 91%, respectively. Substantial agreement was observed between the scores extracted from the RISE notes and scores derived from structured data within the RISE registry. Thus, the pipeline has potential for facilitating outcome measurement in research and also in clinical care. In the future, the NLP pipeline might also support personalized medicine if used, for example, to automatically analyse the historical EHR data of a specific patient.
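The three reported figures are mutually consistent: the F1 score is the harmonic mean of positive predictive value (precision) and sensitivity (recall). A minimal check, using only the percentages quoted above and a function name of our own choosing:

```python
def f1_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Values reported for the NLP pipeline: sensitivity 95%, PPV 87%.
f1 = f1_score(ppv=0.87, sensitivity=0.95)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.91"
```

Because the harmonic mean is pulled towards the lower of its two inputs, the F1 score sits closer to the 87% PPV than a simple average would.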

Rheumatological diseases are typically chronic in nature. Over time, EHRs can gather enormous amounts of data on individual patients that are difficult to track for human doctors but might provide very helpful information. Putting together EHR data in ever-increasing databases helps to improve research (for example, epidemiology research), and tools such as the NLP pipeline should enable automatic access to this rich resource. In summary, the use of artificial intelligence and machine learning algorithms will hopefully lead to optimized patient-centred care in the near future.

1. Knevel, R. & Liao, K. P. From real-world electronic health record data to real-world results using artificial intelligence. Ann. Rheum. Dis. https://doi.org/10.1136/ard-2022-222626 (2022).
2. Scott, I. C. et al. Rheumatoid arthritis, psoriatic arthritis, and axial spondyloarthritis epidemiology in England from 2004 to 2020: An observational study using primary care electronic health record data. Lancet Reg. Health Eur. 23, 100519 (2022).
3. Zheng, H. W. et al. Evaluation of an automated phenotyping algorithm for rheumatoid arthritis. J Biomed Inform. 135, 104214 (2022).
4. Humbert-Droz, M. et al. Development of a Natural Language Processing System for Extracting Rheumatoid Arthritis Outcomes From Clinical Notes Using the National Rheumatology Informatics System for Effectiveness Registry. Arthritis Care Res. (Hoboken) https://doi.org/10.1002/acr.24869 (2022).

Author information

Authors and affiliations.

Department for Rheumatology and Hiller Research Center, University Hospital, Medical Faculty of Heinrich-Heine-University Duesseldorf, Duesseldorf, Germany

Jutta G. Richter

Competence Center for Medical Economics, FOM University, Essen, Germany

Christian Thielscher

Corresponding author

Correspondence to Christian Thielscher .

Ethics declarations

Competing interests.

The author declares no competing interests.

Additional information

Related links.

The Clinical Practice Research Datalink (CPRD) Aurum database: https://cprd.com/cprd-aurum-march-2021

The Rheumatology Informatics System for Effectiveness (RISE) registry: https://www.rheumatology.org/Practice-Quality/RISE-Registry

About this article

Cite this article.

Richter, J.G., Thielscher, C. New developments in electronic health record analysis. Nat Rev Rheumatol 19 , 74–75 (2023). https://doi.org/10.1038/s41584-022-00894-1

Published : 19 December 2022

Issue Date : February 2023

DOI : https://doi.org/10.1038/s41584-022-00894-1

EHR-Safe: Generating high-fidelity and privacy-preserving synthetic electronic health records

Analysis of Electronic Health Records (EHR) has tremendous potential for enhancing patient care, quantitatively measuring performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes), track patient wellness, and predict how patients respond to specific drugs. For such models, researchers and practitioners need access to EHR data. However, it can be challenging to leverage EHR data while ensuring data privacy and conforming to patient confidentiality regulations (such as HIPAA).

Conventional methods to anonymize data (e.g., de-identification) are often tedious and costly. Moreover, they can distort important features from the original dataset, decreasing the utility of the data significantly; they can also be susceptible to privacy attacks. Alternatively, an approach based on generating synthetic data can maintain both important dataset features and privacy.

To that end, we propose a novel generative modeling framework in “EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records”. With the innovative methodology in EHR-Safe, we show that synthetic data can satisfy two key properties: (i) high fidelity (i.e., they are useful for the task of interest, such as having similar downstream performance when a diagnostic model is trained on them), and (ii) compliance with certain privacy measures (i.e., they do not reveal any real patient's identity). Our state-of-the-art results stem from novel approaches for encoding/decoding features, normalizing complex distributions, conditioning adversarial training, and representing missing data.

Challenges of Generating Realistic Synthetic EHR Data

There are multiple fundamental challenges to generating synthetic EHR data. EHR data contain heterogeneous features with different characteristics and distributions. There can be numerical features (e.g., blood pressure) and categorical features with many or two categories (e.g., medical codes, mortality outcome). Some of these may be static (i.e., not varying during the modeling window), while others are time-varying, such as regular or sporadic lab measurements. Distributions might come from different families: categorical distributions can be highly non-uniform (e.g., for under-represented groups) and numerical distributions can be highly skewed (e.g., a small proportion of values being very large while the vast majority are small). Depending on a patient's condition, the number of visits can also vary drastically. Some patients visit a clinic only once whereas some visit hundreds of times, leading to a variance in sequence lengths that is typically much higher compared to other time-series data. There can be a high ratio of missing features across different patients and time steps, as not all lab measurements or other input data are collected.

EHR-Safe: Synthetic EHR Data Generation Framework

EHR-Safe consists of a sequential encoder-decoder architecture and generative adversarial networks (GANs), depicted in the figure below. Because EHR data are heterogeneous (as described above), direct modeling of raw EHR data is challenging for GANs. To circumvent this, we propose utilizing a sequential encoder-decoder architecture to learn the mapping from the raw EHR data to the latent representations, and vice versa.

While learning the mapping, esoteric distributions of numerical and categorical features pose a great challenge. For example, some values or numerical ranges might dominate the distribution, but the capability of modeling rare cases is essential. The proposed feature mapping and stochastic normalization (transforming original feature distributions into uniform distributions without information loss) are key to handling such data by converting them to distributions for which the training of the encoder-decoder and GAN is more stable (details can be found in the paper). The mapped latent representations, generated by the encoder, are then used for GAN training. After training both the encoder-decoder framework and GANs, EHR-Safe can generate synthetic heterogeneous EHR data from any input, for which we feed randomly sampled vectors. Note that only the trained generator and decoders are used for generating synthetic data.
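To make the idea of stochastic normalization more concrete, here is a minimal sketch under stated assumptions. It gives each unique value a slice of [0, 1) whose width equals the value's frequency and replaces the value with a uniform draw from that slice, so the result is approximately uniform and can be inverted exactly. The function names and the toy lab values are hypothetical, and the exact procedure in the paper may differ.

```python
import bisect
import random
from collections import Counter


def fit_intervals(values):
    """Assign each unique value a slice of [0, 1) sized by its relative frequency."""
    counts = Counter(values)
    uniques = sorted(counts)
    edges = [0.0]
    for value in uniques:
        edges.append(edges[-1] + counts[value] / len(values))
    edges[-1] = 1.0  # guard against floating-point drift in the running sum
    return uniques, edges


def normalize(values, uniques, edges, rng):
    """Replace each value with a uniform draw from its own slice."""
    index = {value: i for i, value in enumerate(uniques)}
    out = []
    for value in values:
        low, high = edges[index[value]], edges[index[value] + 1]
        u = low + rng.random() * (high - low)
        out.append(low if u >= high else u)  # keep the draw strictly inside [low, high)
    return out


def denormalize(normalized, uniques, edges):
    """Invert the mapping by looking up which slice each number falls into."""
    return [uniques[bisect.bisect_right(edges, u) - 1] for u in normalized]


# Toy, highly skewed feature: one dominant value and one rare extreme value.
values = [0.5] * 6 + [1.2] * 3 + [250.0]
uniques, edges = fit_intervals(values)
normalized = normalize(values, uniques, edges, random.Random(0))
restored = denormalize(normalized, uniques, edges)
print(restored == values)
```

Because every slice maps back to exactly one original value, the round trip loses no information, while the dominant value no longer collapses onto a single point of the distribution.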

We focus on two real-world EHR datasets to showcase the EHR-Safe framework, MIMIC-III and eICU. Both are inpatient datasets that consist of varying lengths of sequences and include multiple numerical and categorical features with missing components.

Fidelity Results

The fidelity metrics focus on the quality of synthetically generated data by measuring how realistic the synthetic data are. Higher fidelity implies that it is more difficult to differentiate between synthetic and real data. We evaluate the fidelity of synthetic data in terms of multiple quantitative and qualitative analyses.

Visualization

Having similar coverage and avoiding under-representation of certain data regimes are both important for synthetic data generation. As the t-SNE analyses below show, the coverage of the synthetic data (blue) is very similar to that of the original data (red). With membership inference metrics (introduced in the privacy section), we also verify that EHR-Safe does not just memorize the original training data.

Statistical Similarity

We provide quantitative comparisons of statistical similarity between original and synthetic data for each feature. Most statistics are well-aligned between original and synthetic data. For example, the KS statistics, i.e., the maximum difference in the cumulative distribution function (CDF) between the original and the synthetic data, are mostly lower than 0.03. More detailed tables can be found in the paper. The figure below exemplifies the CDF graphs for original vs. synthetic data for three features; overall they are very close in most cases.
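The KS statistic described above can be computed directly from two samples. The sketch below is a standard-library implementation of the two-sample statistic; the feature values are made up for illustration and are not taken from MIMIC-III or eICU.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical heart-rate values for a real and a synthetic cohort.
real_hr = [62, 70, 75, 80, 88, 95, 110]
synthetic_hr = [60, 71, 74, 82, 87, 97, 108]
print(ks_statistic(real_hr, synthetic_hr))
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all, which is why values below 0.03 indicate a close match.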

Because one of the most important use cases of synthetic data is enabling ML innovations, we focus on the fidelity metric that measures the ability of models trained on synthetic data to make accurate predictions on real data. We compare such model performance to an equivalent model trained with real data. Similar model performance would indicate that the synthetic data captures the relevant informative content for the task. As one of the important potential use cases of EHR, we focus on the mortality prediction task. We consider four different predictive models: Gradient Boosting Tree Ensemble (GBDT), Random Forest (RF), Logistic Regression (LR), and Gated Recurrent Units (GRU).

In the figure above we see that in most scenarios, models trained on synthetic data and models trained on real data perform very similarly in terms of Area Under the Receiver Operating Characteristic Curve (AUC). On MIMIC-III, the best model (GBDT) on synthetic data is only 2.6% worse than the best model on real data, whereas on eICU, the best model (RF) on synthetic data is only 0.9% worse.
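The train-on-synthetic, test-on-real comparison described above can be sketched as follows. Everything here is a stand-in: the two-feature toy cohorts replace real EHR tables, the "synthetic" cohort is simply a noisier draw from the same toy distribution, and a small hand-written logistic regression replaces the GBDT, RF, LR and GRU models used in the post.

```python
import math
import random


def predict_proba(weights, row):
    """Logistic model; the last weight is the bias term."""
    z = sum(w * x for w, x in zip(weights, row)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit the weights by batch gradient descent on the log loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = predict_proba(weights, row) - label
            for j, value in enumerate(row):
                grad[j] += err * value
            grad[-1] += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_cohort(n, rng, noise=1.0):
    """Hypothetical stand-in for an EHR table: two features and a binary outcome."""
    X, y = [], []
    for _ in range(n):
        label = int(rng.random() < 0.3)
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, noise), rng.gauss(center, noise)])
        y.append(label)
    return X, y


rng = random.Random(0)
real_train_X, real_train_y = toy_cohort(300, rng)
synth_train_X, synth_train_y = toy_cohort(300, rng, noise=1.2)  # imperfect "synthetic" data
real_test_X, real_test_y = toy_cohort(300, rng)

model_real = train_logistic(real_train_X, real_train_y)
model_synth = train_logistic(synth_train_X, synth_train_y)
auc_real = roc_auc(real_test_y, [predict_proba(model_real, r) for r in real_test_X])
auc_synth = roc_auc(real_test_y, [predict_proba(model_synth, r) for r in real_test_X])
print(f"AUC, trained on real data:      {auc_real:.3f}")
print(f"AUC, trained on synthetic data: {auc_synth:.3f}")
```

Both models are scored on the same held-out real cohort, so any gap between the two AUC values reflects information lost by the synthetic data rather than a difference in the test set.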

Privacy Results

We consider three different privacy attacks to quantify the robustness of the synthetic data with respect to privacy.

  • Membership inference attack: An adversary predicts whether a known subject was present in the training data used for training the synthetic data model.
  • Re-identification attack: The adversary explores the probability of some features being re-identified using synthetic data and matching to the training data.
  • Attribute inference attack: The adversary predicts the value of sensitive features using synthetic data.

The figure above summarizes the results along with the ideal achievable value for each metric. We observe that the privacy metrics are very close to the ideal in all cases. The risk of understanding whether a sample of the original data is a member used for training the model is very close to random guessing; it also verifies that EHR-Safe does not just memorize the original train data. For the attribute inference attack, we focus on the prediction task of inferring specific attributes (e.g., gender, religion, and marital status) from other attributes. We compare prediction accuracy when training a classifier with real data against the same classifier trained with synthetic data. Because the EHR-Safe bars are all lower, the results demonstrate that access to synthetic data does not lead to higher prediction performance on specific features as compared to access to the original data.
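As an illustration of what "close to random guessing" means for the membership inference attack, the sketch below implements one simple distance-based variant. It is an assumption made for illustration and not necessarily the attack used in the paper: the adversary guesses that a record was in the training set if it lies unusually close to some synthetic record. All data are random toy vectors.

```python
import math
import random


def nearest_distance(record, dataset):
    """Euclidean distance from a record to its closest neighbour in a dataset."""
    return min(math.dist(record, other) for other in dataset)


def membership_attack_accuracy(members, non_members, synthetic):
    """Guess 'member' for the candidates that lie closest to the synthetic data."""
    candidates = [(r, True) for r in members] + [(r, False) for r in non_members]
    distances = [nearest_distance(r, synthetic) for r, _ in candidates]
    threshold = sorted(distances)[len(distances) // 2]  # median split
    correct = sum((d < threshold) == is_member
                  for d, (_, is_member) in zip(distances, candidates))
    return correct / len(candidates)


def sample_records(n, rng):
    """Hypothetical patient vectors with three numerical features."""
    return [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]


rng = random.Random(0)
train_records = sample_records(100, rng)    # used to fit the generator
holdout_records = sample_records(100, rng)  # never seen by the generator

memorizing_generator = [list(r) for r in train_records]  # worst case: copies its training data
independent_generator = sample_records(100, rng)         # ideal case: fresh draws

leaky = membership_attack_accuracy(train_records, holdout_records, memorizing_generator)
safe = membership_attack_accuracy(train_records, holdout_records, independent_generator)
print(f"attack accuracy, memorizing generator:  {leaky:.2f}")
print(f"attack accuracy, independent generator: {safe:.2f}")
```

A generator that memorizes its training records is fully exposed by this attack, whereas for a generator that only reproduces the underlying distribution the attack should do no better than chance, which is the ideal value of 0.5.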

Comparison to Alternative Methods

We compare EHR-Safe to alternatives (TimeGAN, RC-GAN, C-RNN-GAN) proposed for time-series synthetic data generation. As shown below, EHR-Safe significantly outperforms each.

Conclusions

We propose a novel generative modeling framework, EHR-Safe, that can generate highly realistic synthetic EHR data that are robust to privacy attacks. EHR-Safe is based on generative adversarial networks applied to the encoded raw data. We introduce multiple innovations in the architecture and training mechanisms that are motivated by the key challenges of EHR data. These innovations are key to our results that show almost-identical properties with real data (when desired downstream capabilities are considered) with almost-ideal privacy preservation. An important future direction is generative modeling capability for multimodal data, including text and image, as modern EHR data might contain both.

Acknowledgements

We gratefully acknowledge the contributions of Michel Mizrahi, Nahid Farhady Ghalaty, Thomas Jarvinen, Ashwin S. Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Arie Meir, Farhana Bandukwala, Elli Kanal, and Tomas Pfister.

Medidata Blog

Solving the EHR-to-EDC Challenge: A Scalable-first Approach

Electronic data capture (EDC) systems have long been used to collect, clean, transfer, and process clinical data. But despite the widespread adoption of electronic health records (EHRs) in the healthcare system, their implementation in clinical research has been slow and challenging, even though they’re recognized as a rich source of information that provides many benefits for clinical research, particularly with EHR-to-EDC integration, including:

  • reduced trial costs
  • faster trial completion
  • increased generalizability of results, enhanced recruitment
  • expanded scope of research
  • earlier identification of safety events

Clinical researchers have long sought to repurpose EHR data at scale to support clinical research. As explained in our new white paper, Solving the EHR-to-EDC Challenge: A Scalable-first Approach, multiple hurdles have hindered progress toward a truly scalable solution, including poor interoperability between EHRs and other systems, data quality issues, and the sheer volume of data in modern clinical trials. Although limited solutions have been developed for EHR-to-EDC data extraction, they lack scalability due to limited implementation options and the need for extensive IT infrastructure and data transfer agreements.

Why Solve the EHR-to-EDC Challenge? 

Approximately 70% of the data entered into EDC systems are duplicated from EHRs and other source systems, which has become a major pain point for site research coordinators. It’s not surprising that we often hear them say:

"Why am I manually entering data that’s already available somewhere else?"

Today’s industry standards require research coordinators to identify and review specific patient and visit records in the EHR (and other systems) and then determine what data needs to be transferred to the EDC from specific reports. This is carried out by toggling between two systems and then manually entering the relevant data. This manual re-entry is an enormous challenge for sites and also negatively affects sponsors and partners. 

Industry Changes Have Finally Enabled a Scalable Solution

Key industry changes have paved the way for a scalable multidisciplinary approach to solving the EHR-to-EDC challenge, focusing on presenting data to users rather than solely mapping it. 

The 21st Century Cures Act (2016) aimed to enhance interoperability and reduce regulatory burdens associated with EHR systems. The Office of the National Coordinator for Health Information Technology (ONC) adopted API-enabled “read” services and recommended the HL7® FHIR® standard, which has let the health applications market leverage data from any EHR in a standardized format. Although the process of extracting data from EHRs and feeding them into EDC systems remains complex, these changes have provided a tailwind and a standardized approach for data exchange. Furthermore, regulatory agencies have recognized the value of EHR data in clinical research and encouraged its use in guidances and recommendations. 
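To give a sense of what a standardized format means in practice, the sketch below parses a FHIR Observation resource and flattens it into a record that a form field could display. The resource fields used (`resourceType`, `code.coding`, `subject.reference`, `effectiveDateTime`, `valueQuantity`) are part of the FHIR Observation definition; the sample payload, the patient identifier and the output field names are hypothetical, and this is not a description of how any particular product performs the mapping.

```python
import json

# Hypothetical FHIR Observation, shaped like the payload an EHR "read" API might return.
OBSERVATION_JSON = """
{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [
      {"system": "http://loinc.org", "code": "8480-6", "display": "Systolic blood pressure"}
    ]
  },
  "subject": {"reference": "Patient/example-123"},
  "effectiveDateTime": "2023-05-01T09:30:00Z",
  "valueQuantity": {"value": 128, "unit": "mmHg"}
}
"""


def observation_to_edc_field(observation):
    """Flatten a FHIR Observation into a simple record (output keys are illustrative)."""
    if observation.get("resourceType") != "Observation":
        raise ValueError("expected a FHIR Observation resource")
    coding = observation["code"]["coding"][0]
    quantity = observation.get("valueQuantity", {})
    return {
        "patient": observation["subject"]["reference"],
        "code_system": coding.get("system"),
        "code": coding["code"],
        "label": coding.get("display", ""),
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
        "collected_at": observation.get("effectiveDateTime"),
    }


edc_field = observation_to_edc_field(json.loads(OBSERVATION_JSON))
print(edc_field)
```

Because every conformant EHR exposes the same resource structure, the same parsing logic can in principle be reused across sites, which is the practical benefit of the standard.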

Medidata’s Multi-pronged Approach to Overcome the EHR-to-EDC Challenge

Medidata has developed a uniquely scalable, easy-to-use solution for EHR-to-EDC data capture. Rave Companion is a data entry assistant for clinical trial sites using Rave EDC. When enabled with Medidata Health Record Connect, Companion presents matching EHR data for the EDC form, enabling completion of forms up to 90% faster than manual data entry. Health Record Connect is a healthcare data interoperability engine for securely and compliantly acquiring, transforming, and exchanging electronic health record (EHR) data. Health Record Connect has out-of-the-box connectivity to over 90% of the top research sites in the US and thousands more. Unlike other EHR-to-EDC solutions that can take months to implement on each study, Health Record Connect doesn’t require site-by-site EHR system integrations and data transfer agreements or complex study- and system-specific EHR-to-protocol mappings, so it’s up and running immediately. Also, sites don’t need to log in and use yet another system; Rave Companion automatically pops up when they open a Rave EDC form.

Integrating EHRs into clinical research holds immense potential for advancing medical knowledge and improving patient outcomes. By addressing interoperability challenges, ensuring data quality, and streamlining data transfer processes, researchers can leverage the rich information within EHRs to enhance various aspects of clinical research. Medidata’s Rave Companion aims to simplify the capture of EHR and other source data into EDC systems, providing a user-friendly and scalable solution for clinical research.

Download a copy of our white paper to learn how to revolutionize your EHR-to-EDC processes.

  • Open access
  • Published: 02 December 2022

Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications

  • Min Su 1   na1 ,
  • Tao Pan 2   na1 ,
  • Qiu-Zhen Chen 1   na1 ,
  • Wei-Wei Zhou 3   na1 ,
  • Yi Gong 1 , 4 ,
  • Gang Xu 2 ,
  • Huan-Yu Yan 1 ,
  • Qiao-Zhen Shi 1 ,
  • Ya Zhang 2 ,
  • Xiao He 5 ,
  • Chun-Jie Jiang 6 ,
  • Shi-Cai Fan 7 ,
  • Murray J. Cairns 8 , 9 ,
  • Xi Wang   ORCID: orcid.org/0000-0002-7572-6354 1 &
  • Yong-Sheng Li 2  

Military Medical Research volume 9, Article number: 68 (2022)

The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.

Complex tissues consist of a variety of cell types that occur in a huge variety of mixtures and states. The functional genomic information contained within each cell is often quite different from the neighboring cell populations and even cells of the same type. This means that the molecular analyses of cell populations in bulk tissues are inherently unreliable and insensitive. The incredible sensitivity and specificity that can be achieved by quantifying molecular alterations at single-cell resolution have led to unprecedented opportunities for uncovering the molecular mechanisms underlying the pathogenesis and progression of the disease [ 1 ]. Since its inception, single-cell RNA-sequencing (scRNA-seq) has been shown to be a powerful tool for profiling gene expression in individual cells [ 2 , 3 , 4 ], in both physiogenesis [ 5 , 6 ] and pathogenesis [ 7 , 8 , 9 ]. For example, by utilizing scRNA-seq in cancer biology [ 10 , 11 ], researchers have been able to determine the origin of cancer cells in various tumor types [ 12 , 13 ]. Moreover, from the perspective of treatment and prognosis, subpopulations of malignant cells with clinically significant features have been discovered, such as cells with dual epithelial–immune characteristics that are associated with poor prognosis in nasopharyngeal carcinoma [ 14 ]. Similarly, strong epithelial-to-mesenchymal transition (EMT) and stemness signatures were observed in metastatic breast cancer cells [ 15 , 16 ]. With the assistance of scRNA-seq, the quality and validity of organoid systems can also be accurately assessed and systematically evaluated [ 17 , 18 , 19 ]. Patient-derived organoid models are currently being applied to the dissection of disease pathology [ 20 ] and to facilitating drug screening for personalized treatment [ 21 , 22 ]. Furthermore, distinct cellular states along tumor progression were discovered and drug-resistant cell subsets were identified by joint application of patient-derived organoids and scRNA-seq [ 23 , 24 ].
In the current coronavirus disease 2019 (COVID-19) pandemic, scRNA-seq accelerates the research for characterizing the molecular basis and, therefore, understanding the pathology of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A variety of scRNA-seq-based studies have revealed the cell subtypes targeted by SARS-CoV-2 [ 25 ], profiled gene expression changes in immune cells upon infection [ 26 , 27 ], quantified the alteration of cell-to-cell interaction between different cell types [ 26 , 28 ], and provided important resources for the development of potential treatment of COVID-19 [ 26 , 28 ].

Since the emergence of commercial single-cell platforms, including those offered by 10 × Genomics [ 29 , 30 ] and Singleron [ 31 , 32 ], scRNA-seq services provided by core facilities of research institutes or third-party companies are making the technology more accessible, affordable and in some cases a routine technique for biomedical researchers and clinicians [ 33 ]. While these service providers typically perform data quality control and execute basic pipelines for data processing, the high-level data analysis needed for specific research objectives and scientific questions is not usually available. Thus, most biomedical researchers need to come to grips with the full scope of scRNA-seq data analysis by identifying the most suitable computational tools to dissect their data.

To overcome the barriers in scRNA-seq data analysis, in particular for biomedical studies, this review aims to: 1) summarize the recent advances in algorithm development and benchmarking results for every analysis task in analyzing biomedical scRNA-seq data, and 2) introduce a workflow composed of recommended software tools that are more appropriate for biomedical applications. The workflow covers basic scRNA-seq data processing, quality control (QC), feature selection, dimensionality reduction, cell clustering and annotation, trajectory inference, cell–cell communications (CCC), transcription factor (TF) activity prediction and metabolic analysis. Along with the recommended workflow, we also provide example computational scripts together with the software environment setting, which may help researchers to conduct the data analysis locally. The computational code is available at https://github.com/WXlab-NJMU/scrna-recom. To accommodate upcoming advanced approaches and more application scenarios, we will keep the computational scripts updated.

General tasks of single-cell RNA-seq data analysis

Typical data analysis steps of scRNA-seq can be generally divided into three stages: raw data processing and QC, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific research scenarios. Basic data analysis steps include data normalization and integration, feature selection, dimensionality reduction, cell clustering, cell type annotation and marker gene identification, whereas the advanced data analysis tasks consist of trajectory inference, CCC analysis, regulon inference and TF activity prediction, and metabolic flux estimation.

Experimental design

ScRNA-seq experiments need to be carefully designed to optimize the capability in addressing scientific questions [ 34 ]. Before starting the data analysis, the following information related to the experiment design needs to be gathered. (1) Species. For biomedical studies and clinical applications, human samples derived from patients are usually collected for sequencing [ 35 , 36 , 37 ]. In some cases, to study the underlying molecular mechanisms, mouse and other model organisms are also used [ 38 ]. Since the gene names and related data resources are different between humans and other species, it is important to specify the species for data analysis. For simplicity, we will focus on the data derived from human samples. (2) Sample origin. According to the scientific questions and sample accessibility, the sample types can be varied in different studies. For instance, to study solid tumors like hepatocellular carcinoma, tumor biopsies and peritumor samples are collected from patients for a case–control design [ 39 ]. Whereas the above design is feasible to some extent, peripheral blood mononuclear cells (PBMCs) are more easily accessible and widely used for scRNA-seq [ 40 , 41 ]. In addition, cells from patient-derived organoids are often used to study the impact of personal genetic variants on the development of specific organs, which can also be the origin of particular diseases [ 42 , 43 ]. Knowing the sample origin facilitates particular analysis, such as cell clustering and cell type annotation. (3) Experiment design. To study disease pathogenesis and the effectiveness of particular treatments, a case–control design is mostly adopted, like the tumor-versus-peritumor design [ 39 ]. For diseases such as COVID-19, no normal samples can be obtained from the same patients, thus healthy people with matched age and gender serve as a control group [ 40 ]. 
To control possible covariates between the patients and the control groups, the number of individuals in each group needs to be carefully considered [ 44 ]. In (prospective) cohort studies, the sample size is usually considerably larger, so that scRNA-seq cannot be applied to every sample from individual donors; in this case, nested case–control studies [ 45 ] and sample multiplexing [ 46 ] are often applied. In general, data analysis strategies need to be adjusted according to the types of the experiment design.

Raw data processing

Raw data processing steps include: sequencing read QC, read mapping [ 47 ], cell demultiplexing and cell-wise unique molecular identifier (UMI)-count table generation [ 48 ]. Whilst standardized data processing pipelines are provided with the release of scRNA-seq platforms, such as Cell Ranger for 10 × Genomics Chromium [ 49 ] and CeleScope ( https://github.com/singleron-RD/CeleScope ) for Singleron’s systems, alternative tools including UMI-tools [ 48 ], scPipe [ 50 ], zUMIs [ 51 ], celseq2 [ 52 ], kallisto bustools [ 53 ], and scruff [ 54 ] can also be used for this procedure. The choice between these pipelines seems less important than the downstream steps according to a recent study benchmarking scRNA-seq analysis [ 55 ]. In any case, we would not recommend raw data processing on personal computers, as these pipelines need massive computational resources and are optimized for high-performance computing architectures [ 56 ]. Third-party companies usually provide processed data, including UMI count matrices and QC metrics, which enable the researchers to focus on downstream data analysis for addressing scientific questions.

QC and doublet removal

The purpose of cell QC is to make sure all the ‘cells’ being analyzed are single and intact cells. Damaged cells, dying cells, stressed cells and doublets need to be discarded [ 57 , 58 ]. In ultrahigh-throughput scRNA-seq, quantitative metrics used for bulk RNA-seq QC, including read mappability and the fraction of reads mapped to exonic regions, are computed only at the sample/library level and thus cannot be used for cell QC. Instead, the three most commonly used metrics for cell QC are: the total UMI count (i.e., count depth), the number of detected genes, and the fraction of mitochondria-derived counts per cell barcode [ 56 , 59 ]. Cell Ranger [ 49 ] and CeleScope ( https://github.com/singleron-RD/CeleScope ) usually perform a first-round cell QC, which distinguishes potentially authentic cells from background cell barcodes by examining the distribution of count depth in a scRNA-seq library. One caveat is that, when damaged cells or cell debris make up a considerable proportion of the library, the threshold of a minimum count depth for valid cells is hard to determine. Possible solutions include the consideration of multiple QC metrics at the same time [ 56 ], and the application of more sophisticated approaches to rule out background and low-quality cells [ 60 ]. Typically, low numbers of detected genes and low count depth indicate damaged cells, whereas a high proportion of mitochondria-derived counts is indicative of dying cells. By contrast, too many detected genes and high count depth can be indicative of doublets [ 57 , 58 ]. While R packages like Seurat [ 61 , 62 , 63 ] and Scater [ 64 ] implement functions to facilitate cell QC, the thresholds of the QC metrics are largely dependent on the tissue studied, cell dissociation protocol, library preparation protocol, etc. Referring to publications with similar experiment designs would help to determine the thresholds, and advanced researchers may also inspect the joint distribution of the QC metrics. 
Notably, accumulated expression of genes encoding ribosomal proteins is not a typical QC metric, as the variation of ribosomal protein expression can be biologically meaningful [ 65 ].
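The three cell QC metrics described above can be computed per cell barcode as in the following minimal sketch. It uses plain Python dictionaries rather than the Seurat or Scater functions mentioned in the text, assumes human gene symbols in which mitochondrial genes carry the "MT-" prefix, and uses toy barcodes, counts and thresholds that are purely illustrative.

```python
def cell_qc_metrics(gene_counts):
    """Count depth, number of detected genes and mitochondrial fraction for one barcode."""
    total = sum(gene_counts.values())
    detected = sum(1 for count in gene_counts.values() if count > 0)
    mito = sum(count for gene, count in gene_counts.items() if gene.startswith("MT-"))
    return {
        "total_umi": total,
        "n_genes": detected,
        "mito_fraction": mito / total if total else 0.0,
    }


def passes_qc(metrics, min_umi, min_genes, max_genes, max_mito):
    """Apply all thresholds jointly; suitable values depend on tissue and protocol."""
    return (
        metrics["total_umi"] >= min_umi
        and min_genes <= metrics["n_genes"] <= max_genes
        and metrics["mito_fraction"] <= max_mito
    )


# Toy UMI counts for three hypothetical cell barcodes.
cells = {
    "AAAC": {"CD3E": 40, "ACTB": 120, "GAPDH": 30, "MT-CO1": 10},  # looks intact
    "AAAG": {"ACTB": 5, "MT-CO1": 45, "MT-ND1": 50},               # mostly mitochondrial
    "AAAT": {"ACTB": 2, "GAPDH": 1},                               # very low count depth
}

# Toy thresholds chosen to fit the tiny example, not recommendations.
thresholds = dict(min_umi=50, min_genes=3, max_genes=10, max_mito=0.2)
kept = [barcode for barcode, counts in cells.items()
        if passes_qc(cell_qc_metrics(counts), **thresholds)]
print(kept)
```

Evaluating the metrics jointly, rather than one at a time, follows the recommendation above: the second barcode has an acceptable count depth and is only flagged by its mitochondrial fraction.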

In addition, various sources of contamination need to be considered and controlled during the QC step. For example, libraries derived from PBMCs and solid tissues can be contaminated by red blood cells, and thus cells expressing a high level of hemoglobin genes (e.g., HBB ) are usually discarded [ 66 , 67 ]. Another source of contamination is cell-free or ambient RNA, as evidenced by reads mapped back to specific genes in cell-free droplets or wells in high-throughput scRNA-seq [ 68 , 69 ]. Methods and tools for estimating and removing such contamination have been recently developed, including SoupX [ 68 ], DecontX [ 69 ], fast correction for ambient RNA (FastCAR) [ 70 ] and CellBender [ 71 ]. Removal of the background signal caused by ambient RNA in single-cell gene expression improves downstream analyses and biological interpretation [ 69 , 71 ].

In high-throughput scRNA-seq experiments, it is not uncommon to observe a high rate of doublets, which may reach up to 40% of cell barcodes [ 72 , 73 ]. For this reason, a filtering step that only considers count depth and the number of detected genes is not adequate, particularly when the cell type composition is complex such that the count depth distribution of singlets is not distinct from that of doublets. Doublets composed of distinct cell types are likely to confound downstream analysis, particularly in cell clustering, differential expression analysis, and trajectory inference [ 56 , 74 ]. Fortunately, a number of sophisticated approaches have been developed to disentangle these confounding signals [ 72 ]. These methods consider the gene expression profiles of individual cell barcodes and report doublet scores as an indicator. The doublet scores are calculated based on either artificial doublets [such as single-cell remover of doublets (Scrublet) [ 74 ], doubletCells [ 75 ], binary classification based doublet scoring (bcds) [ 76 ], DoubletDetection [ 77 ], DoubletFinder [ 78 ], Solo [ 73 ], DoubletDecon [ 79 ]] or gene co-expression [such as co-expression based doublet scoring (cxds) [ 76 ]]. In a recent study that benchmarked the available computational doublet-detection methods on a comprehensive set of synthetic and real data [ 72 ], DoubletFinder [ 78 ] was recommended because it achieved both the highest detection accuracy and the best performance in downstream analysis.
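The general idea behind the artificial-doublet methods listed above can be sketched as follows. This is a simplified illustration in that spirit, not a reimplementation of Scrublet, DoubletFinder or any other cited tool: artificial doublets are simulated by summing random pairs of observed profiles, and each observed cell is scored by the share of simulated doublets among its nearest neighbours. The two-gene profiles, the parameter values and the function names are assumptions made for the example.

```python
import math
import random


def to_fractions(profile):
    """Scale a count profile so that its entries sum to 1."""
    total = sum(profile)
    return [value / total for value in profile]


def simulate_doublets(cells, n_doublets, rng):
    """Create artificial doublets by adding the counts of two randomly chosen cells."""
    doublets = []
    for _ in range(n_doublets):
        first, second = rng.sample(cells, 2)
        doublets.append([x + y for x, y in zip(first, second)])
    return doublets


def doublet_scores(cells, n_doublets=40, k=5, seed=0):
    """Score each observed cell by the fraction of simulated doublets among its k neighbours."""
    rng = random.Random(seed)
    simulated = simulate_doublets(cells, n_doublets, rng)
    pool = [(to_fractions(c), False) for c in cells]
    pool += [(to_fractions(d), True) for d in simulated]
    scores = []
    for i in range(len(cells)):
        profile = pool[i][0]
        neighbours = sorted(
            (math.dist(profile, other), is_simulated)
            for j, (other, is_simulated) in enumerate(pool) if j != i
        )[:k]
        scores.append(sum(is_simulated for _, is_simulated in neighbours) / k)
    return scores


# Toy data: two cell types with opposite marker expression, plus one true doublet.
rng = random.Random(1)
type_a = [[rng.randint(8, 12), rng.randint(1, 2)] for _ in range(20)]
type_b = [[rng.randint(1, 2), rng.randint(8, 12)] for _ in range(20)]
cells = type_a + type_b + [[11, 11]]  # the last barcode mixes both types

scores = doublet_scores(cells)
singlet_mean = sum(scores[:-1]) / len(scores[:-1])
print(f"mean singlet score: {singlet_mean:.2f}, doublet score: {scores[-1]:.2f}")
```

The mixed-type barcode sits among the simulated heterotypic doublets and therefore receives a high score, whereas singlets are mostly surrounded by other observed cells of their own type. Published tools add steps such as feature selection and dimensionality reduction before the neighbour search.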

Expression normalization

The variability of total UMI counts per cell depends on a range of both technical and biological parameters [ 56 ]. The technical factors relate to the efficiency of RNA capture, reverse transcription, cDNA amplification and sequencing depth, whereas the biological factors mostly relate to cell size and cell cycle phase. Because of this variation, it is almost impossible to obtain the absolute number of RNA molecules unless external spike-in RNA control is added to the sequencing libraries [ 80 , 81 ]. As in bulk RNA-seq, relative RNA abundance is commonly adopted for comparing gene expression profiles between individual cells; therefore, scRNA-seq data are typically normalized by global-scaling methods with scaling factors developed for bulk RNA-seq [ 82 , 83 , 84 ], which partially suppress the technical effects [ 56 ]. Popular global-scaling methods for bulk RNA-seq include transcript per million (TPM) [ 85 ], upper quartile (UQ) normalization [ 86 ], trimmed mean of M values (TMM) normalization [ 87 ], and the DESeq normalization method [ 88 ]; however, they are not appropriate for scRNA-seq due to the tendency for distortion through zero inflation [ 81 ]. Normalization methods tailored for scRNA-seq, including single-cell differential expression (SCDE) [ 84 ] and model-based analysis of single-cell transcriptomics (MAST) [ 82 ], can specifically model dropout events in differential expression analysis of scRNA-seq data. Another approach, Scran [ 75 ], overcomes the issues of scaling factor estimation (affected by too many zero counts) by pooling cells of similar gene expression profiles [ 89 ]. Moreover, Census estimates the total number of RNA molecules per cell without spike-in controls and uses these estimates as the scaling factors [ 90 ]. While simulation studies carried out by Vallejos et al. [ 81 ] suggested that Scran’s pooling strategy outperforms the other tools compared in scaling factor estimation, the TPM-/count depth-scaling method is widely used in practice [ 91 ].

Following scaling factor-based normalization, a pseudo-count of one is typically added to the resulting values, which are then log-transformed [ 56 , 62 ]. This step is practically useful and statistically sound, as it mitigates the mean–variance relationship in scRNA-seq count data and also reduces the skewness in expression data [ 56 , 64 ]. Toward better variance stabilization, SCTransform was recently developed by the Seurat team, which applies regularized negative binomial regression for scRNA-seq data normalization and variance stabilization [ 92 ].
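As a minimal sketch of the count depth scaling and log transformation described above, the following Python snippet normalizes a small cells-by-genes count matrix. The function name `normalize_log1p`, the target sum of 10,000 and the toy matrix are illustrative assumptions and do not reproduce the implementation of any cited tool.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Count depth scaling followed by log(x + 1) for a cells-by-genes matrix."""
    normalized = []
    for cell in counts:
        depth = sum(cell)  # total UMI count of this cell
        if depth == 0:
            # An empty cell carries no information; keep it as all zeros.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / depth  # per-cell scaling factor
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Toy matrix: 2 cells x 3 genes; the second cell has twice the count depth.
toy_counts = [[1, 3, 6], [2, 6, 12]]
toy_norm = normalize_log1p(toy_counts)
```

In this toy example the two cells differ only in count depth, so they receive the same normalized profile, which is the intended effect of global scaling.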

Some known biological effects, such as cell cycle and cell stress (characterized by overexpression of mitochondrial genes), may hinder the characterization of the particular biological signal of interest [ 56 ]. Hence, normalizing or correcting expression profiles against known biological effects may help interpret the data. For instance, correcting the effects of the cell cycle can improve developmental trajectory reconstruction [ 93 , 94 ]. Biological effects can be accounted for by scoring related biological features (e.g., cell cycle scores [ 95 ]), followed by a simple linear regression against the calculated scores as implemented in Seurat [ 61 , 62 ]. In addition, dedicated tools such as single-cell latent variable model (scLVM)/factorial single-cell latent variable model (f-scLVM) [ 93 , 96 ] and cell growth correction (cgCorrect) [ 97 ] can also be used for this purpose. Of note, correcting biological effects for one particular analysis (e.g., cell differentiation) may unintentionally hinder the signals for another (e.g., cell proliferation) [ 56 ]; care should be taken when choosing data normalization strategies for particular analysis tasks.
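The regression step mentioned above can be illustrated with a short sketch that fits a simple linear regression of one gene's expression on a per-cell score and keeps the residuals. The function name `regress_out` and the toy values are assumptions made for illustration; the sketch does not reproduce Seurat's actual implementation.

```python
def regress_out(expression, score):
    """Return residuals of a simple linear regression of expression on score."""
    n = len(score)
    mean_x = sum(score) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in score)
    if sxx == 0:
        # A constant score explains nothing; only center the expression.
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(score, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(score, expression)]

# Toy data: a gene whose expression is fully explained by a cell cycle score.
cycle_score = [0.0, 1.0, 2.0, 3.0]
gene_expr = [1.0, 3.0, 5.0, 7.0]
residuals = regress_out(gene_expr, cycle_score)
```

Because the toy gene is a perfect linear function of the score, its residuals are all zero; for real genes the residuals retain the variation not explained by the regressed covariate.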

Data integration

As mentioned in the ‘Experiment design’ section, biomedical studies usually make case versus control comparisons [ 39 ]. Usually, batches of samples obtained from different medical centers or hospitals should be integrated before downstream analysis. For studies using patient-derived organoids, data integration also applies to cells harvested at different time points to depict organoid development [ 98 ]. In these cases, one other unwanted technical factor, batch effects, cannot be avoided because cells and library preparation were handled by different persons, at different time points, or with a different batch of reagents [ 91 , 99 ]. In scRNA-seq, batch effects can be nonlinear, which may not be easily disentangled by state-of-the-art batch correction tools, such as ComBat [ 100 ]. Therefore, numerous methods have been recently developed for batch effect correction in scRNA-seq data integration, trying to relieve or remove the effects caused by batch-specific biases while preserving biological variations [ 56 , 99 ]. The batch effect correction methods can be classified into a few categories: 1) tools developed for bulk expression analysis, including ComBat [ 100 ] and limma [ 101 ]; 2) approaches based on mutual nearest neighbors (MNN) in high-dimensional gene expression space or its subspace, such as mnnCorrect [ 102 ], fastMNN [ 102 ], Scanorama [ 103 ] and batch balanced k nearest neighbours (BBKNN) [ 104 ]; 3) methods that try to align cells with correlated/shared features in dimensionality-reduced spaces, including canonical correlation analysis (CCA) [ 61 , 62 ], Harmony [ 105 ], and linked inference of genomic experimental relationships (LIGER) [ 106 ]; and 4) methods based on deep generative models, such as scGen [ 107 ]. Besides, depending on the choice of integration anchors, the algorithms can also be sorted into different types, such as genomic features as the anchor and cells as the anchor [ 108 ].

Recently, Tran et al. [ 99 ] compared 14 batch-effect correction methods available at that time on 10 datasets under 5 different integration scenarios. Among them, Harmony [ 105 ], LIGER [ 106 ], and CCA implemented in Seurat 3 [ 62 ] were recommended according to their overall performance [ 99 ]. Based on these results and our own experience, we suggest performing data integration with Harmony, Seurat3/4-CCA, and LIGER, in that order, because there is no clear winner among the three strategies when dealing with distinct datasets [ 99 ]. Harmony runs faster than the other tools and is therefore suitable for initial exploration; Seurat3/4-CCA is moderate in mixing cells from different batches, whereas LIGER mixes batches most aggressively, sometimes at the cost of cell type purity. Of note, the effectiveness of batch-effect correction, or the extent of the batch effects in the data, can be assessed by comparing clustering or visualization results from the batch-effect corrected analysis with those obtained by directly merging cells derived from multiple samples (e.g., merge function in Seurat), and by computing test metrics such as k-nearest-neighbor batch-effect test (kBET) [ 91 ].

Feature selection

While cell QC removes background cells and problematic cells, feature selection concerns genes. In the human genome, more than 20,000 genes are annotated, and mapped reads are counted for individual gene loci to yield the UMI count matrix. However, not all of the > 20,000 genes are informative in characterizing cell-to-cell heterogeneity or distinguishing cell types/states [ 56 ]. Therefore, the term ‘feature selection’ was borrowed from the fields of statistics and machine learning to describe the process of selecting biologically informative genes for downstream analysis. This process is typically unsupervised, meaning that no information related to cell types or other biological processes of interest is needed.

Considering the relatively high noise level in scRNA-seq data, feature selection usually identifies genes with stronger biological variability than technical noise [ 58 ]. Since the technical noise largely depends on the mean expression of genes [ 109 ], highly variable genes (HVGs) were originally identified by examining the relationship between the coefficient of variation and expression means [ 58 ]. Due to its usefulness in reducing technical noise and relieving the computational demand in downstream analysis, such as cell clustering and dimensionality reduction for visualization [ 110 ], many other tools for HVG identification were developed and comparatively evaluated [ 111 , 112 , 113 ]. Instead of identifying HVGs, alternative feature selection methods consider dropouts and prioritize genes with a higher-than-expected number of observed zeros [ 114 ].

The number of genes selected for downstream analysis is theoretically dependent on the complexity of cellular composition in the samples studied. While approaches for HVG identification can determine the number of HVGs at a given significance level, identifying a fixed number of HVGs is becoming popular, and typically the HVG number is between 1000 and 5000 [ 56 ]. Studies have shown that downstream analysis is not sensitive to the exact number of HVGs [ 110 , 115 ]. Notably, some unfavorable covariates such as batch effect may distort HVG identification [ 82 ]. Therefore, HVG selection should be performed after correction for the covariates. In the presence of batch effects, feature selection may also be conducted in individual samples before data integration [ 56 ].
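To illustrate the idea of keeping a fixed number of highly variable genes, the sketch below ranks genes by a simple dispersion measure (variance divided by mean) and returns the top entries. This is a deliberate simplification: the cited HVG methods model the mean-variance trend more carefully, and the function name `top_variable_genes`, the gene names and the toy matrix are hypothetical.

```python
from statistics import mean, pvariance

def top_variable_genes(matrix, gene_names, n_top):
    """Rank genes by dispersion (variance / mean) and return the n_top names."""
    dispersions = []
    for j, name in enumerate(gene_names):
        values = [cell[j] for cell in matrix]  # expression of gene j across cells
        mu = mean(values)
        disp = pvariance(values, mu) / mu if mu > 0 else 0.0
        dispersions.append((disp, name))
    dispersions.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in dispersions[:n_top]]

# Toy cells-by-genes matrix: geneA is constant, geneB is bimodal, geneC varies little.
toy_matrix = [
    [5, 0, 2],
    [5, 9, 2],
    [5, 0, 3],
    [5, 9, 3],
]
toy_genes = ["geneA", "geneB", "geneC"]
selected = top_variable_genes(toy_matrix, toy_genes, n_top=2)
```

In practice `n_top` would be on the order of the 1000 to 5000 genes mentioned above, and the ranking would be computed on normalized data.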

Dimensionality reduction and visualization

With 1000–5000 HVGs selected, the dimensionality of the expression data is still high, thus obstructing manual inspection of the dataset, such as visualization, clustering and cell type annotations [ 116 ]. To this end, the dimensions of the expression matrixes can be further reduced by dimensionality reduction techniques, which project the cells from a high-dimensional space into a low-dimensional embedding space, and preserve the biological information on cell-to-cell variability [ 56 , 59 ]. The widely used methods for dimensionality reduction include principal component analysis (PCA) [ 117 ], non-negative matrix factorization (NMF) [ 118 ], multi-dimensional scaling (MDS) [ 119 ], t-distributed stochastic neighbor embedding (t-SNE) [ 120 ] and uniform manifold approximation and projection (UMAP) [ 121 ].

PCA is a general technique for dimensionality reduction and denoising, and has been widely used in scRNA-seq data analysis [ 122 , 123 ]. With the linear projection of the original expression matrix to its subspace, PCA gives the principal components (PCs) in order of significance. While the first two or three PCs can be used for visualization, a few more PCs are typically retained for downstream analysis, such as cell clustering and trajectory inference. The number of PCs for retention largely depends on the complexity of the dataset [ 59 ], and can be determined by the “elbow” method [ 56 ] or the jackstraw permutation-test-based method [ 95 , 124 ]. Nevertheless, PCA cannot take into account the dropout events in the analysis, which leads to the development of several new methods. Zero-inflated factor analysis (ZIFA) is one of such methods based on factor analysis, which explicitly models the dropout characteristics and outperforms the comparative methods [ 125 ]. Similar to PCA, NMF is a linear projection method for dimensionality reduction, and showed robust performance in cell clustering based on scRNA-seq [ 118 ].

For visualization, nonlinear dimensionality reduction methods are more suitable, as they allow a global nonlinear embedding in a two-/three-dimensional space [ 126 ]. MDS is one of the nonlinear dimensionality reduction methods and preserves the distance among the cells in the original space [ 119 ]. However, MDS does not scale well to large scRNA-seq datasets because calculating the pairwise distances becomes computationally demanding when the number of cells is huge [ 127 ]. Emerging evidence suggests that t-SNE and UMAP are more suitable for scRNA-seq data, and both have been widely used in single-cell analysis for data visualization and cell population identification. However, t-SNE suffers from limitations such as slow computation on large-scale scRNA-seq datasets [ 128 ] and poor preservation of global data structure [ 121 ]. With advantages in both respects, UMAP has become the most popular choice for dimensionality reduction. UMAP not only helps visualize the cell clusters but also facilitates annotating them. It is worth noting, however, that while UMAP strikes a balance between preserving global data structure and capturing local similarity, the cell-to-cell distance in the resulting space is not preserved. Hence, downstream analysis such as clustering and pseudotime inference is typically executed based on the PCA results with several to dozens of PCs.

Identification of cell subpopulations

One of the key applications in single-cell transcriptomics is to determine cell subpopulations based on cell clustering or classification [ 129 , 130 ]. Due to the high level of noise in the scRNA-seq data, applying dimensionality reduction approaches to scRNA-seq matrix data may facilitate cell clustering. Whilst PCA is commonly used for bulk RNA-seq, the true biological variability of gene expression among cell subpopulations may not be readily distinguished by a small number of PCs. To better account for this variation, NMF was adapted to disentangle subpopulations in single-cell transcriptome data [ 118 , 131 ], and has been shown to outperform PCA with greater accuracy and robustness (Fig.  1 ). Likewise, SinNLRR was developed to provide robust clustering of gene expression subspace by non-negative and low-rank representation [ 132 ].

figure 1

Typical computational strategies and methods for clustering cells using scRNA-seq data. With the processed scRNA-seq data, the SC3 approach, the Seurat clustering implementation based on the community detection method, and the NMF method are popular choices. scRNA-seq single-cell RNA sequencing, SC3 single-cell consensus clustering, NMF non-negative matrix factorization, PC principal component, SNN shared nearest neighbor, scVDMC variance-driven multitask clustering of scRNA-seq data, SIMLR single-cell interpretation via multikernel learning, UMAP uniform manifold approximation and projection, t-SNE t-distributed stochastic neighbor embedding

General-purpose clustering methods, such as the k-means algorithm, have also been applied to scRNA-seq datasets, and based on this application, the single-cell consensus clustering (SC3) approach was developed [ 133 ] (Fig. 1 ). Another popular category of methods for cell clustering in scRNA-seq is community detection based on a nearest-neighbor network of the cells [ 134 ], which was adopted and implemented in the Seurat R package [ 61 ] (Fig. 1 ). Besides, the community has developed a diversity of approaches for cell clustering. For instance, BackSPIN takes advantage of the biclustering technique to avoid unfavorable pairwise comparisons in hierarchical clustering [ 135 ], single-cell interpretation via multikernel learning (SIMLR) is based on multi-kernel learning [ 136 ], clustering through imputation and dimensionality reduction (CIDR) [ 137 ] utilizes imputation to mitigate the impact of dropouts in scRNA-seq, and Single-cell Aggregated Clustering via Mixture Model Ensemble clustering (SAME-clustering) [ 138 ] ensembles clustering results from multiple methods. Nevertheless, two independent benchmarking studies have shown that SC3 and the clustering method in Seurat perform similarly to each other and outperform all other methods compared [ 139 , 140 ].
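The nearest-neighbor network underlying community detection can be sketched as follows: each cell is linked to cells with which it shares many nearest neighbors, measured by the Jaccard index of their neighborhoods. As a stated simplification, the sketch groups cells by connected components of the thresholded graph rather than by a modularity-based community detection algorithm; the function names, the parameters `k` and `min_jaccard`, and the toy coordinates are assumptions for illustration only.

```python
import math

def knn_sets(points, k):
    """For each point, the set of its k + 1 closest indices (the point itself included)."""
    neighbours = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(order[: k + 1]))
    return neighbours

def snn_components(points, k, min_jaccard):
    """Link cells with overlapping neighbourhoods, then label connected components."""
    nbrs = knn_sets(points, k)
    n = len(points)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            jaccard = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
            if jaccard >= min_jaccard:  # keep only well-supported edges
                adjacency[i].add(j)
                adjacency[j].add(i)
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = current
                stack.extend(adjacency[node])
        current += 1
    return labels

# Toy cells in a two-dimensional PC space forming two well-separated groups.
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cluster_labels = snn_components(pcs, k=2, min_jaccard=0.5)
```

The pairwise loops are quadratic in the number of cells, which is acceptable for a toy example but not for real datasets, where approximate nearest-neighbor search is normally used.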

Similarity or distance metrics are crucial for clustering cells in scRNA-seq, and can be specific to experiment platforms or particular samples. It has been shown that, compared to unsupervised clustering methods, supervised methods for cell type identification suffered less from batch effects, number of cell types, and imbalance in cell population composition [ 141 ]. Mechanistically, the supervised methods rely on a comprehensive reference database with known cell types annotated, based on which a classification model is trained for predicting the cell types in an unannotated dataset [ 142 , 143 ]. CellAssign [ 144 ], scmap [ 145 ], single cell recognition (SingleR) [ 146 ], characterization of cell types aided by hierarchical classification (CHETAH) [ 147 ], and SingleCellNet [ 148 ] are methods of this category. Despite the clear strengths of the supervised methods, unsupervised methods are generally better at identifying unknown cell types and have higher computational efficiency [ 141 ]. Therefore, the clustering methods implemented in Seurat have the best overall performance, and are suggested as the first choice of cell type identification [ 141 ].

Another important issue for single-cell clustering analysis is the detection of rare cell types, which play an important role in complex diseases but have a low abundance. RaceID [ 129 ], GiniClust [ 149 ], SINCERA [ 150 ] and DendroSplit [ 151 ] are clustering algorithms specifically designed to identify rare cell types in scRNA-seq data analysis.

Cell type annotation

Assigning cell identities to cell subpopulations, a process known as cell type annotation, is a critical step in scRNA-seq data analysis [ 152 ]. Manual annotation of cell types is time-consuming and potentially subjective. Thus, emerging computational tools have been developed for automatic cell type annotation [ 143 , 152 ]. These computation methods usually can be classified into three main groups (Fig.  2 ).

figure 2

Typical strategies and representative methods for annotating cell subpopulations identified by scRNA-seq. In addition to manual annotation, which is potentially time-consuming and subjective, automated cell type annotation can be mainly sorted into three categories: marker gene-based, reference transcriptome-based, and supervised machine learning-based approaches. The example approach names are listed in the plot. scRNA-seq single-cell RNA sequencing, scCATCH single-cell cluster-based automatic annotation toolkit for cellular heterogeneity, SCINA semi-supervised category identification and assignment, CHETAH characterization of cell types aided by hierarchical classification, SingleR single cell recognition, OnClass ontology-based single cell classification, ACTINN automated cell type identification using neural networks

The first type is marker gene-based, which relies on the availability of cell type-specific markers in public databases or literature. CellMarker [ 153 ] and PanglaoDB [ 154 ] are commonly used online resources storing the markers for a large variety of cell types in human and mouse tissues. CellMarker deposits over 13,000 cell markers of about 500 human cell types, obtained by manually curating over 100,000 published papers [ 153 ], and PanglaoDB is a community-curated cell marker compendium, containing 6000 markers for different cell types from over 1000 scRNA-seq experiments [ 154 ]. Moreover, the TF-Marker database was developed for providing cell or tissue-specific TFs and related markers for humans [ 155 ]. These databases are valuable resources for cell type annotations. Meanwhile, a number of tools have been developed to use the marker genes for cell type annotations, such as ScType [ 156 ], scSorter [ 157 ], semi-supervised category identification and assignment (SCINA) [ 158 ], single-cell cluster-based automatic annotation toolkit for cellular heterogeneity (scCATCH) [ 159 ] and CellAssign [ 144 ]. Some of these methods apply sophisticated statistical models to make use of the prior knowledge of marker genes. For example, SCINA builds a semi-supervised model to exploit previously identified marker genes with the expectation–maximization (EM) algorithm [ 158 ], and CellAssign leverages a probabilistic graphical model to annotate cells into predefined or novel cell types based on prior knowledge of cell-type marker genes, while accounting for batch and sample effects [ 144 ].

The second group of methods is reference transcriptome-based, which uses cell type-labeled scRNA-seq datasets as input for cell type annotation, via the search for the best correlation between the queried data and the reference data. Popular tools of this group include CHETAH [ 147 ], scmap [ 145 ], scMatch [ 160 ] and SingleR [ 146 ]. The CHETAH algorithm is based on a hierarchical tree built by reference profiles of known cell types, and searches for a cell’s best annotation by stepwise traversing the tree from the root node to a leaf node [ 147 ]. By calculating the correlation coefficients between the input cell and two tree branches under consideration based on the 200 most discriminating genes for the two branches, a profile score and confidence score are calculated for selecting tree branches to continue tree traversing. The SingleR approach correlates each unannotated single-cell transcriptome with the reference transcriptomes of known cell types based on HVGs among cell types in the reference data [ 146 ]. SingleR assigns cell identity in an iterative manner, and in each iteration the reference set is reduced to refine the assignment. Notably, the comprehensiveness of the reference transcriptomics data is critical for this group of methods. The reference data from Blueprint [ 161 ], Encode [ 162 ] and the Human Primary Cell Atlas [ 163 ] are commonly used.
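The core idea of reference transcriptome-based annotation, assigning the label of the best-correlated reference profile, can be sketched in a few lines. Pearson correlation is used here only for brevity; the cited tools apply their own correlation measures, gene selections and iterative refinements. The function names, the two reference profiles and the query cell are hypothetical toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def annotate(cell, references):
    """Return the best-correlated reference label and all correlation scores."""
    scores = {label: pearson(cell, profile) for label, profile in references.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical reference profiles over the same four genes as the query cell.
reference_profiles = {
    "T cell": [9.0, 8.0, 0.5, 0.2],
    "B cell": [0.3, 0.4, 9.0, 7.5],
}
query_cell = [6.0, 5.0, 0.0, 1.0]
label, correlations = annotate(query_cell, reference_profiles)
```

A practical implementation would also report a confidence measure, so that cells resembling none of the reference types can be left unassigned.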

Lastly, the third group leverages supervised machine learning-based approaches, where classifiers trained by a labeled reference are then applied to predict cell types of unannotated cells. For instance, SingleCellNet uses multi-class random forest classifiers [ 148 ], automated cell type identification using neural networks (ACTINN) uses artificial neural networks [ 164 ], scPred uses support vector machine (SVM) [ 165 ], and scClassify uses ensemble learning [ 166 ] for cell type annotation. Furthermore, ontology-based single cell classification (OnClass) may also accurately annotate cell types absent in the training dataset, through identifying the nearest cell type in low-dimensional embeddings resulting from the Cell Ontology and the unannotated cells [ 167 ].

Automated methods for cell type annotation have been applied in a broad range of biomedical studies, including cancer research. However, a recent benchmarking study has demonstrated that every computational method possesses specific advantages over the others under different scenarios [ 142 ], making it difficult for clinical users to select the appropriate tools. Integrating the annotation results from multiple tools may be a solution to this issue, and may achieve more accurate cell type annotation. To this end, ImmCluster has been developed recently for immune cell clustering and annotation, integrating seven reference-based and four marker gene-based computational methods, supported by manually curated marker gene sets [ 168 ]. Comparative studies have shown that ImmCluster provides more accurate and stable cell type annotation than individual methods [ 168 ].

Marker gene identification

Marker genes of a particular cell cluster or cell type are an important resource for characterizing its function. Conversely, as shown above, marker genes can also be used for cell type annotation. The typical methods to identify cell cluster/type-specific genes are those that identify differentially expressed genes (DEGs) among the clusters based on statistical tests. For example, the scRNA-seq analysis pipelines Seurat [ 169 ] and SINCERA [ 150 ] use the nonparametric Wilcoxon’s rank-sum test to identify highly expressed genes of specific cell types. It has been shown that Wilcoxon’s rank-sum test has lower false positive rates than dedicated methods for sequencing-based DEG analysis [e.g., DESeq2 [ 170 ] and empirical analysis of digital gene expression (DGE) in R (edgeR) [ 171 ]] when the sample size is large [ 172 ]. In addition, the nonparametric Kruskal–Wallis test was adopted in SC3 [ 133 ] for comparisons of more than two groups of cells. Considering dropouts in scRNA-seq and differences in gene expression distribution between cell types or states, many other methods have been developed for marker gene identification, such as MAST [ 82 ], SCDE [ 84 ], and DEsingle [ 173 ].
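For readers unfamiliar with the rank-sum test, the following sketch compares the expression of one gene between a cluster and the remaining cells. It uses the normal approximation without tie or continuity correction, which is an assumption of this sketch rather than a description of how Seurat or SINCERA compute their P values; the function name and toy values are likewise illustrative.

```python
import math

def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation."""
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share their average rank
        for t in range(i, j + 1):
            ranks[t] = avg_rank
        i = j + 1
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2  # Mann-Whitney U statistic of group A
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u_a, z, p_value

# Toy expression values of one gene in a cluster versus all other cells.
in_cluster = [5.0, 6.0, 7.0, 8.0, 9.0]
other_cells = [0.0, 1.0, 2.0, 3.0, 4.0]
u_stat, z_score, p_val = rank_sum_test(in_cluster, other_cells)
```

With thousands of genes tested per cluster, the resulting P values would additionally need correction for multiple testing.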

There is one more category of methods, which identify cell-specific genes simultaneously with the process of cell clustering rather than a step thereafter. As introduced in the earlier section, BackSPIN is based on a biclustering approach [ 135 ], which clusters highly expressed genes together when clustering cells. Similarly, iterative clustering and guide-gene selection (ICGS) first identifies guide genes by pairwise correlation of expressed genes, and then performs iterative clustering with the guide genes [ 174 ]. Moreover, DendroSplit considers marker genes’ significance level in identifying sub-clusters [ 151 ]. Finally, statistically modeling the distribution of gene expression across individual cells, methods like variance-driven multitask clustering of scRNA-seq data (scVDMC) [ 175 ], BPSC [ 176 ] and bias-corrected sequencing analysis (BCseq) [ 177 ] have been developed to improve both cell subtype identification and differential expression analysis.

Regarding the best choice of DEG tools in scRNA-seq, a recent study compared 36 approaches and found fundamental differences between the methods compared [ 178 ]. It has been pointed out that prefiltering of lowly expressed genes may help DEG analysis, and the methods used for bulk RNA-seq analysis in general have comparable performance to those specifically developed for scRNA-seq. Overall, the nonparametric Wilcoxon’s rank-sum test ranks high in most application scenarios, except for complex experimental designs.

Functional enrichment analysis

To facilitate the interpretation and organization of marker genes identified in each cell type, functional enrichment analysis is commonly performed. Computational methods developed for bulk transcriptomics can be easily applied to this analysis, such as Database for Annotation, Visualization, and Integrated Discovery (DAVID) [ 179 ]. This kind of analysis requires a hard cutoff on statistical significance to define the marker genes; in contrast, the widely-used gene set enrichment analysis (GSEA) is a cutoff-free approach [ 180 , 181 ]. GSEA begins with ordering genes based on differential expression statistics between cell populations of interest, followed by statistically assessing if a functionally meaningful gene set or pathway is significantly overrepresented toward the top or bottom of the ranked list. To facilitate GSEA analysis, Molecular Signatures Database (MSigDB) provides a series of annotated gene sets, including pathways and hallmark gene signatures [ 182 ].
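The running-sum idea behind GSEA can be sketched as follows: walking down the ranked gene list, the score increases at genes in the set and decreases otherwise, and the enrichment score is the largest deviation from zero. The sketch uses equal step sizes, whereas GSEA as published weights the steps by the ranking statistic and assesses significance by permutation; the function name and gene identifiers are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set along ranked_genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0  # the score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hits if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best

# Genes ordered from most upregulated to most downregulated (placeholder names).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
```

A gene set concentrated at the top of the list yields a positive score and one concentrated at the bottom yields a negative score, matching the interpretation given above.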

Besides the above scenarios where the functional annotation is performed based on marker genes or differential expression between two groups of cells, this analysis can also be carried out at the single-cell level. Single sample GSEA (ssGSEA) and gene set variation analysis (GSVA) [ 183 ], which are analogous to GSEA and designed for enrichment analysis of single bulk samples, have now been widely used in scRNA-seq to compute signature scores [ 184 , 185 ]. Besides, to account for the characteristics of scRNA-seq data, more specific tools including Vision [ 186 ], Pagoda2 [ 187 ], AUCell [ 188 ], single-cell signature explorer (SCSE) [ 189 ] and jointly assessing signature mean and inferring enrichment (JASMINE) [ 190 ] have been proposed, and are in general more suitable for signature scoring in scRNA-seq [ 190 ]. In addition, these signature-scoring methods can also be used for pathway activity inference [ 185 ].

Trajectory inference and RNA velocity

In addition to the cell-to-cell heterogeneity that can be captured by scRNA-seq, the dynamics of transcriptomes may also reflect the developmental trajectory or cell state transitions. Trajectory inference [ 191 ], pseudo-time estimation [ 192 ], and RNA velocity modeling [ 193 ] are all helpful to reveal molecular characteristics and regulatory mechanisms during cell differentiation or activation.

Trajectory inference has been a popular research field in recent years, with approximately a hundred computational tools developed [ 191 ], facilitating studies in developmental biology, as well as cancer development and immune response status alterations. Furthermore, applying this category of methods may also facilitate the objective identification of new cell types [ 194 ], and the inference of regulatory networks during the development or status transition [ 188 ]. According to the types of trajectories, the trajectory inference methods can also be classified into different categories, including linear methods [e.g., SCORPIUS [ 195 ], tools for single cell analysis (TSCAN) [ 196 ], Wanderlust [ 197 ]], bifurcating methods [e.g., diffusion pseudotime (DPT) [ 198 ], Wishbone [ 199 ]], multifurcation methods [e.g., FateID [ 200 ], STEMNET [ 201 ], mixtures of factor analysers (MFA) [ 202 ]], tree methods [e.g., Slingshot [ 203 ], scTite [ 204 ], Monocle [ 205 ]], and graph methods [e.g., partition-based graph abstraction (PAGA) [ 206 ], rare cell type identification (RaceID) [ 129 ], selective locally linear inference of cellular expression relationships (SLICER) [ 207 ]]. Currently, the trajectory inference methods are maturing, particularly the linear and bifurcating methods [ 191 ]. Based on a recent benchmarking study, guidelines for practical applications are given so that biomedical researchers can choose the appropriate methods according to prior knowledge on the expected topology in the data [ 191 ]; otherwise, PAGA, Monocle, RaceID, and Slingshot are recommended for an initial investigation.

Per existing biological knowledge on the starting point of inferred developmental or transition trajectory, cells along the trajectory can be ordered in a pseudo-temporal order. If there are bifurcation, multifurcation, or tree structures in the trajectory, multiple routes should be applied to go through tree branches separately. In this manner, it is easy to investigate gene expression dynamics along the pseudo time. Methods have been developed to conduct the trajectory-/pseudotime-based differential expression analysis [ 208 , 209 ], which may reveal the dynamic regulation of lineage/status specification.

An alternative way to capture transcriptome dynamics is to use RNA velocity, which is based on the relationship between mature (spliced) and immature (i.e., with unspliced introns) transcripts in the same cell. If a gene has relatively more unspliced transcripts in a cell, the gene is being upregulated, and vice versa. By jointly quantifying the ratio between mature and immature transcripts and the gene expression changes during state changes, the direction of cell transition can be determined [ 192 ]. This rationale was realized in the first RNA velocity method, Velocyto [ 210 ], and improved in the follow-up method scVelo, where a likelihood-based dynamical model was adopted [ 211 ]. Furthermore, recently developed methods [ 212 , 213 ] have combined RNA velocity with trajectory inference, resulting in directed trajectory inference independent of prior knowledge. For instance, CellRank takes advantage of both the robustness of trajectory inference and the directional information from RNA velocity, enabling the detection of previously unknown trajectories and cell states [ 212 ]. CellPath is another method integrating single-cell gene expression dynamics and RNA velocity information for trajectory inference [ 213 ].
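The rationale above can be made concrete with a small sketch for a single gene. Assuming a steady-state relationship in which unspliced counts are proportional to spliced counts (u ≈ γ·s), the residual u − γ·s is positive in cells where the gene appears to be upregulated and negative where it appears to be downregulated. Fitting γ by least squares over all cells is a simplification adopted here for brevity and is not claimed to match the fitting procedure of Velocyto or scVelo; the function name and toy counts are assumptions.

```python
def steady_state_velocity(unspliced, spliced):
    """Fit u = gamma * s through the origin; return gamma and per-cell residuals."""
    ss = sum(s * s for s in spliced)
    if ss == 0:
        gamma = 0.0  # no spliced signal, so the slope cannot be estimated
    else:
        gamma = sum(u * s for u, s in zip(unspliced, spliced)) / ss
    velocities = [u - gamma * s for u, s in zip(unspliced, spliced)]
    return gamma, velocities

# Toy counts of one gene in four cells with equal spliced abundance.
spliced_counts = [2.0, 2.0, 2.0, 2.0]
unspliced_counts = [1.0, 1.0, 2.0, 0.0]
gamma, velocity = steady_state_velocity(unspliced_counts, spliced_counts)
```

In this toy example the third cell has an excess of unspliced transcripts (positive residual), and the fourth cell has a deficit (negative residual), while the first two cells sit at the fitted steady state.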

Cell–cell communications

CCC events play important roles in organism development and homeostasis, as well as disease generation and progression. For example, tumor microenvironments are complex ecosystems composed of tumor cells, stromal cells and a variety of immune cells, such that abnormal or disrupted communication among these cells may promote tumor growth. To this end, various computational tools have been developed to infer CCC using scRNA-seq data [ 214 ]. The communication between cells commonly depends on ligand-receptor (LR) interactions, which are usually quantified by LR co-expression.

To facilitate such analyses, known ligand-receptor interactions (LRIs) have been manually curated and deposited in databases (Fig. 3a). To date, quite a few LRI databases are available, including CellPhoneDB [ 215 ], ICELLNET [ 216 ], CellTalkDB [ 217 ], SingleCellSignalR [ 218 ] and OmniPath [ 219 ]. The latest release of CellPhoneDB (version 4) includes nearly 2000 high-confidence interactions between ligand and receptor proteins, as well as heteromeric protein complexes [ 215 , 220 ]. CellTalkDB is another comprehensive LRI database for human and mouse, including 3398 human LR pairs and 2033 mouse LR pairs [ 217 ]. Meanwhile, scRNA-seq data are processed with the methods described previously for cell clustering and annotation (Fig. 3b). By integrating the annotated scRNA-seq data with known LRIs, sample-specific LR scores are typically calculated to quantify the interaction potential. Based on LR co-expression, LR scoring functions fall into a few categories [ 221 ], including expression thresholding, expression correlation, expression product, and combinations based on differential expression [ 222 ]. For example, Camp et al. [ 223 ] only considered an LR pair if the expression values of both the ligand and the receptor were above a certain threshold [log2(FPKM) ≥ 5]. By contrast, SingleCellSignalR is based on the product of LR gene expression levels [ 218 ].
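Two of the scoring categories named above, expression thresholding and expression product, can be sketched as follows. The sketch assumes that the ligand is summarized by its mean expression in the sender cluster and the receptor by its mean expression in the receiver cluster; the function names, the cutoff and the toy values are hypothetical, and individual tools define their scores differently.

```python
from statistics import mean

def threshold_score(ligand_expr, receptor_expr, cutoff):
    """Expression thresholding: the pair counts only if both means reach the cutoff."""
    return mean(ligand_expr) >= cutoff and mean(receptor_expr) >= cutoff

def product_score(ligand_expr, receptor_expr):
    """Expression product: mean ligand expression times mean receptor expression."""
    return mean(ligand_expr) * mean(receptor_expr)

# Made-up log-scale expression values for one LR pair.
sender_ligand = [6.0, 5.5, 6.5]        # ligand in cells of the sender cluster
receiver_receptor = [5.0, 7.0, 6.0]    # receptor in cells of a receiver cluster
other_receptor = [0.5, 1.0, 0.0]       # receptor in a cluster that barely expresses it

strong_product = product_score(sender_ligand, receiver_receptor)
weak_product = product_score(sender_ligand, other_receptor)
strong_call = threshold_score(sender_ligand, receiver_receptor, cutoff=5.0)
weak_call = threshold_score(sender_ligand, other_receptor, cutoff=5.0)
```

Thresholding yields a yes/no call per cluster pair, whereas the product gives a continuous score that can be ranked across cluster pairs.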

Fig. 3 The data resources, computational pipelines, and visualization methods used for cell–cell communication (CCC) inference with scRNA-seq data. Typical analysis steps include the collection of ligand-receptor pairs ( a ), cell clustering and annotation in scRNA-seq ( b ), computational prediction of CCC ( c ), followed by results visualization and downstream analysis ( d ). The CCC inference tools can be categorized into three main classes: network-based, machine learning-based and spatial information-based approaches. LRI ligand-receptor interaction, scRNA-seq single-cell RNA sequencing, CCCExplorer cell–cell communication explorer, NATMI network analysis toolkit for multicellular interactions, histoCAT histology topography cytometry analysis toolbox, SoptSC similarity matrix-based optimization for single-cell data analysis, PyMINEr Python maximal information network exploration resource, Squidpy spatial quantification of molecular data in Python

Computational methods for predicting CCC from scRNA-seq data continue to be developed [ 221 ]. The CCC inference tools can be categorized into three main classes according to their distinguishing features (Fig. 3c): network-based, machine learning-based and spatial information-based approaches [ 221 ]. Network-based approaches, including NicheNet [ 224 ], cell–cell communication explorer (CCCExplorer) [ 225 ], scConnect [ 226 ] and network analysis toolkit for multicellular interactions (NATMI) [ 227 ], leverage the connection network between genes to predict CCC. For instance, NicheNet integrates single-cell expression data with prior knowledge of signaling pathways and gene regulatory networks [ 224 ] and applies the personalized PageRank algorithm to calculate ligand–target regulatory potential scores [ 228 ]. Machine learning-based approaches adopt various types of machine learning algorithms; examples include SingleCellSignalR [ 218 ], similarity matrix-based optimization for single-cell data analysis (SoptSC) [ 229 ] and Python maximal information network exploration resource (PyMINEr) [ 230 ]. In addition, reference component analysis (RCA)-CCA [ 231 ], linear regression [ 232 ] and decision tree classifiers [ 233 ] have also been used for CCC prediction. Spatial proximity between cells is a prerequisite of CCC; hence, accounting for spatial information should improve the accuracy of CCC inference. With the rapid development of spatial transcriptomics, many CCC inference approaches integrate scRNA-seq data with spatial transcriptomic and/or image data to identify CCC. CellTalker scored communication among cell types by counting the number of LRIs, which was then assessed by spatial proximity between cells using image data [ 234 ]. 
In addition, spatial quantification of molecular data in Python (Squidpy) [ 235 ] and histology topography cytometry analysis toolbox (histoCAT) [ 236 ] provide analysis frameworks for spatial omics data, in which intercellular communication can be investigated through cellular proximity or neighborhood analysis. Moreover, the authors of CellChat used spatial information as the gold standard to evaluate different CCC inference approaches, and showed that CellChat performs better at predicting stronger interactions [ 237 ]. Finally, the inference results are usually visualized with heatmaps, circos plots, Sankey plots and bubble plots (Fig. 3d).
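The personalized PageRank idea mentioned above for network-based approaches can be sketched generically: a random walk on a directed gene network restarts at a chosen seed (here a ligand), and the stationary visiting probability of each gene serves as a proximity score to that seed. This is a textbook power-iteration sketch under assumed inputs, not NicheNet's implementation; the toy network, the gene labels, the damping factor and the handling of dangling nodes are all illustrative choices.

```python
def personalized_pagerank(graph, seed, damping=0.85, iters=100):
    """Power iteration for PageRank with all restart mass placed on `seed`.

    graph: dict mapping each node to a list of out-neighbours
           (every neighbour must also be a key of the dict).
    """
    nodes = list(graph)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) if n == seed else 0.0 for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    new[m] += share
            else:
                # Dangling node: send its mass back to the seed.
                new[seed] += damping * rank[n]
        rank = new
    return rank

# Hypothetical signaling/regulatory network.
network = {
    "ligand": ["receptor"],
    "receptor": ["tf"],
    "tf": ["target_a", "target_b"],
    "target_a": [],
    "target_b": [],
    "other": ["target_b"],   # not reachable from the ligand
}
scores = personalized_pagerank(network, seed="ligand")
```

Genes that cannot be reached from the seed, such as `other` in this toy network, receive a score of zero, whereas downstream targets receive a positive score that decreases with network distance and branching.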

The emerging computational methods for identifying CCC have improved our understanding of the microenvironment in disease development. However, all of these methods depend on prior knowledge of LRIs and on statistical or machine learning models to predict potential CCC events. Choosing different LRI resources and prediction approaches may lead to different results, yet the impact of these choices is largely unknown. To address this issue, one recent study systematically compared 16 resources and 7 methods for CCC inference, as well as the consensus of the compared methods [ 214 ]. The comparison demonstrated that different LRI resources covered varying fractions of the collective prior knowledge and that the predicted CCC events were largely inconsistent with each other, suggesting the need for continued efforts to improve CCC inference resources and tools.

Regulon inference and TF activity prediction

TFs play essential roles in gene expression regulation and are involved in various physiological and pathological processes in humans [ 238 ]. With scRNA-seq, it has become possible to identify co-expression modules that are directly regulated by TFs of interest; these modules are defined as regulons [ 188 ]. It is therefore possible to chart cell type-specific regulons and to reconstruct regulatory networks in individual cells (Fig. 4).

Fig. 4 Different strategies and approaches developed for regulon inference and TF activity prediction with scRNA-seq. To achieve regulon and TF activity prediction, the TF databases and TF-target databases are important resources, and the computational strategies include co-expression gene module identification, dynamic and stochastic modeling of TF versus target expression changes, and application of machine learning approaches. TF transcription factor, scRNA-seq single-cell RNA sequencing, AnimalTFDB Animal Transcription Factor DataBase, Cistrome DB Cistrome Data Browser, WGCNA weighted gene co-expression network analysis, SCENIC single-cell regulatory network inference and clustering, TRRUST transcriptional regulatory relationships unravelled by sentence-based text-mining

TF-target databases are an important resource for identifying regulons. The Animal Transcription Factor DataBase (AnimalTFDB) [ 239 ], JASPAR [ 240 ], transcriptional regulatory relationships unravelled by sentence-based text-mining (TRRUST) [ 241 ], KnockTF [ 242 ], and Cistrome Data Browser (Cistrome DB) [ 243 ] are widely used TF annotation databases that cover most human and mouse TFs. Based on these databases, a simple way to build cell type-specific transcriptional regulatory networks is to identify up-regulated TFs and/or differentially expressed TF-target genes. For instance, a recent scRNA-seq study identified differentially expressed TFs based on AnimalTFDB TF annotation, and revealed that the reactivation of TFs expressed in fetal epithelium may be the cause of Crohn’s disease [ 244 ].

Many methods have been developed to infer regulons and TF activity by integrating single-cell gene expression with comprehensive TF-target information. Co-expression analysis, such as weighted gene co-expression network analysis (WGCNA) [ 245 ], has been widely used in bulk samples to detect gene modules that are likely regulated by the same TF(s). Recently, this approach has also been applied to scRNA-seq data, for example to discover gene modules whose expression changed significantly over the course of HIV infection [ 246 ]. The single-cell regulatory network inference and clustering (SCENIC) method is the earliest method for regulon inference based on scRNA-seq data [ 188 ], and has been used to study regulatory networks in many diseases such as cancer and COVID-19 [ 247 , 248 ]. In SCENIC, co-expression modules between TFs and their candidate target genes are first inferred with machine learning methods such as random forest regression. Regulons are then identified through TF binding motif analysis, so that only the direct targets in the co-expression modules are kept to form the regulons. Finally, binarized scores are calculated to indicate the activity of each TF in each cell. Other methods, including SCODE [ 249 ] and SINCERITIES [ 250 ], take advantage of the pseudo-temporal information reconstructed from scRNA-seq and infer TF-target regulatory networks based on ordinary differential equation or stochastic differential equation models. Moreover, machine learning techniques have also been applied to transcriptional regulation analysis. For example, SIGNET [ 251 ] adopts multiple-layer perceptron bagging to identify regulons, whereas DeepDRIM [ 252 ] uses a supervised deep neural network to reconstruct gene regulatory networks. In particular, DeepDRIM was shown to be tolerant to dropout events in scRNA-seq and identified distinct regulatory networks of B cells in COVID-19 patients with mild and severe symptoms.
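The three-step logic described above for SCENIC (co-expression, motif-based pruning, binarized activity) can be mimicked with a deliberately simplified sketch. Note the substitutions, which are assumptions made for brevity: Pearson correlation stands in for random forest regression, a hand-written set of motif-supported targets stands in for motif enrichment analysis, and the mean expression of regulon genes with a fixed cutoff stands in for the scoring and binarization used by SCENIC. All names and values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def build_regulon(tf_expr, gene_expr, motif_targets, min_corr=0.8):
    """Steps 1-2: keep genes co-expressed with the TF that also have motif support."""
    return sorted(
        g for g, expr in gene_expr.items()
        if g in motif_targets and pearson(tf_expr, expr) >= min_corr
    )

def binarized_activity(regulon, gene_expr, n_cells, threshold):
    """Step 3: per-cell mean expression of regulon genes, turned into on/off calls."""
    if not regulon:
        return [False] * n_cells
    scores = [
        sum(gene_expr[g][i] for g in regulon) / len(regulon)
        for i in range(n_cells)
    ]
    return [s > threshold for s in scores]

# Toy expression values across four cells.
tf_expr = [0.0, 1.0, 2.0, 3.0]
gene_expr = {
    "geneA": [0.0, 2.0, 4.0, 6.0],   # co-expressed and motif-supported
    "geneB": [1.0, 2.0, 3.0, 4.0],   # co-expressed but no motif support
    "geneC": [3.0, 1.0, 2.0, 0.0],   # motif-supported but anti-correlated
}
motif_targets = {"geneA", "geneC"}   # hypothetical motif analysis result

regulon = build_regulon(tf_expr, gene_expr, motif_targets)
active = binarized_activity(regulon, gene_expr, n_cells=4, threshold=3.0)
```

Only `geneA` passes both filters, which illustrates why motif pruning is needed: `geneB` is co-expressed with the TF but would be treated as an indirect target.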

Although many methods have been developed for gene regulation analysis based on scRNA-seq, the inferred results must be evaluated rigorously because of the complexity of transcriptional regulation and the limited information provided by scRNA-seq data. Validation experiments can strengthen the inferred results [ 253 , 254 ].

Metabolic analysis

Metabolism is at the core of all biological processes, and metabolic dysregulation is a hallmark of many diseases, including cancer, diabetes, and cardiovascular disease [ 255 ]. Although single-cell metabolomics technologies are developing rapidly, they are not yet mature enough for large-scale applications [ 256 ]. Instead, metabolic analysis based on single-cell transcriptomics is a promising alternative approach. For example, researchers may use scRNA-seq to monitor the expression changes of key metabolic genes under different treatments [ 257 ] or during important physiological/pathological processes [ 258 ].

The computational tools for scRNA-seq-based metabolic analysis can be classified into two major categories: pathway-based analysis and flux balance analysis (FBA)-based methods [ 256 ] (Fig. 5). For the first category, the standard functional enrichment analysis approaches are generally used (refer to the subsection entitled Functional enrichment analysis). In particular, the R package scMetabolism provides an integrated framework for quantitative analysis of metabolic pathway activity in scRNA-seq. It can account for dropouts and is compatible with multiple tools designed for single-cell functional enrichment analysis [ 259 ], including ssGSEA [ 183 , 184 ], Vision [ 186 ], and AUCell [ 188 ].

Fig. 5 Two main types of metabolic analysis within scRNA-seq: pathway-based functional enrichment analysis and flux balance analysis of metabolic flow. The former makes use of standard functional enrichment analysis, whereas the latter utilizes constraint-based mathematical models to systematically simulate metabolism in metabolic networks. Methods including scFBA, Compass, and scFEA employed different implementation strategies for flux balance analysis of metabolic flow. FBA flux balance analysis, KEGG Kyoto Encyclopedia of Genes and Genomes, UMAP uniform manifold approximation and projection, scRNA-seq single-cell RNA sequencing, scFBA single-cell flux balance analysis, scFEA single-cell flux estimation analysis, PCA principal component analysis

The other category comprises the FBA-based methods, in which constraint-based mathematical models are used to systematically simulate metabolism in reconstructed metabolic networks [ 260 ]. The reconstruction of metabolic networks is usually based on curated databases, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) [ 261 ] and Reactome [ 262 ]; thereafter, FBA computes steady-state metabolic fluxes in the system such that the constraints on the input and output fluxes are satisfied [ 263 ]. Expression levels of individual enzymes in single cells may not directly affect metabolic fluxes in the networks, because the fluxes are mostly dependent on the network topology and constraints [ 256 ]. To our knowledge, single-cell flux balance analysis (scFBA) was the first computational tool to combine scRNA-seq data and FBA to estimate single-cell fluxomes [ 264 ]. Later, Compass [ 265 ] and single-cell flux estimation analysis (scFEA) [ 255 ] were proposed. Compass is based on the Recon2 reconstruction of human metabolism [ 266 ] and solves constraint-based optimization problems with linear programming to score the potential activity of each metabolic reaction in individual cells [ 265 ]. By contrast, scFEA introduces a probabilistic model to consider the flux balance constraints, a multilayer neural network to model the nonlinear relationship between flux changes and enzymatic gene expression changes, and a graph neural network to solve the optimization problem [ 255 ]. The results produced by scFEA enable a variety of biologically meaningful downstream analyses, such as the analysis of cell–cell metabolic communications.
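For orientation, the generic FBA problem referred to above is commonly written as a linear program. The notation is introduced here for illustration and is not taken from any of the cited tools: S is the stoichiometric matrix of the reconstructed network (metabolites by reactions), v is the vector of reaction fluxes, c holds the objective weights (for example, a single biomass or target reaction), and l and u are lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& l \le v \le u .
\end{aligned}
```

The equality constraint encodes the steady-state assumption that every internal metabolite is produced and consumed at equal rates. How single-cell expression data enter this problem, for instance through the bounds or through the objective, differs between methods and is not specified by this generic form.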

A collected resource for scRNA-seq data analysis with biomedical applications

With the above overview of the analysis steps and tools for scRNA-seq data, this review may help biomedical researchers to design their data processing and analysis frameworks. However, it would still be challenging for researchers without a bioinformatics background to implement the analysis tasks for their own data. For instance, scRNA-seq data analysis requires installing specific software tools and running scripts written in programming languages such as R and Python. To this end, we collected a range of widely used software tools for scRNA-seq and provided practical guidance for installing them and running the analysis with simple commands. The software collection, practical examples, and brief descriptions of the analysis results are available at https://github.com/WXlab-NJMU/scrna-recom . Notably, due to time and space constraints, we are unable to incorporate all popular tools into the analysis pipelines on the GitHub site; however, we provide a list of currently available tools with accessible links for users’ convenience (Additional file 1: Table S1). We are also open to suggestions from the community and will adjust the pipelines accordingly. Several research domains in scRNA-seq data analysis are still under active development, and we will keep updating the related software and adjusting the scripts to incorporate the progress made in these domains.

Focusing on single-cell transcriptomics, we have reviewed almost all aspects of a typical analysis of scRNA-seq data, ranging from QC and basic data processing to high-level analyses including trajectory inference, CCC estimation and metabolic analysis. To facilitate researchers conducting the analysis on their own data, we have constructed an online software/script repository for these analysis steps, and will keep it updated to cover more research scenarios. We also offer a step-by-step command line interface (CLI) that wraps the R and Python scripts for scRNA-seq analysis. The step-wise commands can be flexibly combined and tailored to specific applications, accommodating the diversity of scientific questions and experimental designs. Moreover, when cutting-edge technologies are incorporated, the analysis steps reviewed above may not cover every required task. Indeed, an additional analysis pipeline ( https://github.com/WXlab-NJMU/scPolylox ) was necessary to process the scRNA-seq data for identifying Polylox transcript variants in lineage tracing [ 267 ].

In this review, we did not discuss gene expression imputation, which aims to alleviate the impact of the well-known dropout issue in scRNA-seq [ 268 ]. This is because all the analyses reviewed in this article can be carried out without data imputation; moreover, one comparative study reported that imputation did not improve downstream analysis compared with no imputation [ 269 ]. Nevertheless, expression data imputation may help when the expression diversity of important genes or gene pairs needs to be investigated [ 270 ]. Additionally, the data integration step for removing the effect of covariates can also be optional. For instance, in a complex experimental design where tumor tissues and peritumor tissues are collected from liver cancer patients of different cancer subtypes, the strategy for integrating the datasets may differ depending on whether the common features of liver cancer or the subtype-specific features are of interest.

Previous research has classified the downstream scRNA-seq data analysis methods into cell-level and gene-level analysis [ 56 ], which is intuitive and helpful for understanding. While cell-level analysis is typically concerned with the cell composition of given tissues or samples, gene-level analysis focuses on gene expression differences and heterogeneity. As a result, cell clustering for subpopulation identification, trajectory analysis, and CCC inference are examples of cell-level analysis, whereas differential expression, functional enrichment analysis, regulon inference, and metabolic flux analysis are primarily concerned with gene-level information. In contrast to bulk RNA-seq, single-cell RNA-seq allows for cell-level analysis with unprecedented accuracy and throughput, which in turn inspires a few types of gene-level analysis, such as marker gene identification and gene expression dynamics along inferred trajectories.

One more important point in scRNA-seq data analysis is data presentation and interpretation. Although there are no standard protocols for presenting and interpreting the analysis results, these procedures directly link the data with scientific conclusions. In particular, choosing the most appropriate plots conveys the message more directly. For instance, if one wants to compare the expression levels of a particular gene between tumor and peritumor samples, violin plots showing the two distributions of expression levels would be more appropriate than t-SNE or UMAP plots that visualize individual cells with color scales indicating the expression levels. Moreover, using t-SNE or UMAP visualization to compare the composition of cell origins (e.g., from tumor samples or peritumor samples) in a cell subtype of interest might be misleading, although it is more intuitive. This is because a large number of cells are usually profiled in a scRNA-seq experiment, and consequently some cell points can be hidden beneath others in the two-dimensional visualization. Other types of plots that demonstrate the composition directly and more quantitatively would be more suitable.
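As a small illustration of the quantitative alternative, the composition of sample origins within a cell subtype can be tabulated directly from per-cell labels and then shown as, for example, a stacked bar plot. The sketch below assumes two parallel lists of per-cell annotations; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def origin_fractions(clusters, origins, cluster_of_interest):
    """Fraction of cells from each sample origin within one cluster."""
    counts = Counter(
        origin for cluster, origin in zip(clusters, origins)
        if cluster == cluster_of_interest
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no cells found in the requested cluster")
    return {origin: n / total for origin, n in counts.items()}

# Made-up per-cell annotations.
clusters = ["T cell", "T cell", "T cell", "T cell", "B cell", "B cell"]
origins = ["tumor", "tumor", "tumor", "peritumor", "tumor", "peritumor"]

t_cell_composition = origin_fractions(clusters, origins, "T cell")
```

Unlike overlapping points in a two-dimensional embedding, these fractions are not affected by the drawing order of the cells.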

Many other aspects of scRNA-seq data analysis are advancing rapidly. For example, scAPAtrap [ 271 ], Sierra [ 272 ], dynamic analysis of alternative polyadenylation (APA) from single-cell RNA-seq (scDaPars) [ 273 ], SCAPTURE [ 274 ], and single cell alternative polyadenylation using expectation–maximization (SCAPE) [ 275 ] take advantage of the fact that sequencing reads in 3’ tag-based scRNA-seq are distributed near the polyadenylation sites of individual transcripts to analyze alternative polyadenylation and the differential usage of 3’UTR isoforms between cells or cell types. Alternative UTR isoform usage is an important post-transcriptional regulatory mechanism in many physiological and pathological processes, affecting the rate of RNA degradation and the status of translation [ 276 , 277 ]. Many research groups are now combining scRNA-seq with long-read sequencing technologies to enable high-confidence isoform profiling at the single-cell level [ 278 , 279 , 280 ]. Such studies have paved the way for the examination of alternative splicing and transcript fusions between cells and/or cell types, as well as during the progression of diseases [ 278 ].

In addition to gene expression regulation by TFs, trans-factors like RNA binding proteins (RBPs) and microRNAs typically bind to the 3’UTR of genes to modulate RNA stability, which also contributes to cellular RNA concentration. Based on collections of RBP and microRNA target genes [ 281 , 282 ], RBP and microRNA regulons can be investigated similarly to the TF regulons [ 283 ] in scRNA-seq. In fact, this kind of co-expression module-based analysis can be extended to the examination of cellular signaling pathway activities. Furthermore, in conjunction with CCC inference [ 214 ] and ligand–target regulatory potential scores [ 224 ], the activation of certain signaling pathways may also be inferred using scRNA-seq data.

Very recently, Live-seq has been developed to convert scRNA-seq from an end-point assay to a temporal analysis workflow, by keeping cells alive while extracting RNA from individual cells [ 284 ]. It is anticipated that Live-seq will address a number of additional biological questions beyond the reach of scRNA-seq. In addition, other sequencing-based single-cell profiling technologies are under rapid development. To better understand the dysregulation of gene expression in disease conditions, single-cell assay for transposase-accessible chromatin using sequencing (ATAC-seq) [ 285 ], single-cell DNA methylation profiling [ 286 ], and single-cell Hi-C [ 287 ] are all useful for dissecting the underlying regulatory mechanisms from different angles at single-cell resolution. Algorithms have also been developed to integrate these multimodal single-cell data [ 63 ], making it possible to better resolve cell states and define novel cell subtypes. Moreover, single-cell multi-omics approaches enable the simultaneous profiling of multiple omics layers in the same cells [ 288 ], providing information on both regulatory elements and the resulting gene expression levels for individual cells. The datasets generated by these technologies may help biomedical researchers to discover disease-specific regulatory programs, possibly in subsets of certain cell types [ 289 ]. Furthermore, although still in the developmental stage, spatial transcriptomics is a promising technique for considering the cellular context when characterizing the molecular features of a particular cell [ 290 ]. With ever-increasing resolution in spatial transcriptomics, we anticipate gaining more in-depth knowledge of the cell microenvironment and cell–cell interactions in health and disease. 
Collectively, with technologies continuously advancing, especially those that resolve molecular properties and interactions at single-cell resolution, we will be able to better understand the pathogenesis of a variety of diseases and enable personalized therapies in the near future.

Availability of data and materials

The online repository of software and wrapped-up command line interface (CLI) is available at https://github.com/WXlab-NJMU/scrna-recom .

Abbreviations

Animal Transcription Factor DataBase

Automated cell type identification using neural networks

Alternative polyadenylation

Assay for transposase-accessible chromatin using sequencing

Batch balanced k nearest neighbours

Binary classification based doublet scoring

Bias-corrected sequencing analysis

Canonical correlation analysis

Cell growth correction

Characterization of cell types aided by hierarchical classification

Cistrome Data Browser

Clustering through imputation and dimensionality reduction

Command line interface

Coronavirus disease 2019

Co-expression based doublet scoring

Database for Annotation, Visualization, and Integrated Discovery

Differentially expressed genes

Digital gene expression

Diffusion pseudotime

Epithelial-to-mesenchymal transition

Expectation-maximization

Factorial single-cell latent variable model

Fast correction for ambient RNA

Flux balance analysis

Gene set enrichment analysis

Gene set variation analysis

Highly variable genes

Histology topography cytometry analysis toolbox

Iterative clustering and guide-gene selection

Jointly assessing signature mean and inferring enrichment

k-nearest-neighbor batch-effect test

Kyoto Encyclopedia of Genes and Genomes

Ligand-receptor

Ligand-receptor interaction

Linked inference of genomic experimental relationships

Model-based analysis of single-cell transcriptomics

Molecular Signatures Database

Multi-dimensional scaling

Mixtures of factor analysers

Mutual nearest neighbors

Network analysis toolkit for multicellular interactions

Non-negative matrix factorization

Ontology-based single cell classification

Partition-based graph abstraction

Peripheral blood mononuclear cells

Principal component analysis

Principal components

Python maximal information network exploration resource

Quality control

Rare cell type identification

Reference component analysis

RNA binding proteins

Semi-supervised category identification and assignment

Severe acute respiratory syndrome coronavirus 2

Single cell alternative polyadenylation using expectation–maximization

Single cell recognition

Single-cell regulatory network inference and clustering

Single-cell signature explorer

Single sample GSEA

Single-cell Aggregated Clustering via Mixture Model Ensemble clustering

Single-cell cluster-based automatic annotation toolkit for cellular heterogeneity

Single-cell consensus clustering

Dynamic analysis of APA from single-cell RNA-seq

Single-cell differential expression

Single-cell flux balance analysis

Single-cell flux estimation analysis

Single-cell latent variable model

Single-cell remover of doublets

Single-cell RNA sequencing

Variance-driven multitask clustering of scRNA-seq data

Single-cell interpretation via multikernel learning

Selective locally linear inference of cellular expression relationships

Shared nearest neighbor

Similarity matrix-based optimization for single-cell data analysis

Spatial quantification of molecular data in Python

Support vector machine

Tools for single cell analysis

t-distributed stochastic neighbor embedding

Transcription factor

Transcript per million

Trimmed mean of M values

Transcriptional regulatory relationships unravelled by sentence-based text-mining

Uniform manifold approximation and projection

Unique molecular identifier

Upper quartile

Weighted gene co-expression network analysis

Zero-inflated factor analysis

Sklavenitis-Pistofidis R, Getz G, Ghobrial I. Single-cell RNA sequencing: one step closer to the clinic. Nat Med. 2021;27(3):375–6.

Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14(9):618–30.

Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The technology and biology of single-cell RNA sequencing. Mol Cell. 2015;58(4):610–20.

Nawy T. Single-cell sequencing. Nat Methods. 2014;11(1):18.

Griffiths JA, Scialdone A, Marioni JC. Using single-cell genomics to understand developmental processes and cell fate decisions. Mol Syst Biol. 2018;14(4):e8046.

Briggs JA, Weinreb C, Wagner DE, Megason S, Peshkin L, Kirschner MW, et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science. 2018;360(6392):eaar5780.

Jerby-Arnon L, Shah P, Cuoco MS, Rodman C, Su MJ, Melms JC, et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell. 2018;175(4):984-97.e24.

Kuppe C, Ibrahim MM, Kranz J, Zhang X, Ziegler S, Perales-Paton J, et al. Decoding myofibroblast origins in human kidney fibrosis. Nature. 2021;589(7841):281–6.

Bossel Ben-Moshe N, Hen-Avivi S, Levitin N, Yehezkel D, Oosting M, Joosten LaB, et al. Predicting bacterial infection outcomes using single cell RNA-sequencing analysis of human immune cells. Nat Commun. 2019;10(1):3266.

Li Y, Jin J, Bai F. Cancer biology deciphered by single-cell transcriptomic sequencing. Protein Cell. 2022;13(3):167–79.

Jia Q, Chu H, Jin Z, Long H, Zhu B. High-throughput single-cell sequencing in cancer research. Signal Transduct Target Ther. 2022;7(1):145.

Vladoiu MC, El-Hamamy I, Donovan LK, Farooq H, Holgado BL, Sundaravadanam Y, et al. Childhood cerebellar tumours mirror conserved fetal transcriptional programs. Nature. 2019;572(7767):67–73.

Blanpain C. Tracing the cellular origin of cancer. Nat Cell Biol. 2013;15(2):126–34.

Jin S, Li R, Chen MY, Yu C, Tang LQ, Liu YM, et al. Single-cell transcriptomic analysis defines the interplay between tumor cells, viral infection, and the microenvironment in nasopharyngeal carcinoma. Cell Res. 2020;30(11):950–65.

Pastushenko I, Brisebarre A, Sifrim A, Fioramonti M, Revenco T, Boumahdi S, et al. Identification of the tumour transition states occurring during EMT. Nature. 2018;556(7702):463–8.

Chung W, Eum HH, Lee HO, Lee KM, Lee HB, Kim KT, et al. Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer. Nat Commun. 2017;8:15081.

Kim J, Koo BK, Knoblich JA. Human organoids: model systems for human biology and medicine. Nat Rev Mol Cell Biol. 2020;21(10):571–84.

Wang R, Mao Y, Wang W, Zhou X, Wang W, Gao S, et al. Systematic evaluation of colorectal cancer organoid system by single-cell RNA-Seq analysis. Genome Biol. 2022;23(1):106.

Wu H, Uchimura K, Donnelly EL, Kirita Y, Morris SA, Humphreys BD. Comparative analysis and refinement of human PSC-derived kidney organoid differentiation with single-cell transcriptomics. Cell Stem Cell. 2018;23(6):869-81.e8.

Neal JT, Li X, Zhu J, Giangarra V, Grzeskowiak CL, Ju J, et al. Organoid modeling of the tumor immune microenvironment. Cell. 2018;175(7):1972-88.e16.

Vlachogiannis G, Hedayat S, Vatsiou A, Jamin Y, Fernandez-Mateos J, Khan K, et al. Patient-derived organoids model treatment response of metastatic gastrointestinal cancers. Science. 2018;359(6378):920–6.

Broutier L, Mastrogiovanni G, Verstegen MM, Francies HE, Gavarro LM, Bradshaw CR, et al. Human primary liver cancer-derived organoid cultures for disease modeling and drug screening. Nat Med. 2017;23(12):1424–35.

Krieger TG, Le Blanc S, Jabs J, Ten FW, Ishaque N, Jechow K, et al. Single-cell analysis of patient-derived PDAC organoids reveals cell state heterogeneity and a conserved developmental hierarchy. Nat Commun. 2021;12(1):5826.

Guillen KP, Fujita M, Butterfield AJ, Scherer SD, Bailey MH, Chu Z, et al. A human breast cancer-derived xenograft and organoid platform for drug discovery and precision oncology. Nat Cancer. 2022;3(2):232–50.

Ziegler CGK, Allon SJ, Nyquist SK, Mbano IM, Miao VN, Tzouanas CN, et al. SARS-CoV-2 receptor ACE2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific cell subsets across tissues. Cell. 2020;181(5):1016-35.e19.

Stephenson E, Reynolds G, Botting RA, Calero-Nieto FJ, Morgan MD, Tuong ZK, et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat Med. 2021;27(5):904–16.

Tian Y, Carpp LN, Miller HER, Zager M, Newell EW, Gottardo R. Single-cell immunology of SARS-CoV-2 infection. Nat Biotechnol. 2022;40(1):30–41.

Melms JC, Biermann J, Huang H, Wang Y, Nair A, Tagore S, et al. A molecular single-cell lung atlas of lethal COVID-19. Nature. 2021;595(7865):114–9.

Zhang X, Li T, Liu F, Chen Y, Yao J, Li Z, et al. Comparative analysis of droplet-based ultra-high-throughput single-cell RNA-seq systems. Mol Cell. 2019;73(1):130-42.e5.

Wang X, He Y, Zhang Q, Ren X, Zhang Z. Direct comparative analyses of 10x Genomics Chromium and Smart-seq2. Genomics Proteom Bioinform. 2021;19(2):253–66.

Wu F, Fan J, He Y, Xiong A, Yu J, Li Y, et al. Single-cell profiling of tumor heterogeneity and the microenvironment in advanced non-small cell lung cancer. Nat Commun. 2021;12(1):2540.

Xu K, Wang R, Xie H, Hu L, Wang C, Xu J, et al. Single-cell RNA sequencing reveals cell heterogeneity and transcriptome profile of breast cancer lymph node metastasis. Oncogenesis. 2021;10(10):66.

Haque A, Engel J, Teichmann SA, Lonnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med. 2017;9(1):75.

Lafzi A, Moutinho C, Picelli S, Heyn H. Tutorial: guidelines for the experimental design of single-cell RNA sequencing studies. Nat Protoc. 2018;13(12):2742–57.

Kinker GS, Greenwald AC, Tal R, Orlova Z, Cuoco MS, McFarland JM, et al. Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity. Nat Genet. 2020;52(11):1208–18.

Suva ML, Tirosh I. Single-cell RNA sequencing in cancer: lessons learned and emerging challenges. Mol Cell. 2019;75(1):7–12.

Ramachandran P, Matchett KP, Dobie R, Wilson-Kanamori JR, Henderson NC. Single-cell technologies in hepatology: new insights into liver biology and disease pathogenesis. Nat Rev Gastroenterol Hepatol. 2020;17(8):457–72.

Ni J, Wang X, Stojanovic A, Zhang Q, Wincher M, Buhler L, et al. Single-cell RNA sequencing of tumor-infiltrating NK cells reveals that inhibition of transcription factor HIF-1α unleashes NK cell activity. Immunity. 2020;52(6):1075-87.e8.

Zheng C, Zheng L, Yoo JK, Guo H, Zhang Y, Guo X, et al. Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Cell. 2017;169(7):1342-56.e16.

Wilk AJ, Rustagi A, Zhao NQ, Roque J, Martinez-Colon GJ, McKechnie JL, et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat Med. 2020;26(7):1070–6.

Wang Z, Xie L, Ding G, Song S, Chen L, Li G, et al. Single-cell RNA sequencing of peripheral blood mononuclear cells from acute Kawasaki disease patients. Nat Commun. 2021;12(1):5444.

Clevers H. Modeling development and disease with organoids. Cell. 2016;165(7):1586–97.

Salahudeen AA, Choi SS, Rustagi A, Zhu J, van Unen V, de la OS, et al. Progenitor identification and SARS-CoV-2 infection in human distal lung organoids. Nature. 2020;588(7839):670–5.

Perez RK, Gordon MG, Subramaniam M, Kim MC, Hartoularos GC, Targ S, et al. Single-cell RNA-seq reveals cell type-specific molecular and genetic associations to lupus. Science. 2022;376(6589):eabf1970.

Ernster VL. Nested case-control studies. Prev Med. 1994;23(5):587–90.

Mandric I, Schwarz T, Majumdar A, Hou K, Briscoe L, Perez R, et al. Optimized design of single-cell RNA sequencing experiments for cell-type-specific eQTL analysis. Nat Commun. 2020;11(1):5504.

Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.

Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res. 2017;27(3):491–9.

Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.

Tian L, Su S, Dong X, Amann-Zalcenstein D, Biben C, Seidi A, et al. scPipe: a flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data. PLoS Comput Biol. 2018;14(8):e1006361.

Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. zUMIs—a fast and flexible pipeline to process RNA sequencing data with UMIs. Gigascience. 2018;7(6):giy059.

Hashimshony T, Senderovich N, Avital G, Klochendler A, de Leeuw Y, Anavy L, et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-Seq. Genome Biol. 2016;17:77.

Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KHJ, et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol. 2021;39(7):813–8.

Wang Z, Hu J, Johnson WE, Campbell JD. scruff: an R/Bioconductor package for preprocessing single-cell RNA-sequencing data. BMC Bioinform. 2019;20(1):222.

You Y, Tian L, Su S, Dong X, Jabbari JS, Hickey PF, et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol. 2021;22(1):339.

Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746.

Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–45.

Brennecke P, Anders S, Kim JK, Kolodziejczyk AA, Zhang X, Proserpio V, et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013;10(11):1093–5.

Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc. 2021;16(1):1–9.

Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 2016;17:29.

Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM 3rd, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888-902.e21.

Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-87.e29.

McCarthy DJ, Campbell KR, Lun AT, Wills QF. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017;33(8):1179–86.

Guimaraes JC, Zavolan M. Patterns of ribosomal protein expression specify normal and malignant human cells. Genome Biol. 2016;17(1):236.

Oelen R, de Vries DH, Brugge H, Gordon MG, Vochteloo M, Ye CJ, et al. Single-cell RNA-sequencing of peripheral blood mononuclear cells reveals widespread, context-specific gene expression regulation upon pathogenic exposure. Nat Commun. 2022;13(1):3267.

Zhong S, Ding W, Sun L, Lu Y, Dong H, Fan X, et al. Decoding the development of the human hippocampus. Nature. 2020;577(7791):531–6.

Young MD, Behjati S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. Gigascience. 2020;9(12):giaa151.

Yang S, Corbett SE, Koga Y, Wang Z, Johnson WE, Yajima M, et al. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol. 2020;21(1):57.

Berg M, Petoukhov I, Van Den Ende I, Meyer KB, Guryev V, Vonk JM, et al. FastCAR: fast correction for Ambient RNA to facilitate differential gene expression analysis in single-cell RNA-sequencing datasets. bioRxiv. 2022. https://doi.org/10.1101/2022.07.19.500594

Fleming SJ, Chaffin MD, Arduini A, Akkad AD, Banks E, Marioni JC, et al. Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. bioRxiv. 2022. https://doi.org/10.1101/791699

Xi NM, Li JJ. Benchmarking computational doublet-detection methods for single-cell RNA sequencing data. Cell Syst. 2021;12(2):176-94.e6.

Bernstein NJ, Fong NL, Lam I, Roy MA, Hendrickson DG, Kelley DR. Solo: doublet identification in single-cell RNA-Seq via semi-supervised deep learning. Cell Syst. 2020;11(1):95-101.e5.

Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst. 2019;8(4):281-91.e9.

Lun AT, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122.

Bais AS, Kostka D. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics. 2020;36(4):1150–8.

Park J, Choi W, Tiesmeyer S, Long B, Borm LE, Garren E, et al. Cell segmentation-free inference of cell types from in situ transcriptomics data. Nat Commun. 2021;12(1):3545.

McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 2019;8(4):329-37.e4.

DePasquale EAK, Schnell DJ, Van Camp PJ, Valiente-Alandi I, Blaxall BC, Grimes HL, et al. DoubletDecon: deconvoluting doublets from single-cell RNA-sequencing data. Cell Rep. 2019;29(6):1718-27.e8.

Deeke JM, Gagnon-Bartsch JA. Stably expressed genes in single-cell RNA sequencing. J Bioinform Comput Biol. 2020;18(1):2040004.

Vallejos CA, Risso D, Scialdone A, Dudoit S, Marioni JC. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat Methods. 2017;14(6):565–71.

Finak G, McDavid A, Yajima M, Deng J, Gersuk V, Shalek AK, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16:278.

Grun D, van Oudenaarden A. Design and analysis of single-cell sequencing experiments. Cell. 2015;163(4):799–810.

Kharchenko PV, Silberstein L, Scadden DT. Bayesian approach to single-cell differential expression analysis. Nat Methods. 2014;11(7):740–2.

Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26(4):493–500.

Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinform. 2010;11:94.

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.

Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17:75.

Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nat Methods. 2017;14(3):309–15.

Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nat Methods. 2019;16(1):43–9.

Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):296.

Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, et al. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60.

Vento-Tormo R, Efremova M, Botting RA, Turco MY, Vento-Tormo M, Meyer KB, et al. Single-cell reconstruction of the early maternal-fetal interface in humans. Nature. 2018;563(7731):347–53.

Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.

Buettner F, Pratanwanich N, McCarthy DJ, Marioni JC, Stegle O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 2017;18(1):212.

Blasi T, Buettner F, Strasser MK, Marr C, Theis FJ. cgCorrect: a method to correct for confounding cell–cell variation due to cell growth in single-cell transcriptomics. Phys Biol. 2017;14(3):036001.

Kanton S, Boyle MJ, He Z, Santel M, Weigert A, Sanchis-Calleja F, et al. Organoid single-cell genomic atlas uncovers human-specific features of brain development. Nature. 2019;574(7778):418–22.

Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21(1):12.

Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

Smyth GK, Speed T. Normalization of cDNA microarray data. Methods. 2003;31(4):265–73.

Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol. 2018;36(5):421–7.

Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol. 2019;37(6):685–91.

Polański K, Young MD, Miao Z, Meyer KB, Teichmann SA, Park JE. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 2020;36(3):964–5.

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16(12):1289–96.

Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, Macosko EZ. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell. 2019;177(7):1873-87.e17.

Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat Methods. 2019;16(8):715–21.

Argelaguet R, Cuomo ASE, Stegle O, Marioni JC. Computational principles and challenges in single-cell data integration. Nat Biotechnol. 2021;39(10):1202–15.

Grun D, Kester L, Van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014;11(6):637–40.

Su K, Yu T, Wu H. Accurate feature selection improves single-cell RNA-seq cell clustering. Brief Bioinform. 2021;22(5):bbab034.

Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biol. 2019;20(1):295.

Yang P, Huang H, Liu C. Feature selection revisited in the single-cell era. Genome Biol. 2021;22(1):321.

Yip SH, Sham PC, Wang J. Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data. Brief Bioinform. 2019;20(4):1583–9.

Andrews TS, Hemberg M. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics. 2019;35(16):2865–7.

Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.

Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019;20(1):269.

Ringner M. What is principal component analysis? Nat Biotechnol. 2008;26(3):303–4.

Shao C, Hofer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2017;33(2):235–42.

Tzeng J, Lu HH, Li WH. Multidimensional scaling for large genomic data sets. BMC Bioinform. 2008;9:179.

Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10(1):5416.

Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37(1):38–44.

Gogolewski K, Sykulski M, Chung NC, Gambin A. Truncated robust principal component analysis and noise reduction for single cell RNA sequencing data. J Comput Biol. 2019;26(8):782–93.

Tsuyuzaki K, Sato H, Sato K, Nikaido I. Benchmarking principal component analysis for large-scale single-cell RNA-sequencing. Genome Biol. 2020;21(1):9.

Chung NC, Storey JD. Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics. 2015;31(4):545–54.

Pierson E, Yau C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16:241.

Shi J, Luo Z. Nonlinear dimensionality reduction of gene expression data for visualization and clustering analysis of cancer tissue samples. Comput Biol Med. 2010;40(8):723–32.

Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Brief Bioinform. 2020;21(4):1209–23.

van Unen V, Li N, Molendijk I, Temurhan M, Hollt T, van der Meulen-de Jong AE, et al. Mass cytometry of the human mucosal immune system identifies tissue- and disease-associated immune subsets. Immunity. 2016;44(5):1227–39.

Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525(7568):251–5.

Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014;343(6172):776–9.

Zhang W, Xue X, Zheng X, Fan Z. NMFLRR: clustering scRNA-Seq Data by integrating nonnegative matrix factorization with low rank representation. IEEE J Biomed Health Inform. 2022;26(3):1394–405.

Zheng R, Li M, Liang Z, Wu FX, Pan Y, Wang J. SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation. Bioinformatics. 2019;35(19):3642–50.

Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14(5):483–6.

Levine JH, Simonds EF, Bendall SC, Davis KL, El Amir AD, Tadmor MD, et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell. 2015;162(1):184–97.

Zeisel A, Munoz-Manchado AB, Codeluppi S, Lonnerberg P, La Manno G, Jureus A, et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347(6226):1138–42.

Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14(4):414–6.

Lin P, Troup M, Ho JW. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017;18(1):59.

Huh R, Yang Y, Jiang Y, Shen Y, Li Y. SAME-clustering: single-cell aggregated clustering via mixture model ensemble. Nucleic Acids Res. 2020;48(1):86–95.

Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 2018;7:1141.

Freytag S, Tian L, Lonnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Res. 2018;7:1297.

Sun X, Lin X, Li Z, Wu H. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq. Brief Bioinform. 2022;23(2):bbab567.

Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJT, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20(1):194.

Huang Q, Liu Y, Du Y, Garmire LX. Evaluation of cell type annotation R packages on single-cell RNA-seq data. Genomics Proteom Bioinform. 2021;19(2):267–81.

Zhang AW, O'Flanagan C, Chavez EA, Lim JLP, Ceglia N, McPherson A, et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods. 2019;16(10):1007–15.

Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods. 2018;15(5):359–62.

Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol. 2019;20(2):163–72.

de Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FCP. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 2019;47(16):e95.

Tan Y, Cahan P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 2019;9(2):207-13.e2.

Jiang L, Chen H, Pinello L, Yuan GC. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol. 2016;17(1):144.

Guo M, Wang H, Potter SS, Whitsett JA, Xu Y. SINCERA: a pipeline for single-cell RNA-Seq profiling analysis. PLoS Comput Biol. 2015;11(11):e1004575.

Zhang JM, Fan J, Fan HC, Rosenfeld D, Tse DN. An interpretable framework for clustering single-cell RNA-seq datasets. BMC Bioinform. 2018;19(1):93.

Pasquini G, Rojo Arias JE, Schäfer P, Busskamp V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J. 2021;19:961–9.

Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019;47(D1):D721–8.

Franzén O, Gan LM, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database. 2019;2019:baz046.

Xu M, Bai X, Ai B, Zhang G, Song C, Zhao J, et al. TF-Marker: a comprehensive manually curated database for transcription factors and related markers in specific cell and tissue types in human. Nucleic Acids Res. 2022;50(D1):D402–12.

Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun. 2022;13(1):1246.

Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 2021;22(1):69.

Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes. 2019;10(7):531.

Shao X, Liao J, Lu X, Xue R, Ai N, Fan X. scCATCH: automatic annotation on cell types of clusters from single-cell RNA sequencing data. iScience. 2020;23(3):100882.

Hou R, Denisenko E, Forrest ARR. scMatch: a single-cell gene expression profile annotation tool using reference datasets. Bioinformatics. 2019;35(22):4688–95.

Stunnenberg HG, International Human Epigenome Consortium, Hirst M. The international human epigenome consortium: a blueprint for scientific collaboration and discovery. Cell. 2016;167(5):1145–9.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.

Mabbott NA, Baillie JK, Brown H, Freeman TC, Hume DA. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genomics. 2013;14:632.

Ma F, Pellegrini M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics. 2020;36(2):533–8.

Alquicira-Hernandez J, Sathe A, Ji HP, Nguyen Q, Powell JE. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 2019;20(1):264.

Lin Y, Cao Y, Kim HJ, Salim A, Speed TP, Lin DM, et al. scClassify: sample size estimation and multiscale classification of cells using single and multiple reference. Mol Syst Biol. 2020;16(6):e9389.

Wang S, Pisco AO, McGeever A, Brbic M, Zitnik M, Darmanis S, et al. Leveraging the cell ontology to classify unseen cell types. Nat Commun. 2021;12(1):5556.

Jiang T, Zhou W, Sheng Q, Yu J, Xie Y, Ding N, et al. ImmCluster: an ensemble resource for immunology cell type clustering and annotations in normal and cancerous tissues. Nucleic Acids Res. 2022;22:gkac922.

Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495–502.

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550.

Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.

Li Y, Ge X, Peng F, Li W, Li JJ. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol. 2022;23(1):79.

Miao Z, Deng K, Wang X, Zhang X. DEsingle for detecting three types of differential expression in single-cell RNA-seq data. Bioinformatics. 2018;34(18):3223–4.

Olsson A, Venkatasubramanian M, Chaudhri VK, Aronow BJ, Salomonis N, Singh H, et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature. 2016;537(7622):698–702.

Zhang H, Lee CaA, Li Z, Garbe JR, Eide CR, Petegrosso R, et al. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa. PLoS Comput Biol. 2018;14(4):e1006053.

Vu TN, Wills QF, Kalari KR, Niu N, Wang L, Rantalainen M, et al. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics. 2016;32(14):2128–35.

Chen L, Zheng S. BCseq: accurate single cell RNA-seq quantification with bias correction. Nucleic Acids Res. 2018;46(14):e82.

Soneson C, Robinson MD. Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods. 2018;15(4):255–61.

Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4(5):P3.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50.

Wang X, Cairns MJ. SeqGSEA: a Bioconductor package for gene set enrichment analysis of RNA-Seq data integrating differential expression and splicing. Bioinformatics. 2014;30(12):1777–9.

Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1(6):417–25.

Hanzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinform. 2013;14:7.

Jin Y, Wang Z, He D, Zhu Y, Chen X, Cao K. Identification of novel subtypes based on ssGSEA in immune-related prognostic signature for tongue squamous cell carcinoma. Cancer Med. 2021;10(23):8693–707.

Zhang Y, Ma Y, Huang Y, Zhang Y, Jiang Q, Zhou M, et al. Benchmarking algorithms for pathway activity transformation of single-cell RNA-seq data. Comput Struct Biotechnol J. 2020;18:2953–61.

Detomaso D, Jones MG, Subramaniam M, Ashuach T, Ye CJ, Yosef N. Functional interpretation of single cell similarity maps. Nat Commun. 2019;10(1):4376.

Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, et al. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016;13(3):241–4.

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14(11):1083–6.

Pont F, Tosolini M, Fournie JJ. Single-Cell Signature Explorer for comprehensive visualization of single cell signatures across scRNA-seq datasets. Nucleic Acids Res. 2019;47(21):e133.

Noureen N, Ye Z, Chen Y, Wang X, Zheng S. Signature-scoring methods developed for bulk samples are not adequate for cancer single-cell RNA sequencing data. Elife. 2022;11:e71994.

Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat Biotechnol. 2019;37(5):547–54.

Ding J, Sharon N, Bar-Joseph Z. Temporal modelling using single-cell transcriptomics. Nat Rev Genet. 2022;23(6):355–68.

Bergen V, Soldatov RA, Kharchenko PV, Theis FJ. RNA velocity-current challenges and future perspectives. Mol Syst Biol. 2021;17(8):e10282.

Schlitzer A, Sivakamasundari V, Chen J, Sumatoh HR, Schreuder J, Lum J, et al. Identification of cDC1- and cDC2-committed DC progenitors reveals early lineage priming at the common DC progenitor stage in the bone marrow. Nat Immunol. 2015;16(7):718–28.

Cannoodt R, Saelens W, Sichien D, Tavernier S, Janssens S, Guilliams M, et al. SCORPIUS improves trajectory inference and identifies novel modules in dendritic cell development. bioRxiv. 2016. https://doi.org/10.1101/079509

Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44(13):e117.

Bendall SC, Davis KL, El Amir AD, Tadmor MD, Simonds EF, Chen TJ, et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014;157(3):714–25.

Haghverdi L, Buttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods. 2016;13(10):845–8.

Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol. 2016;34(6):637–45.

Herman JS, Sagar D, Grun D. FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data. Nat Methods. 2018;15(5):379–86.

Velten L, Haas SF, Raffel S, Blaszkiewicz S, Islam S, Hennig BP, et al. Human haematopoietic stem cell lineage commitment is a continuous process. Nat Cell Biol. 2017;19(4):271–81.

Campbell KR, Yau C. Probabilistic modeling of bifurcations in single-cell gene expression data using a Bayesian mixture of factor analyzers. Wellcome Open Res. 2017;2:19.

Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19(1):477.

Gan Y, Guo C, Guo W, Xu G, Zou G. Entropy-based inference of transition states and cellular trajectory for single-cell transcriptomics. Brief Bioinform. 2022;23(4):bbac225.

Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502.

Wolf FA, Hamey FK, Plass M, Solana J, Dahlin JS, Gottgens B, et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 2019;20(1):59.

Welch JD, Hartemink AJ, Prins JF. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 2016;17(1):106.

Van den Berge K, Roux De Bézieux H, Street K, Saelens W, Cannoodt R, Saeys Y, et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat Commun. 2020;11(1):1201.

Song D, Li JJ. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated P-values from single-cell RNA sequencing data. Genome Biol. 2021;22(1):124.

La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, et al. RNA velocity of single cells. Nature. 2018;560(7719):494–8.

Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol. 2020;38(12):1408–14.

Lange M, Bergen V, Klein M, Setty M, Reuter B, Bakhti M, et al. CellRank for directed single-cell fate mapping. Nat Methods. 2022;19(2):159–70.

Zhang Z, Zhang X. Inference of high-resolution trajectories in single-cell RNA-seq data by using RNA velocity. Cell Rep Methods. 2021;1(6):100095.

Dimitrov D, Türei D, Garrido-Rodriguez M, Burmedi PL, Nagai JS, Boys C, et al. Comparison of methods and resources for cell–cell communication inference from single-cell RNA-seq data. Nat Commun. 2022;13(1):3224.

Efremova M, Vento-Tormo M, Teichmann SA, Vento-Tormo R. CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand-receptor complexes. Nat Protoc. 2020;15(4):1484–506.

Noël F, Massenet-Regad L, Carmi-Levy I, Cappuccio A, Grandclaudon M, Trichot C, et al. Dissection of intercellular communication using the transcriptome-based framework ICELLNET. Nat Commun. 2021;12(1):1089.

Shao X, Liao J, Li C, Lu X, Cheng J, Fan X. CellTalkDB: a manually curated database of ligand-receptor interactions in humans and mice. Brief Bioinform. 2021;22(4):bbaa269.

Cabello-Aguilar S, Alame M, Kon-Sun-Tack F, Fau C, Lacroix M, Colinge J. SingleCellSignalR: inference of intercellular networks from single-cell transcriptomics. Nucleic Acids Res. 2020;48(10):e55.

Turei D, Valdeolivas A, Gul L, Palacio-Escat N, Klein M, Ivanova O, et al. Integrated intra- and intercellular signaling knowledge for multicellular omics analysis. Mol Syst Biol. 2021;17(3):e9923.

Garcia-Alonso L, Lorenzi V, Mazzeo CI, Alves-Lopes JP, Roberts K, Sancho-Serra C, et al. Single-cell roadmap of human gonadal development. Nature. 2022;607(7919):540–7.

Peng L, Wang F, Wang Z, Tan J, Huang L, Tian X, et al. Cell–cell communication inference and analysis in the tumour microenvironments from single-cell transcriptomics: data resources and computational strategies. Brief Bioinform. 2022;23(4):bbac234.

Armingol E, Officer A, Harismendy O, Lewis NE. Deciphering cell–cell interactions and communication from gene expression. Nat Rev Genet. 2021;22(2):71–88.

Camp JG, Sekine K, Gerber T, Loeffler-Wirth H, Binder H, Gac M, et al. Multilineage communication regulates human liver bud development from pluripotency. Nature. 2017;546(7659):533–8.

Browaeys R, Saelens W, Saeys Y. NicheNet: modeling intercellular communication by linking ligands to target genes. Nat Methods. 2020;17(2):159–62.

Choi H, Sheng J, Gao D, Li F, Durrans A, Ryu S, et al. Transcriptome analysis of individual stromal cell populations identifies stroma-tumor crosstalk in mouse lung cancer model. Cell Rep. 2015;10(7):1187–201.

Jakobsson JET, Spjuth O, Lagerström MC. scConnect: a method for exploratory analysis of cell–cell communication based on single cell RNA sequencing data. Bioinformatics. 2021;37(20):3501–8.

Hou R, Denisenko E, Ong HT, Ramilowski JA, Forrest ARR. Predicting cell-to-cell communication networks using NATMI. Nat Commun. 2020;11(1):5011.

Lamurias A, Ruas P, Couto FM. PPR-SSM: personalized PageRank and semantic similarity measures for entity linking. BMC Bioinform. 2019;20(1):534.

Wang S, Karikomi M, Maclean AL, Nie Q. Cell lineage and communication network inference via optimization for single-cell transcriptomics. Nucleic Acids Res. 2019;47(11):e66.

Tyler SR, Rotti PG, Sun X, Yi Y, Xie W, Winter MC, et al. PyMINEr finds gene and autocrine-paracrine networks from human islet scRNA-seq. Cell Rep. 2019;26(7):1951-64.e8.

Lee HO, Hong Y, Etlioglu HE, Cho YB, Pomella V, Van den Bosch B, et al. Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nat Genet. 2020;52(6):594–603.

Zhou JX, Taramelli R, Pedrini E, Knijnenburg T, Huang S. Extracting intercellular signaling network of cancer tissues using ligand-receptor expression patterns from whole-tumor and single-cell transcriptomes. Sci Rep. 2017;7(1):8815.

Kumar MP, Du J, Lagoudas G, Jiao Y, Sawyer A, Drummond DC, et al. Analysis of single-cell RNA-seq identifies cell–cell communication associated with tumor characteristics. Cell Rep. 2018;25(6):1458–68e4.

Cillo AR, Kürten CHL, Tabib T, Qi Z, Onkar S, Wang T, et al. Immune landscape of viral- and carcinogen-driven head and neck cancer. Immunity. 2020;52(1):183-99.e9.

Palla G, Spitzer H, Klein M, Fischer D, Schaar AC, Kuemmerle LB, et al. Squidpy: a scalable framework for spatial omics analysis. Nat Methods. 2022;19(2):171–8.

Schapiro D, Jackson HW, Raghuraman S, Fischer JR, Zanotelli VRT, Schulz D, et al. histoCAT: analysis of cell phenotypes and interactions in multiplex image cytometry data. Nat Methods. 2017;14(9):873–6.

Jin S, Guerrero-Juarez CF, Zhang L, Chang I, Ramos R, Kuan CH, et al. Inference and analysis of cell–cell communication using Cell Chat. Nat Commun. 2021;12(1):1088.

Lambert SA, Jolma A, Campitelli LF, Das PK, Yin Y, Albu M, et al. The human transcription factors. Cell. 2018;172(4):650–65.

Hu H, Miao YR, Jia LH, Yu QY, Zhang Q, Guo AY. AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors. Nucleic Acids Res. 2019;47(D1):D33-8.

Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48(D1):D87–92.

Han H, Cho JW, Lee S, Yun A, Kim H, Bae D, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018;46(D1):D380–6.

Feng C, Song C, Liu Y, Qian F, Gao Y, Ning Z, et al. KnockTF: a comprehensive human gene expression profile database with knockdown/knockout of transcription factors. Nucleic Acids Res. 2020;48(D1):D93–100.

Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, et al. Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res. 2017;45(D1):D658–62.

Elmentaite R, Ross ADB, Roberts K, James KR, Ortmann D, Gomes T, et al. Single-cell sequencing of developing human gut reveals transcriptional links to childhood Crohn’s disease. Dev Cell. 2020;55(6):771-83.e5.

Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 2008;9:559.

Kazer SW, Aicher TP, Muema DM, Carroll SL, Ordovas-Montanes J, Miao VN, et al. Integrated single-cell analysis of multicellular immune dynamics during hyperacute HIV-1 infection. Nat Med. 2020;26(4):511–8.

Liao M, Liu Y, Yuan J, Wen Y, Xu G, Zhao J, et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat Med. 2020;26(6):842–4.

Cheng S, Li Z, Gao R, Xing B, Gao Y, Yang Y, et al. A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells. Cell. 2021;184(3):792-809.e23.

Matsumoto H, Kiryu H, Furusawa C, Ko MSH, Ko SBH, Gouda N, et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics. 2017;33(15):2314–21.

Papili Gao N, Ud-Dean SMM, Gandrillon O, Gunawan R. SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics. 2018;34(2):258–66.

Luo Q, Yu Y, Lan X. SIGNET: single-cell RNA-seq-based gene regulatory network prediction using multiple-layer perceptron bagging. Brief Bioinform. 2022;23(1):bbab547.

Chen J, Cheong C, Lan L, Zhou X, Liu J, Lyu A, et al. DeepDRIM: a deep neural network to reconstruct cell-type-specific gene regulatory network using single-cell RNA-seq data. Brief Bioinform. 2021;22(6):bbab325.

Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat Methods. 2020;17(2):147–54.

Chen S, Mar JC. Evaluating methods of inferring gene regulatory networks highlights their lack of performance for single cell gene expression data. BMC Bioinform. 2018;19(1):232.

Alghamdi N, Chang W, Dang P, Lu X, Wan C, Gampala S, et al. A graph neural network model to estimate cell-wise metabolic flux using single-cell RNA-seq data. Genome Res. 2021;31(10):1867–84.

Artyomov MN, Van den Bossche J. Immunometabolism in the single-cell era. Cell Metab. 2020;32(5):710–25.

Gubin MM, Esaulova E, Ward JP, Malkova ON, Runci D, Wong P, et al. High-dimensional analysis delineates myeloid and lymphoid compartment remodeling during successful immune-checkpoint cancer therapy. Cell. 2018;175(4):1014-30.e19.

Ariss MM, Islam ABMMK, Critcher M, Zappia MP, Frolov MV. Single cell RNA-sequencing identifies a metabolic aspect of apoptosis in Rbf mutant. Nat Commun. 2018;9(1):5024.

Wu Y, Yang S, Ma J, Chen Z, Song G, Rao D, et al. Spatiotemporal immune landscape of colorectal cancer liver metastasis at single-cell level. Cancer Discov. 2022;12(1):134–53.

Raman K, Chandra N. Flux balance analysis of biological systems: applications and challenges. Brief Bioinform. 2009;10(4):435–49.

Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 2021;49(D1):D545–51.

Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020;48(D1):D498–503.

Orth JD, Thiele I, Palsson BØ. What is flux balance analysis? Nat Biotechnol. 2010;28(3):245–8.

Damiani C, Maspero D, Di Filippo M, Colombo R, Pescini D, Graudenzi A, et al. Integration of single-cell RNA-seq data into population models to characterize cancer metabolism. PLoS Comput Biol. 2019;15(2):e1006733.

Wagner A, Wang C, Fessler J, Detomaso D, Avila-Pacheco J, Kaminski J, et al. Metabolic modeling of single Th17 cells reveals regulators of autoimmunity. Cell. 2021;184(16):4168-85.e21.

Thiele I, Swainston N, Fleming RMT, Hoppe A, Sahoo S, Aurich MK, et al. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013;31(5):419–25.

Pei W, Shang F, Wang X, Fanti AK, Greco A, Busch K, et al. Resolving fates and single-cell transcriptomes of hematopoietic stem cell clones by polyloxexpress barcoding. Cell Stem Cell. 2020;27(3):383-95.e388.

Basharat Z, Majeed S, Saleem H, Khan IA, Yasmin A. An overview of algorithms and associated applications for single cell RNA-Seq data imputation. Curr Genomics. 2021;22(5):319–27.

Hou W, Ji Z, Ji H, Hicks SC. A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 2020;21(1):218.

Van Dijk D, Sharma R, Nainys J, Yim K, Kathail P, Carr AJ, et al. Recovering gene interactions from single-cell data using data diffusion. Cell. 2018;174(3):716–29.e27.

Wu X, Liu T, Ye C, Ye W, Ji G. scAPAtrap: identification and quantification of alternative polyadenylation sites from single-cell RNA-seq data. Brief Bioinform. 2021;22(4):bbaa273.

Patrick R, Humphreys DT, Janbandhu V, Oshlack A, Ho JWK, Harvey RP, et al. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data. Genome Biol. 2020;21(1):167.

Gao Y, Li L, Amos CI, Li W. Analysis of alternative polyadenylation from single-cell RNA-seq using scDaPars reveals cell subpopulations invisible to gene expression. Genome Res. 2021;31(10):1856–66.

Li GW, Nan F, Yuan GH, Liu CX, Liu X, Chen LL, et al. SCAPTURE: a deep learning-embedded pipeline that captures polyadenylation information from 3’tag-based RNA-seq of single cells. Genome Biol. 2021;22(1):221.

Zhou R, Xiao X, He P, Zhao Y, Xu M, Zheng X, et al. SCAPE: a mixture model revealing single-cell polyadenylation diversity and cellular dynamics during cell differentiation and reprogramming. Nucleic Acids Res. 2022;50(11):e66.

Wang X, Hou J, Quedenau C, Chen W. Pervasive isoform-specific translational regulation via alternative transcription start sites in mammals. Mol Syst Biol. 2016;12(7):875.

He Y, Chen Q, Zhang J, Yu J, Xia M, Wang X. Pervasive 3'-UTR isoform switches during mouse oocyte maturation. Front Mol Biosci. 2021;8:727614.

Philpott M, Watson J, Thakurta A, Brown T Jr, Brown T Sr, Oppermann U, et al. Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat Biotechnol. 2021;39(12):1517–20.

Tian L, Jabbari JS, Thijssen R, Gouil Q, Amarasinghe SL, Voogd O, et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 2021;22(1):310.

Rebboah E, Reese F, Williams K, Balderrama-Gutierrez G, McGill C, Trout D, et al. Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome Biol. 2021;22(1):286.

Li J, Pan T, Chen L, Wang Q, Chang Z, Zhou W, et al. Alternative splicing perturbation landscape identifies RNA binding proteins as potential therapeutic targets in cancer. Mol Ther Nucleic Acids. 2021;24:792–806.

Huang HY, Lin YC, Li J, Huang KY, Shrestha S, Hong HC, et al. miRTarBase 2020: updates to the experimentally validated microRNA-target interaction database. Nucleic Acids Res. 2020;48(D1):D148–54.

Jiang T, Zhou W, Chang Z, Zou H, Bai J, Sun Q, et al. ImmReg: the regulon atlas of immune-related pathways across cancer types. Nucleic Acids Res. 2021;49(21):12106–18.

Chen W, Guillaume-Gentil O, Rainer PY, Gabelein CG, Saelens W, Gardeux V, et al. Live-seq enables temporal transcriptomic recording of single cells. Nature. 2022;608(7924):733–40.

Zhang K, Hocker JD, Miller M, Hou X, Chiou J, Poirion OB, et al. A single-cell atlas of chromatin accessibility in the human genome. Cell. 2021;184(24):5985-6001.e19.

Karemaker ID, Vermeulen M. Single-cell DNA methylation profiling: technologies and biological applications. Trends Biotechnol. 2018;36(9):952–65.

Zhang R, Zhou T, Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol. 2022;40(2):254–61.

Lee J, Hyeon DY, Hwang D. Single-cell multiomics: technologies and data analysis methods. Exp Mol Med. 2020;52(9):1428–42.

Long Z, Sun C, Tang M, Wang Y, Ma J, Yu J, et al. Single-cell multiomics analysis reveals regulatory programs in clear cell renal cell carcinoma. Cell Discov. 2022;8(1):68.

Marx V. Method of the year: spatially resolved transcriptomics. Nat Methods. 2021;18(1):9–14.

Download references

Acknowledgements

Not applicable.

This work was supported by the National Key Research and Development Program of China (2022YFC2702502), the National Natural Science Foundation of China (32170742, 31970646, and 32060152), the Start Fund for Specially Appointed Professor of Jiangsu Province, Hainan Province Science and Technology Special Fund (ZDYF2021SHFZ051), Hainan Provincial Natural Science Foundation of China (820MS053), the Start Fund for High-level Talents of Nanjing Medical University (NMUR2020009), Marshal Initiative Funding of Hainan Medical University (JBGS202103), Hainan Province Clinical Medical Center (QWYH202175), Bioinformatics for Major Diseases Science Innovation Group of Hainan Medical University, and Shenzhen Science and Technology Program, China (JCYJ20210324140407021).

Author information

Min Su, Tao Pan, Qiu-Zhen Chen and Wei-Wei Zhou contributed equally to this work

Authors and Affiliations

State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166, China

Min Su, Qiu-Zhen Chen, Yi Gong, Huan-Yu Yan, Qiao-Zhen Shi & Xi Wang

College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, Hainan, China

Tao Pan, Gang Xu, Si Li, Ya Zhang & Yong-Sheng Li

College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, Heilongjiang, China

Wei-Wei Zhou & Xia Li

Department of Immunology, Nanjing Medical University, Nanjing, 211166, China

Department of Laboratory Medicine, Women and Children’s Hospital of Chongqing Medical University, Chongqing, 401174, China

Baylor College of Medicine, Houston, TX, 77030, USA

Chun-Jie Jiang

Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110, Guangdong, China

Shi-Cai Fan

School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW, 2308, Australia

Murray J. Cairns

Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW, 2305, Australia

Contributions

XW and YSL conceived the project. MS constructed the online repository of software and scripts. TP, QZC, WWZ, YG, GX, QZS, SL, HYY, YZ, and XW collected the literature and drafted the manuscript. MJC, XH, CJJ, SCF, and XL commented on the manuscript. MS, MJC, YSL, and XW revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xia Li , Murray J. Cairns , Xi Wang or Yong-Sheng Li .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Supplementary Information

Additional file 1: Table S1. Tools for analyzing single-cell RNA-seq data, with references and links.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Su, M., Pan, T., Chen, QZ. et al. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Military Med Res 9 , 68 (2022). https://doi.org/10.1186/s40779-022-00434-8

Received : 27 September 2022

Accepted : 18 November 2022

Published : 02 December 2022

DOI : https://doi.org/10.1186/s40779-022-00434-8

  • Single-cell RNA-sequencing (scRNA-seq)
  • Data analysis
  • Biomedical research
  • Clinical applications

Military Medical Research

ISSN: 2054-9369

OutsourcingPharma

  • News & Analysis on Clinical Trial Services & Contract Research And Development

eClinical innovations streamline clinical trial processes: Insights from Dawn Kaminski

13-Mar-2024 - Last updated on 13-Mar-2024 at 17:05 GMT

eClinical empowering organizations to be more efficient and effective

In the fast-evolving landscape of clinical trials, technological advancements play a vital role in streamlining processes, enhancing efficiency, and ultimately improving patient outcomes. During SCOPE 2024 in Orlando, Dawn Kaminski, vice president of business development operations at eClinical, shared valuable insights into the company's latest innovations and their impact on the industry.

Kaminski, who has over 25 years of experience in the field, outlined eClinical's commitment to revolutionizing clinical trial operations through its platform, the eClinical Data Cloud. In this exclusive interview with Outsourcing Pharma’s Liza Laws, she highlighted the platform's role in accelerating processes and ensuring data integrity across trials.

“At eClinical, we are dedicated to leveraging technology to drive efficiencies in clinical trial management,” Kaminski stated. “Our platform, the eClinical Data Cloud, offers a comprehensive solution to optimize various aspects of trial conduct, from data collection to analysis.”

Kaminski emphasized the platform's ability to reduce cycle times significantly, citing examples such as database lock within two weeks, a process that traditionally took six to eight weeks. This acceleration is attributed to eClinical's continuous data-cleaning approach, which contrasts with the batch-cleaning method used by many in the industry.

In addition to speeding up processes, eClinical's platform integrates cutting-edge technologies like artificial intelligence (AI) to enhance data review and anomaly detection. Kaminski highlighted the platform's AI models, which map to various data review objectives, allowing for targeted analysis and risk mitigation.

“Our AI-powered anomaly detection capabilities enable early identification of potential issues, enabling proactive intervention,” Kaminski noted. “This proactive approach is essential for maintaining data integrity and minimizing risks throughout the trial lifecycle.”

Another key feature of eClinical's platform is its risk-based quality management software, designed to mitigate risks as trials progress. Kaminski underscored the importance of proactive risk management in ensuring trial success and avoiding surprises at trial endpoints.

“By using our risk-based quality management software, sponsors can identify and address potential risks in real-time, minimizing the likelihood of delays or deviations,” Kaminski said. “This proactive risk mitigation strategy is integral to maintaining trial integrity and achieving regulatory compliance.”

Reflecting on industry trends, Kaminski acknowledged the growing complexity of clinical trial data, driven by factors such as the proliferation of digital health technologies and the increasing volume of real-world data. She emphasized the importance of embracing these data sources effectively to derive meaningful insights and drive decision-making.

“The clinical trial landscape is evolving rapidly, with advancements in digital health technologies and the availability of real-world data,” Kaminski noted. “To stay ahead, organizations must adapt their processes and leverage technology to manage and analyze diverse data sources effectively.”

Looking ahead, Kaminski expressed optimism about the future of clinical trials, fueled by continued innovation and collaboration within the industry. She highlighted eClinical's commitment to staying at the forefront of these developments, with a focus on delivering solutions that address evolving industry needs.

“As the clinical trial landscape continues to evolve, we remain committed to driving innovation and delivering solutions that empower organizations to conduct trials more efficiently and effectively,” Kaminski affirmed. “By harnessing the power of technology and embracing collaboration, we can overcome challenges and drive positive change in the industry.”

In conclusion, Dawn Kaminski's insights shed light on eClinical's ongoing efforts to revolutionize clinical trial operations through innovative technologies and proactive risk management strategies. As the industry embraces digital transformation, platforms like the eClinical Data Cloud are poised to play a crucial role in shaping the future of clinical research. 

Network analysis of unstructured EHR data for clinical research

Affiliation.

  • 1 Stanford University, Stanford, CA, USA.
  • PMID: 24303229
  • PMCID: PMC3845760

In biomedical research, network analysis provides a conceptual framework for interpreting data from high-throughput experiments. For example, protein-protein interaction networks have been successfully used to identify candidate disease genes. Recently, advances in clinical text processing and the increasing availability of clinical data have enabled analogous analyses on data from electronic medical records. We constructed networks of diseases, drugs, medical devices and procedures using concepts recognized in clinical notes from the Stanford clinical data warehouse. We demonstrate the use of the resulting networks for clinical research informatics in two ways (cohort construction and outcomes analysis), using the safety of cilostazol in peripheral artery disease patients as a use case. By comparing with standard methods, we show that network-based approaches can be used both for constructing patient cohorts and for analyzing differences in outcomes, and we discuss the advantages these approaches offer.
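As a rough illustration of how a concept network might be built from clinical notes, the sketch below counts how often pairs of concepts appear in the same note. It is not the authors' method: the abstract does not describe how edges were defined, and the notes, concept names, and the `build_cooccurrence_network` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(notes):
    """Count how often each pair of concepts appears in the same note.

    `notes` is an iterable of sets of concept names (hypothetical input;
    a real pipeline would first recognize concepts in free text).
    Returns a Counter keyed by alphabetically ordered concept pairs.
    """
    edges = Counter()
    for concepts in notes:
        # Sort so that each undirected edge has one canonical key.
        for pair in combinations(sorted(concepts), 2):
            edges[pair] += 1
    return edges


# Hypothetical concept sets, one per clinical note.
notes = [
    {"peripheral artery disease", "cilostazol"},
    {"peripheral artery disease", "cilostazol", "stent"},
    {"stent", "angioplasty"},
]
edges = build_cooccurrence_network(notes)
```

In this toy example the edge weight between a drug and a disease is simply the number of notes mentioning both; a real analysis would need to decide on normalization, negation handling, and patient-level aggregation.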

Grants and funding

  • U54 HG004028/HG/NHGRI NIH HHS/United States
  • Open access
  • Published: 23 November 2023

Impact of frailty on the outcomes of patients undergoing degenerative spine surgery: a systematic review and meta-analysis

  • Wonhee Baek 1 ,
  • Sun-Young Park 2 &
  • Yoonjoo Kim 3  

BMC Geriatrics volume  23 , Article number:  771 ( 2023 ) Cite this article

Degenerative spinal diseases are common in older adults with concurrent frailty. Preoperative frailty is a strong predictor of adverse clinical outcomes after surgery. This study aimed to investigate the association between health-related outcomes and frailty in patients undergoing spine surgery for degenerative spine diseases.

A systematic review and meta-analysis were performed by electronically searching Ovid-MEDLINE, Ovid-Embase, Cochrane Library, and CINAHL for eligible studies until July 16, 2022. We reviewed all studies that examined the association between preoperative frailty and related outcomes after spine surgery, excluding studies of spinal tumours, studies of non-surgical procedures, and experimental studies. A total of 1,075 articles were identified in the initial search and were independently reviewed by two reviewers. Data were subjected to qualitative and quantitative syntheses by meta-analytic methods.

Thirty-eight articles on 474,651 patients who underwent degenerative spine surgeries were included and 17 papers were quantitatively synthesized. The health-related outcomes were divided into clinical outcomes and patient-reported outcomes; clinical outcomes were further divided into postoperative complications and supportive management procedures. Compared to the non-frail group, the frail group had a significantly greater risk of mortality, major complications, acute renal failure, myocardial infarction, non-home discharge, and reintubation, and a longer length of hospital stay. Regarding patient-reported outcomes, changes between preoperative and postoperative Oswestry Disability Index scores were not associated with preoperative frailty.

Conclusions

In degenerative spinal diseases, frailty is a strong predictor of adverse clinical outcomes after spine surgery. The relationship between preoperative frailty and patient-reported outcomes is still inconclusive. Further research is needed to consolidate the evidence from patient-reported outcomes.

With the rising incidence of degenerative spinal diseases and advances in medical technology [ 1 , 2 ], the number of older adults undergoing spine surgery has increased [ 3 , 4 ]. Accordingly, difficulties encountered during spine surgeries have also increased [ 4 , 5 ]. Because the outcomes of patients undergoing spine surgery are affected by their preoperative characteristics [ 6 , 7 , 8 ], it is imperative to gain insights into factors that may impact postoperative outcomes in this population, including frailty. Frailty is defined as a multidimensional state of loss of physical, cognitive, social, and psychological functioning [ 9 ]. Frailty increases with age; however, compared to chronological age, frailty status can better predict complications and mortality following spine surgery [ 10 ]. Most patients undergoing spine surgeries are prefrail or frail [ 7 , 11 ], conditions that are often associated with preoperative pain, spinal deformity, and reduced ability to perform activities of daily living. After spine surgery, the incidence of postoperative complications and non-home discharge, the length of hospital stay, and mortality rates are higher among patients with preoperative frailty than among those without [ 7 , 12 ]. Therefore, preoperative risk stratification of frailty is helpful for predicting postoperative deterioration; this in turn can help prevent the worsening of outcomes after spine surgery [ 9 ].

Patients with frailty who have undergone spine surgery do not experience the same level of benefit in terms of clinical outcomes (COs) as those who are not frail [ 13 , 14 ]. Even then, such patients often opt for spine surgery to alleviate pain and improve function rather than for survival (unlike patients who opt for cancer surgery) [ 15 ]. Therefore, providing patients with information on the benefits of patient-reported outcomes (PROs) after spine surgery can help them make informed decisions and receive more patient-centred care. With the increased emphasis on the importance of PROs, research has increasingly focused on how PROs in frail patients have changed following spine surgery [ 13 , 16 ]. However, there is a lack of understanding of the benefits and expected types of PROs in spine surgery. Therefore, a systematic literature review and meta-analysis of the relationship between preoperative frailty and the postoperative outcomes of surgery for patients with degenerative spinal disease is necessary.

A 2021 systematic review and meta-analysis of 32 studies on preoperative frailty and outcomes of spine surgery revealed that frailty was associated with increased adverse events, mortality, length of hospital stay, readmission, reoperation, non-home discharge, intensive care unit stay, and PROs following a spine surgery [ 17 ]. However, this review had the following limitations: studies on simple procedures such as kyphoplasty were included in the review; therefore, the risk of bias regarding non-surgical procedures could not be ruled out. Furthermore, because disease pathogenesis and progression differ between patients with spinal neoplasms and metastases and those with degenerative spine disease, both cohorts must be analysed separately. However, the study mentioned above included both patients with spinal neoplasms and those with degenerative spinal diseases. Moreover, interpretation of the findings of the meta-analysis was limited because the postoperative adverse events were not differentiated in detail, a synthesis of evidence on the patient-reported outcomes was not performed, and the method for the meta-analysis was not described clearly [ 17 , 18 , 19 ].

Two parameters help to identify frailty status. These include the frailty phenotype [ 20 ] and the frailty index (FI) [ 21 ]. Regarding the frailty phenotype, frailty is determined by the following symptoms: unintentional weight loss, self-reported exhaustion, weakness, slow walking speed, and low physical activity [ 20 ]. The FI is obtained by dividing the sum of a patient’s deficits by the total sum of frailty-related deficits. It has two types, namely adult spinal deformity (ASD)-FI [ 13 ] and cervical deformity (CD)-FI [ 22 ]. Recently, modified FI (mFI) has also been used for determining frailty [ 23 ]; each clinical institution has developed and used a different frailty tool [ 24 ]. Determining the risk stratification of frailty before spine surgery helps determine the prognosis and treatment of patients. Thus, we aimed to explore the following: (1) tools used to measure the frailty of patients prior to surgery for degenerative spine disease, (2) types of frailty-related health-related outcomes following spine surgery, and (3) association between preoperative frailty and health-related outcomes.
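The FI definition given above (deficits present divided by deficits assessed) is a simple ratio. A minimal Python sketch, assuming each deficit is coded as present or absent; the function name and the five deficit labels are hypothetical and are not taken from any of the instruments named in this review:

```python
def frailty_index(deficits: dict[str, bool]) -> float:
    """Return the share of assessed deficits that are present (0.0 to 1.0)."""
    if not deficits:
        raise ValueError("at least one deficit must be assessed")
    # True counts as 1 and False as 0, so sum() counts the deficits present.
    return sum(deficits.values()) / len(deficits)


# Hypothetical five-item assessment; item names are illustrative only.
patient = {
    "diabetes": True,
    "hypertension": True,
    "heart_failure": False,
    "copd": False,
    "dependent_functional_status": False,
}
fi = frailty_index(patient)  # 2 of 5 deficits present
```

The cut-off that separates frail from non-frail patients differs between tools and studies, so it is deliberately left out of this sketch.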

We followed the recommendations of the Cochrane Handbook to confirm the outcome of frailty [ 25 ]. The final protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO; registration number: CRD42021286341).

Search strategy

Electronic bibliographic databases, including Ovid-MEDLINE, Ovid-EMBASE, Cochrane Library (Cochrane Database of Systematic Reviews), and CINAHL (Cumulative Index to Nursing and Allied Health Literature), were searched for relevant articles. The search terms were “spine,” “frailty,” “postoperative,” and “outcome,” and the Boolean operators OR and AND were used to combine them. The search was completed on July 16, 2022. The search strategies for each database are presented in Supplementary Material Table 1.
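As an illustration of how search terms are typically combined with Boolean operators, the following Python sketch joins synonyms within a concept with OR and joins the concepts with AND. The synonyms marked below are hypothetical; the strategies actually used are those in Supplementary Material Table 1, and real database syntax (field tags, truncation) is not modelled here.

```python
def build_query(concepts: list[list[str]]) -> str:
    """OR the terms within each concept, then AND the concepts together."""
    groups = ["(" + " OR ".join(terms) + ")" for terms in concepts]
    return " AND ".join(groups)


concepts = [
    ["spine", "spinal"],    # "spinal" is a hypothetical synonym
    ["frailty", "frail"],   # "frail" is a hypothetical synonym
    ["postoperative"],
    ["outcome"],
]
query = build_query(concepts)
print(query)
```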

Eligibility criteria

The inclusion criteria were as follows: (1) articles on patients who underwent spine surgery; (2) articles on studies that compared health-related outcomes (COs and PROs) after spine surgery with respect to preoperative frailty status; (3) articles in English published in peer-reviewed journals; and (4) articles on prospective or retrospective cohort, case-control, and cross-sectional studies. The exclusion criteria were as follows: (1) reviews, case reports, and unpublished manuscripts; (2) articles on studies that included spinal tumours; (3) articles on experimental studies (interventions could confound the relationship between frailty and postoperative health-related outcomes); and (4) articles on studies that included non-surgical procedures. No restrictions were placed on the timing of publication.

Article selection and data extraction

Articles were first downloaded using reference management software (EndNote version 20, Clarivate Analytics, USA). Then, Rayyan was used to screen the downloaded articles and remove any duplicates [ 26 ]. Two authors (WB and YK) independently read the titles and abstracts of the remaining articles and selected those that met the eligibility criteria. Thereafter, the full texts of the selected articles were reviewed; any discrepancies in the selection process were resolved after discussion with another author (SP). Using a standardized data extraction form, the two aforementioned reviewers independently extracted the following data from the selected articles: first author’s name, year and country of publication, demographic and clinical characteristics of the study population, type of surgery, measurement tool and outcomes, and follow-up duration.

Risk of bias in individual studies

The Risk of Bias Assessment Tool for Nonrandomized Studies (RoBANS) was used to assess the quality of the included studies [ 27 ]. The RoBANS evaluates the risk of bias in the following six domains: participant selection, confounding variables, measurement of exposure, blinding of outcome assessments, incomplete outcome data, and selective outcome reporting. Each domain was assessed as having a “low risk of bias,” “unclear risk of bias,” or “high risk of bias.” The two aforementioned authors independently evaluated the methodological quality of the studies and later combined their findings.

Synthesis and statistical analysis

All data analyses were performed using R (version 4.0.3, R Foundation for Statistical Computing, Austria). We performed a qualitative synthesis to determine what tools were used to measure frailty in patients undergoing spine surgery and what indicators were used for frailty and health-related outcomes. Thereafter, quantitative synthesis was performed to confirm the direction and magnitude of the association between frailty and health-related outcomes.

We divided the postoperative health-related outcomes into COs and PROs. A meta-analysis was performed if the following conditions were met: (1) there were three or more papers that could be synthesized; (2) the participants could be divided into frail and non-frail groups; and (3) for COs, the terms used in each paper were identical. In addition, when the same participants were drawn from the same database in the same year, only the paper that was published first was selected.

The Mantel–Haenszel method was used to estimate the pooled odds ratio (OR) with the 95% confidence interval (CI) for dichotomous variables. The inverse variance method was used to estimate the pooled mean difference (MD) with the 95% CI for continuous variables. A fixed-effect model was used for homogeneous studies, while a random-effects model was used for heterogeneous studies [ 25 ]. The I² value was used to investigate the heterogeneity among the included studies; an I² value > 50% was considered indicative of substantial heterogeneity [ 28 ].
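The pooling steps described above can be sketched with the standard textbook formulas. The Python code below is an illustration only: the analyses in this study were run in R, the function names are hypothetical, and the input numbers are invented for the example. It shows the Mantel–Haenszel pooled OR, the fixed-effect inverse variance estimate with its 95% CI, and I² derived from Cochran's Q; the random-effects model is not shown.

```python
import math


def mantel_haenszel_or(tables):
    """Pooled odds ratio from 2x2 tables given as (a, b, c, d).

    a, b = events / non-events in the frail group;
    c, d = events / non-events in the non-frail group.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def inverse_variance_pool(effects, std_errors):
    """Fixed-effect pooled estimate, its 95% CI, and I^2 (in percent)."""
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se_pooled = math.sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    # Cochran's Q and the I^2 statistic derived from it.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, ci, i2


# Invented example data (not from the included studies).
pooled_or = mantel_haenszel_or([(10, 90, 5, 95), (20, 80, 10, 90)])
pooled_md, md_ci, i2 = inverse_variance_pool([2.0, 4.0], [1.0, 1.0])
```

With these invented inputs, I² comes out at exactly the 50% threshold and would not count as substantial heterogeneity, because the rule stated above requires I² > 50%.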

Because statistical tests for publication bias are recommended only when a meta-analysis includes at least 10 studies, no such tests were performed in our study. A sensitivity analysis was performed by excluding papers that were judged to increase heterogeneity and bias the effect size in the meta-analysis [ 25 ]. Statistical significance was defined as a p-value < 0.05.

Study selection

The study selection process is shown in Fig. 1. The initial search of the databases yielded 1,075 potentially relevant articles; one additional article was identified from other sources [ 29 ]. Among these, 732 articles remained after the removal of duplicates. After screening their titles and abstracts, 632 of these articles were excluded. The full texts of the remaining 100 articles were reviewed, and 62 articles were further excluded. The remaining 38 articles were finally included for quality evaluation and qualitative synthesis [ 7 , 10 , 11 , 12 , 13 , 14 , 16 , 22 , 23 , 24 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 ]. Among these, 17 were subjected to a quantitative synthesis for the meta-analysis [ 10 , 13 , 16 , 22 , 29 , 30 , 33 , 35 , 39 , 40 , 41 , 42 , 47 , 49 , 52 , 55 , 56 ].
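The counts reported above can be checked with simple arithmetic. The short Python sketch below reproduces the selection flow; the number of duplicates removed is derived here from the reported figures and is not stated in the text.

```python
identified = 1075 + 1       # database hits plus one record from other sources
after_dedup = 732           # records remaining after duplicate removal
duplicates_removed = identified - after_dedup  # derived, not reported

screened_out = 632          # excluded on title and abstract
full_text = after_dedup - screened_out

full_text_excluded = 62
included = full_text - full_text_excluded

print(duplicates_removed, full_text, included)
```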

Fig. 1 Preferred reporting items for systematic reviews and meta-analyses-based flowchart of the article screening and selection process

Study characteristics

The characteristics of the included studies are presented in Table 1. The studies involved patients from North America (n = 25) [ 7 , 10 , 11 , 12 , 13 , 14 , 22 , 23 , 24 , 29 , 31 , 32 , 37 , 40 , 41 , 42 , 43 , 44 , 45 , 47 , 48 , 49 , 51 , 52 , 53 , 56 ], Korea (n = 5) [ 30 , 33 , 34 , 35 , 36 ], China (n = 2) [ 16 , 50 ], Europe (n = 2) [ 38 , 46 ], and Japan (n = 2) [ 54 , 55 ]. One study included patients from Europe, Asia, and North America [ 39 ]. Overall, 34 retrospective cohort studies [ 7 , 10 , 13 , 14 , 16 , 22 , 23 , 24 , 29 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 ], 3 prospective cohort studies [ 11 , 30 , 39 , 48 ], and 1 mixed retrospective and prospective cohort study [ 12 ] were included. The articles were published between 2016 and 2022. Overall, the studies comprised 474,651 patients who underwent spine surgery (mean age: 56.6–78.3 years).

Risk of bias

Supplementary Material Fig. 1 summarizes the results of the assessments of the risk of bias in the included studies. The overall quality of the included studies was good. However, there were concerns regarding selection bias for six out of 38 studies [ 23 , 29 , 36 , 45 , 46 , 47 ]. These studies analysed multi-centre data and had a retrospective design, but did not report the confounding variables. Eleven studies [ 10 , 14 , 16 , 22 , 23 , 29 , 34 , 35 , 40 , 52 , 53 ] did not report the presence of incomplete outcome data, such as missing data or non-response rates. In more than 80% of the studies, five of the six evaluated domains were assessed as having a low risk of bias (attrition bias was excluded). No studies were excluded based on quality assessment.

Frailty measurements

The measurement tools for preoperative frailty included the mFI-11 (n = 15) [ 10 , 12 , 16 , 23 , 30 , 32 , 33 , 35 , 41 , 44 , 49 , 50 , 53 , 54 , 55 ], mFI-5 (n = 10) [ 7 , 30 , 31 , 34 , 44 , 45 , 52 , 53 , 55 , 56 ], ASD-FI (n = 6) [ 13 , 37 , 38 , 39 , 42 , 47 ], Hospital Frailty Risk Score (n = 2) [ 14 , 46 ], Johns Hopkins Adjusted Clinical Groups indicator (n = 2) [ 24 , 51 ], mCD-FI (n = 2) [ 29 , 43 ], frailty phenotype (n = 3) [ 11 , 36 , 48 ], CD-FI (n = 1) [ 22 ], comprehensive geriatric assessment (n = 1) [ 30 ], and mASD-FI (n = 1) [ 40 ]. In these studies, the patients were divided into non-frail, prefrail, frail, or severely frail groups or into the low frailty, medium frailty, and high frailty groups, according to their criteria.

Health-related outcomes after spine surgery

In the included studies, postoperative health-related outcomes were classified into COs and PROs (Table 1, Fig. 2, and Supplementary Material Table 2).

Fig. 2 Health-related outcomes in terms of preoperative frailty status. IADL, instrumental activities of daily living; EQ-5D, EuroQol-5D; JOA, Japanese orthopedic association scale; mJOA, modified Japanese orthopedic association scale; NDI, neck disability index; ODI, Oswestry disability index; NRS, numerical rating scale; PQRS, postoperative quality of recovery scale; ADL, activity of daily living; SF-36, 36-item short-form survey; SRS-22, Scoliosis Research Society 22-question; VAS, visual analog scale; QALY, quality-adjusted life years; ICU, intensive care unit

Clinical outcomes

All studies, except one [ 47 ], considered COs as postoperative health-related outcomes. The COs included postoperative complications and supportive management procedures.

In 35 studies, the postoperative complications were addressed as COs [ 7 , 10 , 11 , 12 , 14 , 16 , 22 , 23 , 24 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 ]. The postoperative complications were further divided into general and surgical complications. The general complications comprised anaemia; electrolyte abnormalities; cardiovascular, gastrointestinal, pulmonary, renal, and urinary complications; delirium; deep vein thrombosis; falls; and sepsis/septic shock. The surgical complications comprised dural tears, excessive bleeding, hematomas, instrumentation failure, neurological symptoms, positional and wound-related complications, pseudoarthrosis, pneumoperitoneum, and kyphosis. These complications were classified as minor or major or I–IV (Clavien–Dindo classification) [ 57 ]. In five studies [ 16 , 22 , 37 , 38 , 39 ], the definition provided by Glassman et al. was used to determine the major complications [ 58 , 59 ]. In 13 studies [ 10 , 12 , 23 , 24 , 29 , 32 , 35 , 41 , 44 , 49 , 50 , 52 , 53 ], mortality was considered a postoperative complication.

Supportive management procedures included transfusion for bleeding [ 10 , 41 , 46 , 52 ], admissions to intensive care units [ 14 , 22 ], length of hospital stay [ 11 , 12 , 14 , 22 , 24 , 29 , 31 , 32 , 33 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 45 , 50 , 53 , 56 ], length of bed rest [ 33 ], nonhome discharge [ 7 , 11 , 12 , 14 , 22 , 24 , 29 , 32 , 50 , 51 , 52 , 53 , 56 ], postoperative ventilator use [ 52 ], reintubation [ 35 , 52 , 56 ], readmission [ 7 , 14 , 24 , 31 , 44 , 50 , 52 , 53 , 56 ], reoperation [ 7 , 10 , 29 , 31 , 33 , 37 , 38 , 40 , 41 , 43 , 46 , 50 , 53 ], and emergency room visit [ 14 ].

Other COs included costs [ 13 , 14 , 24 , 51 ], frailty status [ 48 ], and radiographic imaging findings [ 13 , 16 , 43 , 54 ].

Patient-reported outcomes

Eleven studies assessed PROs [ 13 , 16 , 29 , 33 , 40 , 42 , 43 , 47 , 48 , 50 , 54 ]. The PROs were assessed using the instrumental activities of daily living [ 48 ], EuroQol-5D (EQ-5D) [ 13 , 29 , 40 ], Japanese Orthopaedic Association (JOA) score [ 16 ], modified JOA (mJOA) score [ 43 ], Neck Disability Index [ 29 , 43 ], Oswestry Disability Index (ODI) [ 13 , 16 , 33 , 40 , 42 , 47 , 50 , 54 ], numerical rating scale for pain [ 29 , 42 , 43 , 47 ], Postoperative Quality of Recovery Scale for cognitive recovery and activities of daily living [ 48 ], Pain Catastrophizing Scale [ 40 ], 36-Item Short Form Survey (SF-36) [ 47 , 50 , 54 ], Scoliosis Research Society 22-question [ 16 , 40 , 42 , 54 ], and visual analogue scale for pain [ 16 , 33 , 54 ].

Substantial clinical benefit was determined based on changes in the ODI, SF-36 score, and back and leg pain score after the surgery [ 33 , 47 ]. The quality-adjusted life years were determined using the EQ-5D [ 13 ].

Meta-analysis of the selected outcomes

Synthesis of meta-analysis results regarding the clinical outcomes

Results of the meta-analysis of the COs are presented in Table 2. A forest plot depicting significant associations between COs and frailty is shown in Fig. 3. The pooled estimates for the frail group relative to the non-frail group were as follows: mortality (OR = 2.5; 95% CI = 1.4–4.4) [ 10 , 35 , 52 ], major complication (OR = 2.8; 95% CI = 2.3–3.5) [ 39 , 42 , 49 , 56 ], any complication (OR = 2.1; 95% CI = 2.0–2.3) [ 10 , 29 , 35 , 39 , 40 , 42 , 52 , 55 , 56 ], general complication (OR = 1.6; 95% CI = 1.4–1.7) [ 22 , 30 , 52 ], acute renal failure (OR = 3.3; 95% CI = 1.8–6.1) [ 16 , 35 , 52 , 56 ], cardiac arrest (OR = 2.9; 95% CI = 1.7–5.0) [ 29 , 35 , 52 , 56 ], deep vein thrombosis (OR = 1.4; 95% CI = 1.0–2.0) [ 16 , 35 , 52 , 56 ], gastrointestinal complication (OR = 0.9; 95% CI = 0.4–1.9) [ 16 , 29 , 33 , 42 ], myocardial infarction (OR = 4.8; 95% CI = 3.3–7.0) [ 35 , 52 , 56 ], pneumonia (OR = 2.4; 95% CI = 1.4–4.1) [ 16 , 29 , 35 , 52 , 56 ], pulmonary embolism (OR = 1.5; 95% CI = 1.0–2.1) [ 35 , 52 , 56 ], sepsis (OR = 2.4; 95% CI = 1.7–3.2) [ 10 , 35 , 52 , 56 ], stroke/cerebrovascular accident (OR = 2.1; 95% CI = 0.5–8.5) [ 16 , 35 , 41 ], urinary tract infection (OR = 2.2; 95% CI = 1.1–4.6) [ 10 , 29 , 33 , 35 ], surgical complication (OR = 1.6; 95% CI = 1.4–1.9) [ 22 , 30 , 52 ], deep wound infection (OR = 1.8; 95% CI = 1.3–2.5) [ 16 , 29 , 52 , 56 ], implant-related complication (OR = 2.1; 95% CI = 1.4–3.2) [ 29 , 33 , 41 , 42 , 55 ], neurological complication (OR = 1.1; 95% CI = 0.6–1.7) [ 16 , 29 , 33 , 41 , 42 ], superficial surgical site infection (OR = 1.7; 95% CI = 1.3–2.2) [ 29 , 35 , 52 , 56 ], length of stay (MD = 3.1; 95% CI = 1.2–5.0) [ 13 , 16 , 24 , 33 , 37 , 38 , 51 ], non-home discharge (OR = 2.6; 95% CI = 2.1–3.2) [ 22 , 52 , 56 ], reintubation (OR = 3.4; 95% CI = 2.4–4.7) [ 35 , 52 , 56 ], and reoperation (OR = 1.0; 95% CI = 0.4–2.5) [ 10 , 29 , 33 , 52 ]. The 95% CIs for gastrointestinal complication, stroke/cerebrovascular accident, neurological complication, and reoperation included 1, indicating no significant association with frailty for these outcomes. The forest plot for each CO is presented in Supplementary Material Fig. 2.
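A pooled OR is conventionally read as statistically significant at the 5% level when its 95% CI excludes 1. The Python sketch below applies this rule to a subset of the pooled estimates reported above; the helper name is an illustrative choice, and outcomes whose rounded lower bound equals 1.0 are left out because rounding makes their status ambiguous.

```python
# (OR, lower 95% CI bound, upper 95% CI bound), copied from the text above.
pooled = {
    "mortality": (2.5, 1.4, 4.4),
    "myocardial infarction": (4.8, 3.3, 7.0),
    "gastrointestinal complication": (0.9, 0.4, 1.9),
    "stroke/cerebrovascular accident": (2.1, 0.5, 8.5),
    "neurological complication": (1.1, 0.6, 1.7),
    "reoperation": (1.0, 0.4, 2.5),
}


def ci_excludes_null(lower: float, upper: float, null: float = 1.0) -> bool:
    """True when the interval lies entirely above or entirely below the null."""
    return lower > null or upper < null


not_significant = sorted(
    name for name, (_, lower, upper) in pooled.items()
    if not ci_excludes_null(lower, upper)
)
print(not_significant)
```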

Fig. 3 Forest plots of the clinical outcomes that showed significant results in the meta-analysis. SSI, surgical site infection; OR, odds ratio; MD, mean difference; CI, confidence interval

The incidence rates of complications in the frail group and the robust group are presented in Supplementary Table 3. In the robust group, the five most prevalent complications, in descending order, were as follows: gastrointestinal complications (5.6%), urinary tract infection (4.6%), implant-related complications (1.5%), neurological complications (1.4%), and superficial surgical site infections (0.6%). In contrast, in the frail group, the five most prevalent complications, in descending order, were as follows: implant-related complications (21.5%), neurological complications (13.6%), urinary tract infections (9.3%), gastrointestinal complications (5.6%), and stroke/cerebrovascular accidents (2.1%).

Synthesis of meta-analysis results regarding the patient-reported outcomes

Results of the meta-analysis of the PROs are presented in Table 2. A forest plot for the PROs is shown in Supplementary Material Fig. 3. Changes in the ODI scores between pre- and post-surgery, categorized by frailty, were synthesized based on three papers [ 13 , 16 , 47 ]. The changes between pre- and post-operative ODI scores were not associated with preoperative frailty (MD = −9.6; 95% CI = −23 to 3.8).

Sensitivity analysis

A sensitivity analysis was performed for the relationship between any complication and frailty, the outcome with the highest number of synthesized papers. As shown in the forest plot for any complication (Supplementary Material Figs. 2 and 3), heterogeneity was judged to arise from the articles by Passias et al. [ 29 ] and Kim et al. [ 35 ]. When the meta-analysis was repeated with each of these two articles removed in turn, the I² value was reduced to 53% and 47%, respectively (Supplementary Material Fig. 4). Therefore, the meta-analysis was performed again after removing both papers (Supplementary Material Fig. 5). A fixed-effect model was selected because the I² value was reduced to 10%. The OR for any complication was 2.1 (95% CI = 2.0–2.3), which did not differ from the original OR of 2.1. These findings indicate that the results of this study are reliable.
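The leave-one-out idea behind this sensitivity analysis can be sketched as follows. This Python example is a hedged illustration, not the procedure run in R for this study: the study labels, log ORs, and standard errors are invented, and the function names are hypothetical. It recomputes I² after dropping each study in turn, so that a study driving the heterogeneity stands out.

```python
def i_squared(effects, std_errors):
    """I^2 (percent) from Cochran's Q under a fixed-effect model."""
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0


def leave_one_out(studies):
    """Map each study label to the I^2 obtained when that study is dropped."""
    result = {}
    for label in studies:
        rest = [value for key, value in studies.items() if key != label]
        result[label] = i_squared([e for e, _ in rest], [s for _, s in rest])
    return result


# Invented (log OR, standard error) pairs; study "D" is a deliberate outlier.
studies = {
    "A": (0.70, 0.10),
    "B": (0.75, 0.10),
    "C": (0.72, 0.10),
    "D": (1.60, 0.10),
}
loo = leave_one_out(studies)
most_influential = min(loo, key=loo.get)  # dropping it lowers I^2 the most
```

In this invented data set, dropping study "D" removes nearly all heterogeneity, which mirrors how an outlying study would be identified before the synthesis is repeated without it.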

This systematic review and meta-analysis examined the association between preoperative frailty and postoperative health-related outcomes in patients who underwent spine surgery for degenerative spinal disease. In the 38 included studies, 10 frailty instruments were used to measure preoperative frailty and two typologies of health-related outcomes for the preoperative frailty status were identified. Preoperative frailty was observed to be associated with postoperative adverse health-related outcomes. It increased the incidence of adverse COs, including mortality and complications, but there was no significant difference with respect to the improvement of the postoperative PROs.

Research on frailty has increased appreciably recently; this includes studies on preoperative frailty and its association with COs [ 15 , 60 ] or PROs [ 61 ] and studies on the construct validity of frailty instruments [ 62 ]. Previous studies conducted in surgical settings highlight the important role of frailty as a prognostic factor for considering surgery [ 15 , 60 , 61 , 63 ]. A systematic review and meta-analysis of 19 studies on patients undergoing cardiac surgery revealed that frailty was associated with a two-fold greater risk of mortality, greater complications, and five-fold greater risk of non-home discharge [ 60 ]. In another systematic review and meta-analysis of 71 studies on adult patients undergoing cancer surgery, frailty was found to be related to a three-fold, two-fold, and four-fold greater risk of 30-day mortality, postoperative complications, and long-term mortality, respectively [ 15 ]. Our findings corroborate and extend the existing evidence on the association of preoperative frailty with postoperative adverse COs.

Factors other than age should be considered when predicting postoperative recovery in patients with degenerative spinal diseases [ 17 , 20 ]. The prevalence of frailty is increasing among individuals undergoing spine surgeries. Analysis of a patient population that underwent spine surgery, using data from the American College of Surgeons National Surgical Quality Improvement Program database, revealed that the number of frail patients doubled from 2005 to 2016 [ 44 ]. This suggests that frailty is an important variable to consider for risk stratification when predicting postoperative recovery in patients with degenerative spinal disease [ 17 , 20 ]. The frailty score may serve as a preoperative screening tool to aid in decision-making and perioperative management. It can help monitor patients’ health, thereby allowing healthcare professionals to identify high-risk patients and develop better treatment strategies. It can also help guide discussions among healthcare professionals, patients, and family members to reduce surgical vulnerability, enable pre-habilitation to increase patient resilience, and customize perioperative care [ 64 , 65 ].

In our qualitative synthesis, clinical outcomes were identified as health-related outcomes in all but one study [ 47 ]. Postoperative complications can be divided into general and surgery-related complications. Supportive management strategies include blood transfusions and unplanned intubations; these represent additional supportive care provided to patients with problems that are not part of the normal recovery process.

Among the COs in this study, 19 items were synthesized for quantitative analysis, with 3–9 studies contributing to each synthesis. If there are fewer than 10 studies, statistical tests for publication bias (e.g., tests for funnel plot asymmetry) are not recommended [ 25 ]; thus, publication bias could not be assessed in this study. Therefore, items that showed heterogeneity, such as any complication, pneumonia, length of hospital stay, non-home discharge, and reoperation, should be interpreted carefully. For any complication, a sensitivity analysis was performed because the number of studies was relatively large and heterogeneity was noted across the studies. This analysis identified two studies as outliers [ 29 , 35 ], and the synthesis was repeated after excluding them. The re-analysis showed that heterogeneity decreased and the effect size was unchanged.

The meta-analysis of the clinical outcomes in this study revealed that the odds of mortality in the frail group were 2.5 times those in the non-frail group. Furthermore, the pooled estimates indicated higher odds of major complication, any complication, general complication, acute renal failure, cardiac arrest, deep vein thrombosis, myocardial infarction, pneumonia, pulmonary embolism, sepsis, stroke/cerebrovascular accident, surgical complication, deep wound infection, implant-related complication, superficial surgical site infection, non-home discharge, and reintubation, as well as a longer hospital stay, in the frail group than in the non-frail group. Notably, the order of complication prevalence differed between the robust and frail groups. In the robust group, the most common complications were relatively simple gastrointestinal complications, whereas in the frail group, relatively severe implant-related complications, which might necessitate reoperation, were the most common. The increased incidence or severity of complications in frail patients can be attributed to several factors. Frailty is linked to reduced immune function, which can compromise the ability to cope with complications such as infections during the stress of postoperative recovery [ 66 ]. Frailty is also associated with metabolic impairment, such as high levels of glucose and LDL cholesterol, which can impair tissue nutrient supply and metabolic functions [ 67 ], ultimately hindering postoperative recovery. Furthermore, frailty is associated with low physical activity levels and reduced muscle mass [ 66 , 68 ], which might persist after surgery and compromise recovery because of limited physical activity. Healthcare professionals who deliver postoperative care to frail patients should be aware of these complications, which can increase the time spent on direct nursing care and the cost of physical resources such as ICU, rehabilitation, and convalescent care beds [ 69 ].

Another key knowledge gap that thwarts a more meaningful prognosis is the lack of data on PROs. Studies have paid considerable attention to frailty as an important preoperative risk indicator for COs [ 15 , 61 ], whereas similar studies on PROs are few. Data on cognitive outcomes, functional outcomes, and quality of life are lacking. In our systematic review, only 11 of 38 studies reported the effects of frailty on the PROs (e.g., quality of life, ODI, and pain); the multidimensional health status of patients was reported in just six studies [ 13 , 29 , 40 , 47 , 50 , 54 ]. The wide variety of outcome measures limited the comparison of results among the included studies. The meta-analysis revealed that frailty was not significantly associated with the postoperative ODI or with perioperative changes in the ODI, in contrast to its association with adverse COs. Specifically, compared to non-frail patients, frail patients experienced greater improvements in ODI, quality of life, and pain [ 47 ]. Such improvements are partly explained by corrections in postural deformity, as frail patients have worse preoperative sagittal imbalances than non-frail patients [ 70 , 71 ]. When choosing the best treatment options for patients with degenerative spinal diseases, it is necessary to consider their preferences and values [ 72 , 73 ]. Frailty assessment can help patients and their families make informed decisions before surgery. This highlights the need for future studies to determine the association between frailty and PROs in patients with degenerative spinal disease.

We identified the typologies of postoperative health-related outcomes associated with preoperative frailty in patients who underwent spine surgery for degenerative spinal disease. These typologies can inform the content and structure of pre-habilitation and customized educational programs for patients undergoing spine surgery. They can also be used as basic data for implementing programs or pathways to reverse frailty in patients with spinal diseases and improve their health-related outcomes. Furthermore, the identified typologies can help develop tools to evaluate frailty-associated health-related outcomes in patients undergoing spine and other surgeries.

Finally, frailty is an important prognostic marker for postoperative health-related outcomes in patients with degenerative spinal disease, but there is a lack of consensus on the best means to accurately and efficiently determine frailty in patients undergoing spine surgery. In this review and meta-analysis, 10 different frailty instruments (including the mFI-5, mFI-11, and ASD-FI) were used to define frailty, and variability was observed even among evaluations using the same tool. A review of 14 different tools used for the assessment of frailty in a population undergoing spine surgery (age: >18 years) revealed wide variabilities in the tool components, time required to complete the assessment, and efficacy of outcome prediction among the tools [ 74 ]. Furthermore, significant heterogeneity was observed among the tools with respect to the cut-off values for risk establishment and stratification. In acute care hospitals, it is difficult to determine the most suitable tool for clinical practice. Future studies must prospectively validate frailty tools to confirm their effectiveness and applicability as reliable risk-stratification tools for the diagnosis of frailty among patients with degenerative spinal disease.

This study has some limitations. First, a meta-analysis of some items could not be performed due to data heterogeneity. Specifically, although all patients underwent spine surgery, the severity of the surgery differed among the studies because of a mixture of fusion and decompression. Furthermore, the detection of COs differed due to a mixture of prospective and retrospective studies. There were also inconsistencies among the studies in the definition of frailty and in the tools used to measure it. Second, fewer than half of the included studies were included in the meta-analyses due to insufficient data (e.g., some studies reported only ratios for comparison; when the same patients were drawn from the same database, only the study published first was considered). Third, because there were fewer than 10 studies in each meta-analysis, we could not evaluate publication bias.

The number of patients undergoing spine surgery for degenerative spinal diseases is increasing. Thus, despite the aforementioned limitations, our study is of high clinical value because it evaluated the effects of frailty on the health-related outcomes of these patients. Our findings can guide future studies and aid healthcare professionals who treat patients with degenerative spinal diseases.

This systematic review and meta-analysis identified frailty as a strong predictor of COs in patients after spine surgery; however, the association between preoperative frailty and PROs remains inconclusive. Further studies are needed to investigate the association between frailty and PROs. With the increasing number of frail patients undergoing spine surgery for degenerative spinal diseases, healthcare professionals should be aware of the effects of frailty and develop improved and focused perioperative management strategies stratified by frailty status. In particular, the development of interventions comprising treatment goals and plans that consider preoperative frailty as a risk factor for mortality and poor functional recovery can be an important cornerstone of preoperative management. Future research should focus on the development and implementation of interventions that could potentially improve postoperative cognitive, functional, and adverse outcomes in frail patients undergoing spine surgery.

Data Availability

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Abbreviations

PRO: Patient-reported outcome

FI: Frailty index

mFI: Modified frailty index

ASD: Adult spinal deformity

CD: Cervical deformity

RoBANS: The Risk of Bias Assessment Tool for Nonrandomized Studies

CI: Confidence interval

MD: Mean difference

JOA: Japanese Orthopaedic Association

ODI: Oswestry Disability Index

SF-36: 36-Item Short Form Survey

Yolcu YU, Helal A, Alexander AY, Bhatti AU, Alvi MA, Abode-Iyamah K, Bydon M. Minimally invasive versus open surgery for degenerative spine disorders for elderly patients: experiences from a single institution. World Neurosurg. 2021;146:e1262–9. https://doi.org/10.1016/j.wneu.2020.11.145

Martin BI, Mirza SK, Spina N, Spiker WR, Lawrence B, Brodke DS. Trends in lumbar fusion procedure rates and associated hospital costs for degenerative spinal diseases in the United States, 2004 to 2015. Spine (Phila Pa 1976). 2019;44(5):369–76. https://doi.org/10.1097/brs.0000000000002822

Beschloss A, Dicindio C, Lombardi J, Varthi A, Ozturk A, Lehman R, Lenke L, Saifi C. Marked increase in spinal deformity surgery throughout the United States. Spine (Phila Pa 1976). 2021;46(20):1402–8. https://doi.org/10.1097/brs.0000000000004041

Kobayashi K, Ando K, Nishida Y, Ishiguro N, Imagama S. Epidemiological trends in spine surgery over 10 years in a multicenter database. Eur Spine J. 2018;27(8):1698–703. https://doi.org/10.1007/s00586-018-5513-4

Neifert SN, Martini ML, Yuk F, McNeill IT, Caridi JM, Steinberger J, Oermann EK. Predicting trends in cervical spinal surgery in the United States from 2020 to 2040. World Neurosurg. 2020;141:e175–81. https://doi.org/10.1016/j.wneu.2020.05.055

Puvanesarajah V, Jain A, Kebaish K, Shaffrey CI, Sciubba DM, De la Garza-Ramos R, Khanna AJ, Hassanzadeh H. Poor nutrition status and lumbar spine fusion surgery in the elderly: readmissions, complications, and mortality. Spine (Phila Pa 1976). 2017;42(13):979–83. https://doi.org/10.1097/brs.0000000000001969

Chan V, Witiw CD, Wilson JR, Wilson JR, Coyte P, Fehlings MG. Frailty is an important predictor of 30-day morbidity in patients treated for lumbar spondylolisthesis using a posterior surgical approach. Spine J. 2021. https://doi.org/10.1016/j.spinee.2021.08.008

Hirase T, Haghshenas V, Bratescu R, Dong D, Kuo PH, Rashid A, Kavuri V, Hanson DS, Meyer BC, Marco RAW. Sarcopenia predicts perioperative adverse events following complex revision surgery for the thoracolumbar spine. Spine J. 2021;21(6):1001–9. https://doi.org/10.1016/j.spinee.2021.02.001

Hanna K, Ditillo M, Joseph B. The role of frailty and prehabilitation in surgery. Curr Opin Crit Care. 2019;25(6):717–22. https://doi.org/10.1097/mcc.0000000000000669

Leven DM, Lee NJ, Kothari P, Steinberger J, Guzman J, Skovrlj B, Shin JI, Caridi JM, Cho SK. Frailty index is a significant predictor of complications and mortality after surgery for adult spinal deformity. Spine (Phila Pa 1976). 2016;41(23):E1394–E1401. https://doi.org/10.1097/brs.0000000000001886

Susano MJ, Grasfield RH, Friese M, Rosner B, Crosby G, Bader AM, Kang JD, Smith TR, Lu Y, Groff MW, et al. Brief preoperative screening for frailty and cognitive impairment predicts delirium after spine surgery. Anesthesiology. 2020;133(6):1184–91. https://doi.org/10.1097/aln.0000000000003523

Article   CAS   PubMed   Google Scholar  

Charest-Morin R, Street J, Zhang H, Roughead T, Ailon T, Boyd M, Dvorak M, Kwon B, Paquette S, Dea N, et al. Frailty and Sarcopenia do not predict adverse events in an elderly population undergoing non-complex primary elective Surgery for degenerative conditions of the lumbar spine. Spine J. 2018;18(2):245–54. https://doi.org/10.1016/j.spinee.2017.07.003 .

Brown AE, Lebovic J, Alas H, Pierce KE, Bortz CA, Ahmad W, Naessig S, Hassanzadeh H, Labaran LA, Puvanesarajah V, et al. A cost utility analysis of treating different adult spinal deformity frailty states. J Clin Neurosci. 2020;80:223–8. https://doi.org/10.1016/j.jocn.2020.07.047 .

Hannah TC, Neifert SN, Caridi JM, Martini ML, Lamb C, Rothrock RJ, Yuk FJ, Gilligan J, Genadry L, Gal JS. Utility of the Hospital Frailty Risk score for Predicting adverse outcomes in degenerative spine Surgery cohorts. Neurosurgery. 2020;87(6):1223–30. https://doi.org/10.1093/neuros/nyaa248 .

Shaw JF, Budiansky D, Sharif F, McIsaac DI. The Association of Frailty with outcomes after Cancer Surgery: a systematic review and metaanalysis. Ann Surg Oncol. 2022. https://doi.org/10.1245/s10434-021-11321-2 .

Li B, Meng X, Zhang X, Hai Y. Frailty as a risk factor for postoperative Complications in adult patients with degenerative scoliosis administered posterior single approach, long-segment corrective Surgery: a retrospective cohort study. BMC Musculoskelet Disord. 2021;22(1):333. https://doi.org/10.1186/s12891-021-04186-9 .

Article   PubMed   PubMed Central   Google Scholar  

Chan V, Wilson JRF, Ravinsky R, Badhiwala JH, Jiang F, Anderson M, Yee A, Wilson JR, Fehlings MG. Frailty adversely affects outcomes of patients undergoing spine Surgery: a systematic review. Spine J. 2021;21(6):988–1000. https://doi.org/10.1016/j.spinee.2021.01.028 .

Handforth C, Clegg A, Young C, Simpkins S, Seymour MT, Selby PJ, Young J. The prevalence and outcomes of frailty in older cancer patients: a systematic review. Ann Oncol. 2015;26(6):1091–101. https://doi.org/10.1093/annonc/mdu540 .

Dai S, Yang M, Song J, Dai S, Wu J. Impacts of Frailty on Prognosis in Lung Cancer patients: a systematic review and Meta-analysis. Front Med (Lausanne). 2021;8:715513. https://doi.org/10.3389/fmed.2021.715513 .

Fried LP, Tangen CM, Walston J, Newman AB, Hirsch C, Gottdiener J, Seeman T, Tracy R, Kop WJ, Burke G, et al. Frailty in older adults: evidence for a phenotype. J Gerontol A Biol Sci Med Sci. 2001;56(3):M146–156. https://doi.org/10.1093/gerona/56.3.m146 .

Rockwood K, Song X, MacKnight C, Bergman H, Hogan DB, McDowell I, Mitnitski A. A global clinical measure of fitness and frailty in elderly people. CMAJ. 2005;173(5):489–95. https://doi.org/10.1503/cmaj.050051 .

Miller EK, Ailon T, Neuman BJ, Klineberg EO, Mundis GM Jr., Sciubba DM, Kebaish KM, Lafage V, Scheer JK, Smith JS, et al. Assessment of a Novel Adult Cervical deformity Frailty Index as a component of preoperative risk stratification. World Neurosurg. 2018;109:e800–6. https://doi.org/10.1016/j.wneu.2017.10.092 .

Ali R, Schwalb JM, Nerenz DR, Antoine HJ, Rubinfeld I. Use of the modified frailty index to predict 30-day morbidity and mortality from spine Surgery. J Neurosurg Spine. 2016;25(4):537–41. https://doi.org/10.3171/2015.10.Spine14582 .

Shahrestani S, Ton A, Chen XT, Ballatori AM, Wang JC, Buser Z. The influence of frailty on postoperative Complications in geriatric patients receiving single-level lumbar fusion Surgery. Eur Spine J. 2021;30(12):3755–62. https://doi.org/10.1007/s00586-021-06960-8 .

Higgins JP, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA. Cochrane handbook for systematic reviews of interventions version 6.3. (updated Feburary 2022). In.: Cochrane Handbook; 2022.

Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210. https://doi.org/10.1186/s13643-016-0384-4 .

Kim SY, Park JE, Lee YJ, Seo HJ, Sheen SS, Hahn S, Jang BH, Son HJ. Testing a tool for assessing the risk of bias for nonrandomized studies showed moderate reliability and promising validity. J Clin Epidemiol. 2013;66(4):408–14. https://doi.org/10.1016/j.jclinepi.2012.09.016 .

Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60. https://doi.org/10.1136/bmj.327.7414.557 .

Passias PG, Bortz CA, Segreto FA, Horn SR, Lafage R, Lafage V, Smith JS, Line B, Kim HJ, Eastlack R, et al. Development of a modified cervical deformity Frailty Index: a Streamlined Clinical Tool for Preoperative Risk Stratification. Spine (Phila Pa 1976). 2019;44(3):169–76. https://doi.org/10.1097/brs.0000000000002778 .

Chang SY, Son J, Park SM, Chang BS, Lee CK, Kim H. Predictive Value of Comprehensive Geriatric Assessment on early postoperative Complications following lumbar spinal stenosis Surgery: a prospective cohort study. Spine (Phila Pa 1976). 2020;45(21):1498–505. https://doi.org/10.1097/brs.0000000000003597 .

Elsamadicy AA, Freedman IG, Koo AB, David WB, Reeves BC, Havlik J, Pennington Z, Kolb L, Shin JH, Sciubba DM. Modified-frailty index does not independently predict Complications, hospital length of stay or 30-day readmission rates following posterior lumbar decompression and fusion for spondylolisthesis. Spine J. 2021;21(11):1812–21. https://doi.org/10.1016/j.spinee.2021.05.011 .

Flexman AM, Charest-Morin R, Stobart L, Street J, Ryerson CJ. Frailty and postoperative outcomes in patients undergoing Surgery for degenerative spine Disease. Spine J. 2016;16(11):1315–23. https://doi.org/10.1016/j.spinee.2016.06.017 .

Jung JM, Chung CK, Kim CH, Yang SH, Ko YS. The modified 11-Item Frailty Index and postoperative outcomes in patients undergoing lateral lumbar Interbody Fusion. Spine (Phila Pa 1976). 2022;47(5):396–404. https://doi.org/10.1097/brs.0000000000004260 .

Kang T, Park SY, Lee JS, Lee SH, Park JH, Suh SW. Predicting postoperative Complications in patients undergoing lumbar spinal fusion by using the modified five-item frailty index and nutritional status. Bone Joint J. 2020;102–b(12):1717–22. https://doi.org/10.1302/0301-620x.102b12.Bjj-2020-0874.R1 .

Kim JY, Park IS, Kang DH, Lee YS, Kim KT, Hong SJ. Prediction of risk factors after spine Surgery in patients aged > 75 years using the modified Frailty Index. J Korean Neurosurg Soc. 2020;63(6):827–33. https://doi.org/10.3340/jkns.2020.0019 .

Kim DU, Park HK, Lee GH, Chang JC, Park HR, Park SQ, Cho SJ. Central Sarcopenia, Frailty and Comorbidity as Predictor of Surgical Outcome in Elderly patients with degenerative spine Disease. J Korean Neurosurg Soc. 2021;64(6):995–1003. https://doi.org/10.3340/jkns.2021.0074 .

Miller EK, Neuman BJ, Jain A, Daniels AH, Ailon T, Sciubba DM, Kebaish KM, Lafage V, Scheer JK, Smith JS, et al. An assessment of frailty as a tool for risk stratification in adult spinal deformity Surgery. Neurosurg Focus. 2017;43(6):E3. https://doi.org/10.3171/2017.10.Focus17472 .

Miller EK, Vila-Casademunt A, Neuman BJ, Sciubba DM, Kebaish KM, Smith JS, Alanay A, Acaroglu ER, Kleinstück F, Obeid I, et al. External validation of the adult spinal deformity (ASD) frailty index (ASD-FI). Eur Spine J. 2018;27(9):2331–8. https://doi.org/10.1007/s00586-018-5575-3 .

Miller EK, Lenke LG, Neuman BJ, Sciubba DM, Kebaish KM, Smith JS, Qiu Y, Dahl BT, Pellisé F, Matsuyama Y, et al. External validation of the adult spinal deformity (ASD) Frailty Index (ASD-FI) in the Scoli-RISK-1 patient database. Spine (Phila Pa 1976). 2018;43(20):1426–31. https://doi.org/10.1097/brs.0000000000002717 .

Passias PG, Moattari K, Pierce KE, Passfall L, Krol O, Naessig S, Ahmad W, Schoenfeld AJ, Ahmad S, Singh V, et al. Performance of the modified adult spinal deformity Frailty Index (mASD-FI) in Preoperative Risk Assessment. Spine (Phila Pa 1976). 2022. https://doi.org/10.1097/brs.0000000000004342 .

Phan K, Kim JS, Lee NJ, Somani S, Di Capua J, Kothari P, Leven D, Cho SK. Frailty is associated with morbidity in adults undergoing elective anterior lumbar interbody fusion (ALIF) Surgery. Spine J. 2017;17(4):538–44. https://doi.org/10.1016/j.spinee.2016.10.023 .

Pierce KE, Passias PG, Alas H, Brown AE, Bortz CA, Lafage R, Lafage V, Ames C, Burton DC, Hart R, et al. Does patient Frailty Status Influence Recovery following spinal Fusion for adult spinal deformity? An analysis of patients with 3-Year follow-up. Spine (Phila Pa 1976). 2020;45(7):E397–e405. https://doi.org/10.1097/brs.0000000000003288 .

Pierce KE, Passias PG, Daniels AH, Lafage R, Ahmad W, Naessig S, Lafage V, Protopsaltis T, Eastlack R, Hart R, et al. Baseline Frailty Status influences recovery patterns and outcomes following alignment correction of cervical deformity. Neurosurgery. 2021;88(6):1121–7. https://doi.org/10.1093/neuros/nyab039 .

Pierce KE, Naessig S, Kummer N, Larsen K, Ahmad W, Passfall L, Krol O, Bortz C, Alas H, Brown A, et al. The five-item modified Frailty Index is predictive of 30-day postoperative Complications in patients undergoing spine Surgery. Spine (Phila Pa 1976). 2021;46(14):939–43. https://doi.org/10.1097/brs.0000000000003936 .

Pierce KE, Kapadia BH, Bortz C, Alas H, Brown AE, Diebo BG, Raman T, Jain D, Lebovic J, Passias PG. Frailty Severity impacts Development of Hospital-acquired conditions in patients undergoing corrective Surgery for adult spinal deformity. Clin Spine Surg. 2021;34(7):E377–e381. https://doi.org/10.1097/bsd.0000000000001219 .

Pulido LC, Meyer M, Reinhard J, Kappenschneider T, Grifka J, Weber M. Hospital frailty risk score predicts adverse events in spine Surgery. Eur Spine J. 2022;31(7):1621–9. https://doi.org/10.1007/s00586-022-07211-0 .

Reid DBC, Daniels AH, Ailon T, Miller E, Sciubba DM, Smith JS, Shaffrey CI, Schwab F, Burton D, Hart RA, et al. Frailty and Health-Related Quality of Life Improvement following adult spinal deformity Surgery. World Neurosurg. 2018;112:e548–54. https://doi.org/10.1016/j.wneu.2018.01.079 .

Rothrock RJ, Steinberger JM, Badgery H, Hecht AC, Cho SK, Caridi JM, Deiner S. Frailty status as a predictor of 3-month cognitive and functional recovery following spinal Surgery: a prospective pilot study. Spine J. 2019;19(1):104–12. https://doi.org/10.1016/j.spinee.2018.05.026 .

Shin JI, Kothari P, Phan K, Kim JS, Leven D, Lee NJ, Cho SK. Frailty Index as a predictor of adverse postoperative outcomes in patients undergoing cervical spinal Fusion. Spine (Phila Pa 1976). 2017;42(5):304–10. https://doi.org/10.1097/brs.0000000000001755 .

Sun W, Lu S, Kong C, Li Z, Wang P, Zhang S. Frailty and post-operative outcomes in the older patients undergoing elective posterior Thoracolumbar Fusion Surgery. Clin Interv Aging. 2020;15:1141–50. https://doi.org/10.2147/cia.S245419 .

Ton A, Shahrestani S, Saboori N, Ballatori AM, Chen XT, Wang JC, Buser Z. The impact of frailty on postoperative Complications in geriatric patients undergoing multi-level lumbar fusion Surgery. Eur Spine J. 2022;31(7):1745–53. https://doi.org/10.1007/s00586-022-07237-4 .

Weaver DJ, Malik AT, Jain N, Yu E, Kim J, Khan SN. The modified 5-Item Frailty Index: a concise and useful Tool for assessing the impact of Frailty on postoperative morbidity following elective posterior lumbar fusions. World Neurosurg. 2019. https://doi.org/10.1016/j.wneu.2018.12.168 .

Wilson JRF, Badhiwala JH, Moghaddamjou A, Yee A, Wilson JR, Fehlings MG. Frailty is a better predictor than age of mortality and perioperative Complications after Surgery for degenerative cervical myelopathy: an analysis of 41,369 patients from the NSQIP database 2010–2018. J Clin Med. 2020;9(11). https://doi.org/10.3390/jcm9113491 .

Yagi M, Fujita N, Okada E, Tsuji O, Nagoshi N, Tsuji T, Asazuma T, Nakamura M, Matsumoto M, Watanabe K. Impact of Frailty and Comorbidities on Surgical outcomes and Complications in adult spinal disorders. Spine (Phila Pa 1976). 2018;43(18):1259–67. https://doi.org/10.1097/brs.0000000000002596 .

Yagi M, Michikawa T, Hosogane N, Fujita N, Okada E, Suzuki S, Tsuji O, Nagoshi N, Asazuma T, Tsuji T, et al. The 5-Item modified Frailty Index is predictive of severe adverse events in patients undergoing Surgery for adult spinal deformity. Spine (Phila Pa 1976). 2019;44(18):E1083–e1091. https://doi.org/10.1097/brs.0000000000003063 .

Zreik J, Alvi MA, Yolcu YU, Sebastian AS, Freedman BA, Bydon M. Utility of the 5-Item modified Frailty Index for Predicting adverse outcomes following elective Anterior Cervical Discectomy and Fusion. World Neurosurg. 2021;146:e670–7. https://doi.org/10.1016/j.wneu.2020.10.154 .

Dindo D, Demartines N, Clavien PA. Classification of Surgical Complications: a new proposal with evaluation in a cohort of 6336 patients and results of a survey. Ann Surg. 2004;240(2):205–13. https://doi.org/10.1097/01.sla.0000133083.54934.ae .

Glassman SD, Hamill CL, Bridwell KH, Schwab FJ, Dimar JR, Lowe TG. The impact of perioperative Complications on clinical outcome in adult deformity Surgery. Spine (Phila Pa 1976). 2007;32(24):2764–70. https://doi.org/10.1097/BRS.0b013e31815a7644 .

Glassman SD, Alegre G, Carreon L, Dimar JR, Johnson JR. Perioperative Complications of lumbar instrumentation and fusion in patients with Diabetes Mellitus. Spine J. 2003;3(6):496–501. https://doi.org/10.1016/s1529-9430(03)00426-1 .

Lee JA, Yanagawa B, An KR, Arora RC, Verma S, Friedrich JO. Frailty and pre-frailty in cardiac Surgery: a systematic review and meta-analysis of 66,448 patients. J Cardiothorac Surg. 2021;16(1):184. https://doi.org/10.1186/s13019-021-01541-8 .

Bezzina K, Fehlmann CA, Guo MH, Visintini SM, Rubens FD, Wells GA, Mazzola R, McGuinty C, Huang A, Khoury L, et al. Influence of preoperative frailty on quality of life after cardiac Surgery: protocol for a systematic review and meta-analysis. PLoS ONE. 2022;17(2):e0262742. https://doi.org/10.1371/journal.pone.0262742 .

Article   CAS   PubMed   PubMed Central   Google Scholar  

Alkadri J, Hage D, Nickerson LH, Scott LR, Shaw JF, Aucoin SD, McIsaac DI. Anesth Analg. 2021;133(5):1094–106. https://doi.org/10.1213/ane.0000000000005595 . A Systematic Review and Meta-Analysis of Preoperative Frailty Instruments Derived From Electronic Health Data.

Kennedy CA, Shipway D, Barry K. Frailty and emergency abdominal Surgery: a systematic review and meta-analysis. Surgeon. 2021. https://doi.org/10.1016/j.surge.2021.11.009 .

Nidadavolu LS, Ehrlich AL, Sieber FE, Oh ES. Preoperative evaluation of the Frail patient. Anesth Analg. 2020;130(6):1493–503. https://doi.org/10.1213/ane.0000000000004735 .

Gill TM, Baker DI, Gottschalk M, Gahbauer EA, Charpentier PA, de Regt PT, Wallace SJ. A prehabilitation program for physically frail community-living older persons. Arch Phys Med Rehabil. 2003;84(3):394–404. https://doi.org/10.1053/apmr.2003.50020 .

Clegg A, Young J, Iliffe S, Rikkert MO, Rockwood K. Frailty in elderly people. Lancet. 2013;381(9868):752–62. https://doi.org/10.1016/s0140-6736(12)62167-9 .

Picca A, Coelho-Junior HJ, Calvani R, Marzetti E, Vetrano DL. Biomarkers shared by frailty and sarcopenia in older adults: a systematic review and meta-analysis. Ageing Res Rev. 2022;73:101530. https://doi.org/10.1016/j.arr.2021.101530 .

da Silva VD, Tribess S, Meneguci J, Sasaki JE, Garcia-Meneguci CA, Carneiro JAO, Virtuoso JS. Jr. Association between frailty and the combination of physical activity level and sedentary behavior in older adults. BMC Public Health. 2019;19(1):709. https://doi.org/10.1186/s12889-019-7062-0 .

Apóstolo J, Cooke R, Bobrowicz-Campos E, Santana S, Marcucci M, Cano A, Vollenbroek-Hutten M, Germini F, D’Avanzo B, Gwyther H, et al. Effectiveness of interventions to prevent pre-frailty and frailty progression in older adults: a systematic review. JBI Database System Rev Implement Rep. 2018;16(1):140–232. https://doi.org/10.11124/jbisrir-2017-003382 .

Yoshida G, Boissiere L, Larrieu D, Bourghli A, Vital JM, Gille O, Pointillart V, Challier V, Mariey R, Pellisé F, et al. Advantages and disadvantages of adult spinal deformity Surgery and its impact on Health-Related Quality of Life. Spine (Phila Pa 1976). 2017;42(6):411–9. https://doi.org/10.1097/brs.0000000000001770 .

Blondel B, Schwab F, Ungar B, Smith J, Bridwell K, Glassman S, Shaffrey C, Farcy JP, Lafage V. Impact of magnitude and percentage of global sagittal plane correction on health-related quality of life at 2-years follow-up. Neurosurgery. 2012;71(2):341–8. https://doi.org/10.1227/NEU.0b013e31825d20c0 . discussion 348.

Smith MA. The Role of Shared decision making in patient-centered care and Orthopaedics. Orthop Nurs. 2016;35(3):144–9. https://doi.org/10.1097/nor.0000000000000243 .

Charles C, Whelan T, Gafni A. What do we mean by partnership in making decisions about treatment? BMJ. 1999;319(7212):780–2. https://doi.org/10.1136/bmj.319.7212.780 .

Moskven E, Charest-Morin R, Flexman AM, Street JT. The measurements of frailty and their possible application to spinal conditions: a systematic review. Spine J. 2022. https://doi.org/10.1016/j.spinee.2022.03.014 .

Download references

Acknowledgements

We are grateful to Euna Ju of the research staff for supporting this study.

This work was supported by the National Research Foundation of Korea grant to WB, which is funded by the Korea government [Ministry of Science and ICT; grant number NRF-2021R1G1A1093450].

Author information

Authors and affiliations.

College of Nursing, Gyeongsang National University, Jinju-si, Gyeongsangnam-do, South Korea

Wonhee Baek

College of Nursing, Daegu Catholic University, Daegu-si, South Korea

Sun-Young Park

Department of Nursing, College of Healthcare Sciences, Far East University, Eumseong-gun, Chungcheongbuk-do, South Korea

Yoonjoo Kim

Contributions

Conceptualization, Methodology, Formal analysis, Investigation: WB, YK; Software, Visualization: WB; Writing—Original draft: WB, YK; Writing—reviewing & Editing: WB, YK, SP; Supervision: YK; Funding acquisition: WB.

Corresponding author

Correspondence to Yoonjoo Kim .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Baek, W., Park, SY. & Kim, Y. Impact of frailty on the outcomes of patients undergoing degenerative spine surgery: a systematic review and meta-analysis. BMC Geriatr 23, 771 (2023). https://doi.org/10.1186/s12877-023-04448-2

Received: 17 October 2022

Accepted: 01 November 2023

Published: 23 November 2023

DOI: https://doi.org/10.1186/s12877-023-04448-2

  • Meta-analysis
  • Patient-reported outcome measures
  • Spine surgery
  • Systematic review

BMC Geriatrics

ISSN: 1471-2318

Deliver reliable test results with greater confidence and ease

Our portfolio of software applications aids in the analysis and evaluation of data generated with our MICA, KIR, HLA typing, and antibody detection tests. By assisting with the assignment of typing and antibody detection results, our software further increases testing efficiency.

Software that increases testing efficiency

HLA analysis software

HistoTrac services

Streamline data access with our software

HistoTrac lab system

The HistoTrac software is a widely used tool in transplant laboratories, organ procurement organizations and blood centers for donor and patient management.

HistoTrac modules

Explore our modules and choose the configuration that suits your laboratory needs.

We provide an array of services that facilitate the implementation process and help develop your total HistoTrac system.

HLA Fusion software

Powerful analysis modules are specifically designed to work with One Lambda molecular typing and antibody screening products.

HLA Fusion research software

Used for easy analysis of molecular typing and antibody screening results. Designed to support One Lambda Research products.

SureTyper™ (for HLA)

Assess LinkSēq HLA products with this software that is simple, fast, automated, and requires minimal training.

SureTyper™ Blood

Assess LinkSēq blood products with this software that is simple, fast, automated, and requires minimal training.

TypeStream Visual NGS Analysis

Provides automated and streamlined analysis of One Lambda NGS products with a wide range of analytical tools, run statistics and quality metrics to facilitate examination and reinforce decision making.

uTYPE Dx/HLA Sequence Analysis

Used for easy analysis of SeCore HLA sequence-based typing (SBT) products.

Not all products are CE marked or have 510(k) clearance for sale in the U.S. Availability of products in each country depends on local regulatory marketing authorization status.

Lead EMR Analyst - Epic Research

Job posting for Lead EMR Analyst - Epic Research at Cincinnati Children's Hospital.

Expected Starting Salary Range: 47.04 - 60.09

SUBFUNCTION DEFINITION:

Join our dynamic team as a Lead Epic Analyst on our Clinical Research Informatics team, where you'll be leveraging technology to support advancement of our clinical research management functions in Epic. As a key member of our team, you will support access to tools in Epic to address compliance Risk, increase transparency and ultimately Safety, support Operational Efficiency & Effectiveness, and enhance Recruitment of research participants. This role is ideal for individuals with a passion for collaborating with researchers at the intersection of research, health care, and technology to help accelerate discoveries and improve health outcomes.

REPRESENTATIVE RESPONSIBILITIES

What You'll Do:

As a Lead EMR Analyst, you will utilize your extensive clinical research background, technical expertise, and leadership skills to drive the advancement of Epic Research and clinical trial management initiatives. You will take on a strategic and impactful role within the team, guiding and mentoring junior analysts while spearheading key projects and initiatives. You'll work collaboratively with the team, taking on a range of assignments related to new user training, system updates, and maintenance. You'll have the opportunity to lead projects driving the expansion of Epic Research tools, technical support, and ensuring exceptional customer experiences.

  • Ensure the integrity and optimal functionality of technical tools, addressing technical issues, implementing updates, and optimizing system tools.
  • Configure Epic Research tools to align with the specific needs of research initiatives.
  • Utilize the development lifecycle process, operating procedures, and documentation for implementing and supporting system solutions.
  • Collaborate on strategic planning to support research enterprise software/technology needs.
  • Mentor and assist team members, actively participating in the team's vision and direction.
  • Contribute to short-range and long-range planning for the team.
  • Lead activities of personnel for assigned projects and perform supervisory tasks, including evaluations for direct reports.
  • Engage in meetings, workshops, and committees for strategic planning related to Epic research initiatives.
  • Contribute to departmental process improvement efforts.
  • Strategize with end users, vendors, and internal colleagues to ensure research applications support community needs.
  • Identify opportunities for process improvement and optimization within Epic Research workflows.
  • Lead the development and implementation of Epic Research tools and functionalities, leveraging understanding of clinical research workflows and regulatory requirements.
  • Develop and manage project plans and related documentation.
  • Manage the team portfolio to ensure balance.
  • Oversee the development and validation requirements for system solutions with the research community.
  • Collect information for future systems development and feasibility studies.
  • Evaluate current systems for quality and utilization.
  • Serve as a subject matter expert on Epic research functionality.

AMIA Annu Symp Proc. v.2012; 2012

A Qualitative Analysis of EHR Clinical Document Synthesis by Clinicians

Oladimeji Farri, David S. Pieckiewicz, Ahmed S. Rahman, Terrence J. Adam, Serguei V. Pakhomov, Genevieve B. Melton

1 Institute for Health Informatics, 2 College of Pharmacy, and 3 Department of Surgery, University of Minnesota, Minneapolis

Clinicians utilize electronic health record (EHR) systems during time-constrained patient encounters where large amounts of clinical text must be synthesized at the point of care. Qualitative methods may be an effective approach for uncovering cognitive processes associated with the synthesis of clinical documents within EHR systems. We utilized a think-aloud protocol and content analysis with the goal of understanding cognitive processes and barriers involved as medical interns synthesized patient clinical documents in an EHR system to accomplish routine clinical tasks. Overall, interns established correlations of significance and meaning between problem, symptom and treatment concepts to inform hypothesis generation and clinical decision-making. Barriers to synthesizing EHR documents included difficulty searching for patient data, poor readability, redundancy, and unfamiliar specialized terms. Our study can inform recommendations for future designs of EHR clinical document user interfaces to aid clinicians in providing improved patient care.

1. Introduction

The transition from paper-based media to electronic health record (EHR) systems, supported by recent national mandates for the implementation of health information technology (HIT), provides unprecedented access to vast amounts of diverse clinical data at the point of care. However, clinicians are often challenged by the ‘disconnect’ between current implementations of EHR systems and the complexities of clinical decision-making, including the organization of text-based clinical information within these systems.

Medical cognitive science emphasizes the complex nature of clinical reasoning and the significance of knowledge representation in medical decision-making. Clinicians use a range of ongoing cognitive processes to construct mental models that aptly reflect clinical scenarios and assist in making effective clinical decisions (1).

Over the last decade, an important focus of informatics research has been the development and evaluation of EHR user interfaces that satisfy clinicians’ information needs, reduce the cognitive load of information retrieval, and improve the learning process involved in using these systems (2–4). Improving our understanding of clinicians’ cognitive processes, specifically those surrounding the use of text-based EHR clinical documents, could improve user-centered cognitive models and aid the design of clinical document user interfaces (5, 6). The objective of this study was to gain insight into the cognitive processes of clinicians as they synthesize information from an EHR prototype, concentrating specifically on the use of text-based clinical documents as primary data sources.

2. Background

2.1. Information Overload within EHR Systems

In clinical practice, complex data processing remains an integral aspect of problem-solving strategies utilized by experts, sub-experts and novices alike ( 6 , 7 ). Timely access to patient information relevant to routine and emergency clinical processes determines the clinician’s familiarity with clinical concepts and the context of clinical situations. EHR implementations provide clinicians with rich and extensive patient-specific information from a large number of sources in multiple and different formats ( 8 , 9 ). Narratives (free text) recorded in the EHR by clinicians (physicians, nurses, and other healthcare providers) as they care for patients are contained in documents that hold significant information over and above structured data such as vital signs and laboratory results. However, the quantity of information within these documents can be overwhelming, thereby posing cognitive challenges to clinicians reading and using these documents. As a result, the problem of information accessibility at the point of care is shifting from ‘having too little’ to ‘having too much’ ( 10 ).

One factor responsible for excessive information within EHR clinical documents is the frequent and often indiscriminate ‘copying and pasting’ of redundant patient information in an attempt to accurately capture pertinent details from previous clinical encounters and provide sufficient information for billing purposes. Despite the time-savings and administrative benefits facilitated by the ‘stand-alone’ clinical encounter documents, transferring information from one clinical document to another may propagate unidentified errors which could have adverse effects on patient management ( 11 ).

2.2. Cognitive Demands at the Point of Care

Assuming a cognitive network of humans and computers as the fundamental unit of analysis, some experts draw attention to the dynamic and radical changes in a given professional or social environment following the introduction of new computer technology ( 12 ). To mitigate possible increases in cognitive demands associated with learning HIT, it is important to consider information processing techniques at the point of clinical care and the extent to which these activities can be influenced by the implementation of EHR systems.

Within the time-constrained and considerably stressful patient encounter, the clinician’s cognitive efforts are devoted to consuming relevant information from previously documented clinical encounters in order to construct a mental model representative of the patient’s situation. Constructing this mental model becomes even more taxing when an unfamiliar patient’s medical record is to be reviewed during a first-time clinical encounter.

Synthesizing EHR clinical documents requires allocation of cognitive resources to processing both novel and familiar information. According to experts, no more than two or three novel information elements can be processed adequately at any one time by the working memory (WM) - a division of the human cognitive architecture where all conscious information processing takes place. Therefore, when clinicians review multiple clinical documents associated with unfamiliar clinical scenarios, the WM likely experiences a cognitive burden that has detrimental effects on professional motivation and productivity.

Drawing on cognitive load theory, the cognitive load associated with reviewing large amounts of EHR clinical documents may depend on how information is presented to clinicians and on the range of actions required to access the information in a format that is easy to consume ( Figure 1 ) ( 13 , 14 ). If patient information within an EHR clinical document user interface is poorly organized and requires laborious ‘browsing’ to derive critical data, system users may become frustrated and less motivated to be thorough, resulting in an increased propensity for erroneous clinical judgment.

Figure 1. Application of Cognitive Load Theory to EHR Clinical Document Synthesis

2.3. Think-Aloud Protocol

Critical thinking can be represented as sequences of thoughts or cognitive states separated by processing activities ( 15 ). The think-aloud (TA) protocol, a scientific method through which human cognitive activities can be made verbal, was first highlighted in the mid-1940s ( 16 , 17 ). The principle of the TA protocol is to obtain data in the form of verbalized statements in order to investigate cognitive processes relative to certain human activities.

Based on principles of information processing theory, the TA protocol uses simulations of problem-solving tasks to elicit verbal reports that potentially reveal and describe which information is being analyzed and how the information is structured or reconfigured within the WM during a problem-solving activity ( 18 , 19 ). Evidence supporting the use of the TA protocol includes the observations that (a) human cognition can be described as a sequence of internal states typically transformed by information processing, (b) these sequences of internal states can be externalized through verbalizations, and (c) recently acquired information which has become the focus of an individual’s concentration can be accessed directly as verbal data ( 18 , 20 ).

Clinicians were observed as they interacted with clinical documents within a prototype EHR system. The prototype EHR system was designed based on the user interface framework of the Veterans Affairs’ computerized patient record system (VistA CPRS) and provided basic functionalities available in most EHR systems for reviewing clinical documents, e.g. a read-only document viewer and lists of authored clinical documents that can be sorted by date. Clinicians were asked to verbalize their thought processes (TA protocol) while reviewing clinical documents in the context of accomplishing a set of routine clinical tasks. The think-aloud audio provided qualitative data that were synchronized with the screen display and navigation on the EHR system, which were captured by a video camera in a controlled environment ( 21 ) ( Figure 2 ). A content analysis of the clinicians’ verbalizations while accomplishing the clinical tasks was also performed. Approval for this study was obtained from the University of Minnesota Institutional Review Board.

Figure 2. Overview of Think Aloud Experiment

3.1. Study Sample

A purposive sample of clinical interns was recruited for our study based on similar sample sizes in studies with qualitative analysis of medical cognition and clinical decision-making ( 21 – 23 ). We restricted participation to intern-level physicians in order to control for differences in cognitive processes and medical decision-making techniques due to varying clinical expertise.

3.2. Experimental Design

Each intern reviewed nine patient records from the Fairview Health Services at the University of Minnesota Medical Center. These records contained free-text documentations of eight to nine office visits related to the management of chronic medical diagnoses such as type 2 diabetes mellitus and essential hypertension. The interns reviewed the records while performing routine clinical tasks within a simulated clinical setting.

With the assistance of two experienced clinicians (GBM, TJA), we developed clinical practice scenarios requiring ongoing assessments of clinical documents within the EHR system. An example of a clinical practice scenario is given below:

Ms XXX visited the emergency department (ED) today with a 24-hour history of fever and pain in her right flank. She has vomited thrice since yesterday and still feels nauseated. Her temperature today is 102.4°F while her BP is 120/74 mmHg. Please develop an admission note for this patient.

As the interns performed the clinical tasks using the patient records within the EHR system, the observing researcher interrupted only after a short period of silence (15 – 20 seconds), in order to prompt the intern to continue ‘thinking aloud’.

Six clinical interns were observed as they utilized the text-based clinical documents for these controlled patient scenarios. The average length of a scenario observation was 18.96 minutes. The technical expertise of the interns, in terms of EHR system use, ranged from intermediate to professional; each intern was familiar with at least three different vendor-based EHR systems from their clinical rotations. The sample comprised 2 male and 4 female interns between 26 and 30 years of age. Overall, 853 minutes of observations were transcribed and analyzed using the QSR NVivo (version 9) qualitative analysis software.

4.1. Protocol Analysis

We reviewed all study transcripts to enable familiarity and to identify general impressions from the observational data. Consideration of our study objectives and literature on medical decision-making research and the use of think aloud protocols resulted in the use of a three-step coding scheme for the analysis of the study transcripts based on recognized frameworks for protocol analysis ( 18 , 24 – 26 ).

4.1.1. Referral Phrase Analysis

As a first step in the protocol analysis, the interns’ verbalizations were organized according to various concepts referred to by the nouns and noun phrases contained in the transcripts. The referral phrases identified were used in defining the concepts that constituted the main focus of intern reasoning as they performed the clinical tasks using the EHR clinical documents. The universe of concepts derived from the referral phrase analysis (RPA) constitutes an ontology for the virtual domain of information synthesis from EHR clinical documents ( 27 ). In order to ensure the validity of this coding procedure, the researcher continued with the RPA until all concepts within the transcribed data were adequately defined and coded ( Table 1 ). During the RPA, when a transcribed statement contained several nouns and/or noun phrases referring to multiple concepts, the statement was coded under all appropriate concepts in order to ensure completeness in the data analysis and to retain the contextual information within the statement. For instance, in the following statement:

“She has had headaches since last fall. So why does it improve with Levaquin? That’s an antibiotic!”

There are words and phrases that refer to the Symptom (She has had headaches…), Time (…since last fall…), and Treatment (…So why does it improve with Levaquin? That’s an antibiotic!) concepts.

Table 1. Referring Phrase Analysis

4.1.2. Assertional Analysis

In the second coding step, assertions made by the interns were coded based on how they determined relationships between verbalized nouns and noun phrases as they performed stated clinical tasks using the EHR clinical documents. The assertional analysis (AA) combines the concepts identified in the RPA with the relationships between these concepts in order to understand the epistemology (the nature, validity and limitations) of information synthesis from EHR clinical documents as reflected by the study participants ( 18 , 27 ). Each statement under the RPA concepts was exclusively coded based on whether the intern established a significative, implicative or causal relationship between concepts in the statement ( Table 2 ). In contrast to the RPA, the AA did not involve multiple coding of the same statement, as each statement was assessed for the dominant relationship or assertion between concepts. For example, in this statement:

“I like that I see some of his past medical history like substance abuse.”

Despite indicating that a past medical history of substance abuse is present in the patient’s record (implicative assertion), the main point of the statement is that the intern asserts the relevance of the past medical history to information processing; thus there is a relationship of significance (significative) between the past medical history ( Problem ) and the intern’s access to clinical information ( Format ).

Table 2. Assertional Analysis
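The difference between the two coding steps can be sketched in code: under the RPA a statement may carry several concept codes, while under the AA it carries at most one dominant relationship. The sketch below is only an illustration. The class and function names are hypothetical, the study itself used NVivo rather than custom code, and computing shares as simple proportions of coded references is an assumption about how the reported percentages could be derived.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Assertion(Enum):
    """The three relationship types named in the assertional analysis."""
    SIGNIFICATIVE = "significative"
    IMPLICATIVE = "implicative"
    CAUSAL = "causal"


@dataclass(frozen=True)
class CodedStatement:
    """One transcribed statement with its codes (hypothetical structure)."""
    text: str
    concepts: FrozenSet[str]        # RPA: every concept the statement refers to
    assertion: Optional[Assertion]  # AA: the single dominant relationship, if coded


def concept_shares(statements):
    """Share of all RPA concept references that falls under each concept."""
    counts = Counter(c for s in statements for c in s.concepts)
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def assertion_shares(statements):
    """Share of AA-coded statements under each relationship type."""
    counts = Counter(s.assertion for s in statements if s.assertion is not None)
    total = sum(counts.values())
    return {assertion: n / total for assertion, n in counts.items()}


# The two example statements quoted in the text.
statements = [
    CodedStatement(
        text="She has had headaches since last fall. So why does it improve "
             "with Levaquin? That's an antibiotic!",
        concepts=frozenset({"Symptom", "Time", "Treatment"}),
        assertion=None,  # the text does not report an AA code for this statement
    ),
    CodedStatement(
        text="I like that I see some of his past medical history like "
             "substance abuse.",
        concepts=frozenset({"Problem", "Format"}),
        assertion=Assertion.SIGNIFICATIVE,
    ),
]

rpa = concept_shares(statements)
aa = assertion_shares(statements)
```

Keeping `concepts` as a set and `assertion` as a single optional value mirrors the rule that the RPA allows multiple coding of one statement and the AA does not.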

4.1.3. Script Analysis

Script Analysis (SA), the final step in the protocol analysis, was carried out in order to determine the overall configuration of the interns’ cognitive activities during the experiments; the transcribed data were collectively reviewed and analyzed based on a reference frame of cognitive operators ( 24 ). These operators were defined based on the results of preceding analytic steps (RPA and AA). The SA identified predominant reasoning and decision-making processes involved as the EHR clinical documents were synthesized by the interns ( Table 3 ).

Table 3. Script Analysis

To determine interrater reliability, a second researcher with recognized expertise in qualitative analysis (DSP), who was familiar with the coding scheme, analyzed a subset representing 16% of the transcripts. Overall, the mean agreement between the investigators was 82%. Coding discrepancies between the investigators were discussed and addressed for potential overlaps.
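The text reports a mean percent agreement of 82 but does not describe the exact calculation. A minimal sketch, assuming agreement is the share of statements that both raters coded identically, averaged across transcripts, could look as follows. The function names and the example codes are hypothetical.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of statements to which two raters assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both raters must code the same statements")
    if not codes_a:
        raise ValueError("at least one coded statement is required")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def mean_percent_agreement(transcripts):
    """Average of per-transcript agreement over (rater_a, rater_b) code pairs."""
    scores = [percent_agreement(a, b) for a, b in transcripts]
    return sum(scores) / len(scores)


# Hypothetical AA codes for four statements from one transcript.
rater_1 = ["significative", "implicative", "causal", "significative"]
rater_2 = ["significative", "implicative", "significative", "significative"]

print(percent_agreement(rater_1, rater_2))  # 3 of 4 codes match: 75.0
```

Simple percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa is a common alternative, although the text gives no indication that one was used here.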

4.1.4. Cognitive Pathway

There was considerable variation in the concepts and assertions identified as each intern reviewed and synthesized EHR clinical notes within the patient records. The three most frequently occurring RPA concepts were Problem (24%), Treatment (17%), and Symptom (13%), and relationships established between these concepts were mostly those of significance ( Significative, 56%) and meaning ( Implicative, 29%) ( Table 4 ). Based on these findings, in conjunction with operators observed during the SA ( Review, Assume, Explain and Decide ), we constructed a common cognitive pathway associated with the synthesis of EHR clinical documents by the interns ( Figure 3 ).

Figure 3. Common Cognitive Pathway of Interns Synthesizing EHR Clinical Documents

Table 4. References in Referral Phrase (RPA) and Assertional (AA) Analyses

The pathway begins with attentive consideration of presenting complaints/symptoms and generation of hypotheses on etiologies and disease processes responsible for these complaints (A). This is followed by a thorough review of patient-specific facts regarding previous diagnoses (medical and surgical), familial medical conditions, and medically relevant social habits, to gather evidence supporting the clinician’s hypotheses. This process helps to establish new connections between disease processes and presenting symptoms and to distinguish between exacerbations of previous complaints and the onset of new problems (B). To further clarify and establish the clinician’s hypotheses, deductive analysis of medications and other treatment regimens is carried out to determine their correlation with past and ongoing complaints and to ascertain the extent to which these interventions alleviate existing problems (C and D). Finally, based on knowledge acquired from previous clinical experience and evidence gathered via information synthesis, the clinician constructs a mental model that summarizes the presenting clinical scenario, narrows the range of possible diagnoses, and decides on specific clinical interventions to address these diagnoses (E).

4.2. Content Analysis

In order to identify potential barriers to information synthesis from EHR documents, we performed a content analysis of study transcripts and concentrated on themes related to the consumption of EHR documents ( Table 5 ).

Table 5. References in Content Analysis

The main themes from our content analysis included:

Difficulty with Searching for Information : While synthesizing the EHR documents to provide care in line with stated clinical scenarios, clinicians experienced difficulties with searching out vital patient-specific details due to information overload and reduced motivation to find ‘the needle in the haystack’. Inability to identify the pertinent clinical data within EHR documents that clinicians need at the point of care can significantly reduce provider efficiency and the likelihood of delivering quality healthcare. Some comments related to the difficulty in searching include:

“So it’s not really too obvious what the result was. Let’s see… still trying to find out what the pathology said.” “Am I missing something that is in here and I’m just not seeing it? I still don’t see a surgical history.”

Poor Document Readability : The general formatting of the EHR documents, including the layout of the sections within the document, largely determined the quantity and quality of information synthesized from these documents. Trending of past medical diagnoses, medications, and laboratory values was particularly difficult due to poor alignment of dates or incongruent organization of relevant patient information. Interns’ comments on reviewing medications and problem lists include:

“She has a lot of medications. I think it would be even better if they were listed in alphabetical order or some other way that would make them a little bit easier to read.” “This is kind of messy to read. I think this is better than the other list because it has some start and end dates.”

Good versus Bad Redundancy : In most instances, the interns thoroughly reviewed only the most recent document in the electronic patient record and browsed through the rest in search of new information that may be relevant to the clinical task being performed. As highlighted in similar studies ( 9 , 11 ) and suggested by the interns’ verbalizations, the redundant information contained in the older documents constituted a significant cognitive burden and resulted in an increase in time and mental efforts required to review the patient records during the TA protocol. However, valuable insights about the overall clinical picture documented in the patient records often depended on the interns’ review of the redundant information as noted in statements like:

“A lot of redundancy in this note. It doesn’t flow and make the most sense but it had lots of good information.” “A lot of these are kind of carried over from the last one, which doesn’t always change like social history and stuff like that. So, it’s good just to have it in there. But it’s not giving me any new information.”

Unfamiliar Specialized Terms : Due to the sub-expert clinical experience of the interns, and the diverse medical specialties (e.g., pulmonology, cardiology) represented in the EHR documents reviewed during the TA protocol, some terms and abbreviations specific to these specialties were incomprehensible and could not be synthesized along with other relevant patient information. Although the inability to interpret these terms did not result in misdirected clinical decisions, there was likely an increased cognitive burden associated with processing these unfamiliar terms in addition to other patient-specific information. Statements that revealed the interns’ attempts at interpreting specialized terms and abbreviations include:

“So, now she’s had two weeks of diarrhea. But it’s improved with a BRAT diet. I don’t know what that is.” “Fusion of neck… fusion… neck…I don’t know what that is.”

5. Discussion

To improve the impact of EHR clinical documents on patient care, the organization and presentation of patient information should be in sync with the mental models and expectations of clinicians. Our study provides insights on the cognitive processes associated with synthesis of lengthy text-based EHR clinical documents during patient care. We utilized a think-aloud protocol to explicate the cognitive processes of six medical interns as they synthesized EHR clinical documents towards accomplishing routine clinical tasks within a simulated clinical setting. Our findings reveal that, in creating concise conceptualizations of the clinical scenarios, clinicians often synthesized information related to the concepts of problems, symptoms and treatment, thus corroborating evidence that clinicians screen and prioritize clinical information while managing information overload and redundancy as they review electronic clinical documents during patient care ( 29 , 30 ). The clinicians established mostly correlations of significance and meaning between these concepts, and these correlations informed hypotheses generation on etiology and disease processes, and decisions on the most appropriate treatment regimens. These insights also informed the construction of a common cognitive pathway for clinicians and provided a platform for content analysis of the clinicians’ cognitive processes to identify barriers to information synthesis from EHR documents.

In addition, knowledge from our research informed the development of recommendations for the design of EHR document user interfaces that can support clinicians’ information synthesis in order to reduce existing cognitive burden and generate effective action sequences while performing clinical tasks. These recommendations include:

Cues for Improved Visualization of Sections : Display and organization of information within EHR clinical document user interfaces can effectively reduce the likelihood of missing data necessary for appropriate diagnosis and treatment of clinical conditions. Knowledge from this research suggests that EHR document sections containing information related to the concepts of problem, symptom, and treatment are among the most critical to clinical reasoning and decision-making. Therefore, we recommend that software development efforts and HIT research be devoted to developing and implementing solutions that visually emphasize these sections in order to support the critical cognitive activities dependent on access to patient information related to the aforementioned concepts. Examples of possible data visualization aids include, but are not limited to, (a) distinct manipulation of fonts in sections or section headers related to problem, symptoms, and treatment; (b) line-spacing and paragraphing to better organize and distinguish these sections in the EHR document; and (c) color-coded highlighting of section headers within the EHR clinical document user interface.

Highlighting Status Changes in Patient Information : As noted above, excessive redundancy arising from ‘copying and pasting’ unchanged patient data can make it difficult to find information of interest within EHR documents, promote the propagation of data inconsistencies ( 11 ), and make the process of reviewing these documents error-prone and time-consuming. However, redundant information can contribute to creating a contextual framework of clinical scenarios represented by narratives within the EHR. Therefore, to minimize the difficulty in navigation associated with duplication of clinical information and to leverage the contextual benefits of access to patient information, we recommend the implementation of methods to distinguish the most recent changes in patient information within the EHR clinical document as compared to details provided during a previous clinician-patient encounter. One of these methods involves highlighting these changes such that inductive cues are provided to aid clinicians in tracking and interpreting changes in the patient’s healthcare status over time. Further research in natural language processing (NLP) may be necessary to develop applications that identify these changes and possibly extract them for effective disease risk and patient outcome assessment. We are currently evaluating a prototype visualization tool to test the effect of such highlighting on clinicians’ use of clinical notes.
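As a rough illustration of the highlighting idea, the sketch below compares two consecutive notes line by line with Python's standard `difflib` module and marks the lines that are new or changed in the later note. This is not the prototype mentioned above, and a line-level diff is far cruder than the NLP methods the recommendation anticipates. The function names and the two example notes are hypothetical.

```python
from difflib import SequenceMatcher


def changed_line_indices(previous_lines, current_lines):
    """Indices of lines in the current note that are new or altered."""
    matcher = SequenceMatcher(None, previous_lines, current_lines, autojunk=False)
    indices = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" cover text that was not carried over unchanged.
        if tag in ("replace", "insert"):
            indices.extend(range(j1, j2))
    return indices


def highlight_changes(previous_lines, current_lines, marker=">> "):
    """Prefix new or changed lines with a marker; pad unchanged lines."""
    changed = set(changed_line_indices(previous_lines, current_lines))
    padding = " " * len(marker)
    return [
        (marker if i in changed else padding) + line
        for i, line in enumerate(current_lines)
    ]


# Hypothetical notes from two consecutive visits.
previous = [
    "Social history: non-smoker.",
    "Medications: lisinopril 10 mg daily.",
    "Assessment: hypertension, controlled.",
]
current = [
    "Social history: non-smoker.",
    "Medications: lisinopril 20 mg daily.",
    "Assessment: hypertension, controlled.",
    "Plan: recheck blood pressure in 4 weeks.",
]

for line in highlight_changes(previous, current):
    print(line)
```

Unchanged lines stay visible, so the carried-over context that the interns found useful is preserved while the changed dose and the new plan stand out.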

Glossary or Infolinks to Specialized Terms : Because clinical expertise spans a continuum and many clinical specialties use distinct nomenclature, clinicians may need ready access to decision support tools that aid the interpretation of terms and abbreviations commonly encountered while synthesizing specialty-specific documentation of patient care ( 28 ). Therefore, we recommend the development of customizable electronic glossaries of specialty-specific terms and abbreviations that can be edited by local and/or national clinical specialty organizations. Implementation of text-based infolinks to these glossaries within the EHR clinical document user interface can facilitate interpretation and synthesis of specialized terms at the point of care.
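A glossary lookup of this kind can be sketched in a few lines. The example below annotates glossary terms in a note with a bracketed expansion, which stands in for an infolink in a real user interface. The function name is hypothetical, and the single glossary entry is an illustrative expansion of an abbreviation that one intern could not interpret; it is not taken from any glossary used in the study.

```python
import re


def annotate_terms(text, glossary):
    """Append a bracketed expansion after each glossary term found in text."""
    if not glossary:
        return text
    # Try longer terms first so a multi-word entry wins over a shorter one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: f"{m.group(1)} [{glossary[m.group(1)]}]", text)


# Illustrative entry; matching is case-sensitive so "BRAT" does not match "brat".
glossary = {"BRAT": "bananas, rice, applesauce, toast"}
note = "But it's improved with a BRAT diet."

annotated = annotate_terms(note, glossary)
print(annotated)
```

Passing the glossary in as a plain dictionary keeps it editable outside the code, in line with the recommendation that specialty organizations maintain the entries.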

Limitations in this study include our sampling of clinicians with expertise at the intern level only, which meant we did not explore the potential influence of differences in clinical expertise and specialties on cognitive processes employed in synthesizing EHR clinical documents. Since only medical interns at the University of Minnesota participated in this study, our findings may not adequately reflect the cognitive processes or barriers experienced by other inter-disciplinary healthcare providers (e.g. nurses, pharmacists) and interns in other institutions as they routinely utilize EHR documents in caring for patients. Verbal protocols obtained while the interns synthesized electronic clinical text during the TA experiments were not controlled for quantity of speech or for the possibility of additional cognitive processes directly related to ‘speaking one’s thoughts’ during task performance. Also, the design of the prototype EHR system used in the study may have influenced the cognitive strategies employed while synthesizing the EHR documents. Therefore, further studies are required to validate the observed cognitive activities as clinicians review electronic text within current vendor-based EHR applications. Finally, because the research was conducted in a simulated ambulatory setting using hypothetical clinical scenarios, the TA experiments were devoid of the workflow interruptions and direct clinician-patient interaction (e.g. during physical examination) that are typical in realistic clinical settings. Therefore, the results of this study will need validation in the “in situ” clinical environment and among other groups of providers.

6. Conclusion

A scientific approach towards improving clinicians’ synthesis of text-based EHR clinical documents during patient care requires studying clinicians’ cognitive processes while performing routine clinical tasks using these documents. This work supports and informs the design of future EHR clinical document user interfaces. Qualitative methodologies utilized in this study were effective at revealing a range of cognitive processes and barriers associated with EHR document synthesis and helped to highlight how these processes can inform the design of EHR clinical document user interfaces. Given the limitations in our study, directions for future research include the design of appropriate validation studies to analyze cognitive processes associated with the synthesis of EHR clinical documents by physicians and other healthcare providers in various specialties and at different levels of clinical expertise within realistic patient care settings.

7. Acknowledgement

This study was supported by the University of Minnesota Institute for Health Informatics Research Support Grant. We would like to thank Fairview Health Services and the University of Minnesota interns participating in the study.


  15. Challenges in and Opportunities for Electronic Health Record-Based Data

    Here, we aim to provide a critical understanding of the types of data available in EHRs and describe the impact of data heterogeneity, quality, and generalizability, which should be evaluated prior to and during the analysis of EHR data. We also identify challenges pertaining to data quality, including errors and biases, and examine potential ...

  16. Clinical Research Informatics and Electronic Health Record Data

    Introduction. The use of data derived from electronic health records (EHRs) for research and discovery is a growing area of investigation in clinical research informatics (CRI), defined as the intersection of research and biomedical informatics [].CRI has matured in recent years to be a prominent and active informatics sub-discipline [1, 2].CRI develops tools and methods to support researchers ...

  17. What Role Does the EHR Play in Clinical Informatics?

    EHR data can aid clinical informatics research through streamlined clinical trial recruitment, public health surveillance, and health IT analytics. Healthcare data are the "driving force" of clinical informatics, according to the American Medical Informatics Association ( AMIA ). With EHR systems serving as central repositories for healthcare ...

  18. EHR-Safe: Generating high-fidelity and privacy ...

    Analysis of Electronic Health Records has a tremendous potential for enhancing patient care, quantitatively measuring performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes ...

  19. Overcoming the EHR-to-EDC Challenge in Clinical Trials

    Clinical researchers have long sought to repurpose EHR data at scale to support clinical research. As explained in our new white paper Solving the EHR-to-EDC Challenge: A Scalable-first Approach, multiple hurdles have hindered progress toward a truly scalable solution, including poor interoperability between EHRs and other systems, data quality ...

  20. Data analysis guidelines for single-cell RNA-seq in biomedical studies

    The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for ...

  21. eClinical innovations streamline clinical trial processes: Insights

    She emphasized the importance of embracing these data sources effectively to derive meaningful insights and drive decision-making. "The clinical trial landscape is evolving rapidly, with advancements in digital health technologies and the availability of real-world data," Kaminski noted. "To stay ahead, organizations must adapt their ...

  22. Network analysis of unstructured EHR data for clinical research

    In biomedical research, network analysis provides a conceptual framework for interpreting data from high-throughput experiments. For example, protein-protein interaction networks have been successfully used to identify candidate disease genes. ... Network analysis of unstructured EHR data for clinical research AMIA Jt Summits Transl Sci Proc ...

  23. Impact of frailty on the outcomes of patients undergoing degenerative

    Background Degenerative spinal diseases are common in older adults with concurrent frailty. Preoperative frailty is a strong predictor of adverse clinical outcomes after surgery. This study aimed to investigate the association between health-related outcomes and frailty in patients undergoing spine surgery for degenerative spine diseases. Methods A systematic review and meta-analysis were ...

  24. Clinical Data Analytics Solutions Market Projected to Reach USD 7.5

    The global clinical data analytics solutions market size is anticipated to reach USD 7.5 billion by 2030, expanding at a notable CAGR of 6.8% from 2024 to 2030. Major factors driving the market ...

  25. Software

    Deliver reliable test results with greater confidence and ease. Our portfolio of software applications aid in the analysis and evaluation of data generated with our MICA, KIR, HLA typing, and antibody detection tests. Assisting with the assignment of typing and antibody detection results, our software further increases testing efficiency.

  26. Chapter 4 Obtaining Data From Electronic Health Records

    There is growing interest in using data captured in electronic health records (EHRs) for patient registries. Both EHRs and patient registries capture and use patient-level clinical information, but conceptually, they are designed for different purposes. A patient registry is defined as "an organized system that uses observational study methods to collect uniform data (clinical and other) to ...

  27. Lead EMR Analyst

    Apply for the Job in Lead EMR Analyst - Epic Research at Cincinnati, OH. View the job description, responsibilities and qualifications for this position. Research salary, company info, career paths, and top skills for Lead EMR Analyst - Epic Research

  28. Examination of the Effects of Cannabidiol on Menstrual-Related Symptoms

    This study has been presented at the 32nd Annual International Cannabinoid Research Society Symposium on the Cannabinoids in 2022.Morgan L. Ferretti and Jessica G. Irons had full access to all of the data in the study and took responsibility for the integrity of the data and the accuracy of its presentation.Morgan L. Ferretti played a lead role ...

  29. A Qualitative Analysis of EHR Clinical Document Synthesis by Clinicians

    2.1. Information Overload within EHR Systems . In clinical practice, complex data processing remains an integral aspect of problem-solving strategies utilized by experts, sub-experts and novices alike ()().Timely access to patient information relevant to routine and emergency clinical processes determines the clinician's familiarity with clinical concepts and the context of clinical situations.