U.S. Food and Drug Administration
iMRMC: Software to do Multi-reader Multi-case Statistical Analysis of Reader Studies

Catalog of Regulatory Science Tools to Help Assess New Medical Devices 

Technical Description

The primary objective of the iMRMC statistical software is to assist investigators with analyzing and sizing multi-reader multi-case (MRMC) reader studies that compare the difference in the area under Receiver Operating Characteristic curves (AUCs) from two modalities. The iMRMC application is a software package that includes simulation tools to characterize bias and variance of the MRMC variance estimates.

The core elements of this application include the ability to perform MRMC variance analysis and the ability to size an MRMC trial.

  • The core iMRMC application is a stand-alone, precompiled, license-free Java application distributed with its source code. It can be used in GUI mode or from the command line.
  • There is also an R package that utilizes the core Java application. Examples for using the programs can be found in the R help files and in the sketch after this list.
  • The GitHub package additionally includes an example that guides users through performing a noninferiority study with the iMRMC R package.
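A minimal sketch of the simulate-then-analyze workflow in R, using function names from the package's help index (reproduced later in this document); exact arguments and the structure of the returned objects may differ by package version:

    library(iMRMC)

    # Create a default configuration for the Roe & Metz simulation
    # model (two modalities; readers and cases as random effects).
    config <- sim.gRoeMetz.config()

    # Simulate an MRMC ROC data set in the format doIMRMC expects.
    dFrame <- sim.gRoeMetz(config)

    # MRMC analysis of the reader-averaged AUC: variance estimates,
    # confidence intervals, and p-values.
    result <- doIMRMC(dFrame)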

The software also handles arbitrary study designs, including those that are not "fully crossed."

Intended Purpose

The iMRMC package analyzes data from multi-reader multi-case (MRMC) studies, which are often imaging studies in which clinicians (readers) evaluate patient images (cases); the MRMC methods apply to any scenario in which clinicians interpret data to make clinical decisions. The iMRMC package calculates the reader-averaged area under the receiver operating characteristic curve (ROC AUC), a summary measure of diagnostic performance. Additional functions analyze other endpoints (binary performance and score differences). The package also estimates variances, confidence intervals, and p-values. These uncertainty characteristics are needed for the hypothesis tests used to size studies and assess the efficacy of diagnostic imaging devices and computer aids (artificial intelligence).
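For the fully crossed case, the reader-averaged empirical AUC that the package estimates can be written as follows (a sketch of the U-statistic formulation described in the publications cited below):

    \hat{A} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{N_0 N_1}
              \sum_{i=1}^{N_0} \sum_{j=1}^{N_1} s\left(x_{ri}, y_{rj}\right),
    \qquad
    s(x, y) =
      \begin{cases}
        1   & \text{if } y > x \\
        1/2 & \text{if } y = x \\
        0   & \text{if } y < x
      \end{cases}

where R is the number of readers, N_0 and N_1 are the numbers of non-diseased and diseased cases, and x_{ri} and y_{rj} are the scores reader r assigns to non-diseased case i and diseased case j.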

This analysis is important because many imaging studies are designed so that every reader reads every case in all modalities, a fully-crossed study. In that case the data are cross-correlated, and the readers and cases are treated as cross-correlated random effects. An MRMC analysis accounts for the variability and correlations from the readers and cases when estimating variances, confidence intervals, and p-values. The functions in this package can treat arbitrary study designs and studies with missing data, not just fully-crossed designs.
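As an illustrative sketch of a non-fully-crossed analysis in R (the truth-row labels and the use of doIMRMC on the filtered data frame are assumptions based on the package documentation; identifiers in your data may differ):

    library(iMRMC)

    # Start from a fully crossed simulated data set, then drop some
    # reader-case pairings to mimic a split-plot / missing-data design.
    dFull <- sim.gRoeMetz(sim.gRoeMetz.config())

    # Keep truth rows intact; "truth" and "-1" are assumed labels.
    readers <- setdiff(unique(dFull$readerID), c("truth", "-1"))
    cases <- unique(dFull$caseID)
    drop <- dFull$readerID %in% readers[1:2] & dFull$caseID %in% cases[1:10]
    dSplit <- dFull[!drop, ]

    # The same analysis call handles the incomplete design.
    result <- doIMRMC(dSplit)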

The methods in the iMRMC package are not standard; the package permits industry statisticians to use a validated statistical analysis method without having to develop and validate it themselves.

Related FDA Product Codes

The FDA product codes this tool is applicable to include, but are not limited to:

  • KPS: System, Tomography, Computed, Emission
  • LLZ: System, Image Processing, Radiological
  • PAA: Automated Breast Ultrasound
  • POK: Computer-Assisted Diagnostic Software For Lesions Suspicious For Cancer
  • QDQ: Radiological Computer Assisted Detection/Diagnosis Software For Lesions Suspicious For Cancer
  • QPN: Software Algorithm Device To Assist Users In Digital Pathology
  • QNP: Gastrointestinal lesion software detection system

The tool has been characterized through simulations (bias and variance of the estimates) and has been compared with other methods as appropriate for the task.

The following peer-reviewed research includes the detailed verification methods and results:

  • A study that applies the software, along with related research methods and study designs, in a large study; supplementary materials include the data and scripts needed to reproduce the results.
  • The original description of the method and its validation with simulations; results are comparable to the jackknife resampling technique.
  • A generalization of the method to binary performance measures.
  • A framework for understanding the method and comparing it to other methods, analytically and with simulations.
  • Gallas, B.D., & Brown, D.G. (2008). Reader studies for validation of CAD systems. Neural Networks (Special Conference Issue), 21(2), 387–397. https://doi.org/10.1016/j.neunet.2007.12.013

Limitations

Currently, the tool can produce negative variance estimates if the relevant dataset is small.

Supporting Documentation

Tool websites:

  • Primary: https://github.com/DIDSR/iMRMC
  • Secondary: https://cran.r-project.org/web/packages/iMRMC/index.html

User manual for the Java application

  • http://didsr.github.io/iMRMC/000_iMRMC/userManualPDF/iMRMCuserManual.pdf

User manual and FAQ for the R package

  • https://cran.r-project.org/web/packages/iMRMC/iMRMC.pdf
  • https://github.com/DIDSR/iMRMC/wiki/iMRMC-FAQ

Supplementary materials

  • Data and scripts to reproduce results for manuscripts that use iMRMC
  • https://github.com/DIDSR/iMRMC/wiki/iMRMC-Datasets

Related Work

  • Chen, W., Gong, Q., & Gallas, B.D. (2018). Paired split-plot designs of multireader multicase studies. Journal of Medical Imaging, 5, 031410. https://doi.org/10.1117/1.JMI.5.3.031410
  • Obuchowski, N.A., Gallas, B.D., & Hillis, S.L. (2012). Multi-reader ROC studies with split-plot designs: A comparison of statistical methods. Academic Radiology, 19(12), 1508–1517. https://doi.org/10.1016/j.acra.2012.09.012
  • Gallas, B.D., Chan, H.-P., D’Orsi, C.J., Dodd, L.E., Giger, M.L., Gur, D., Krupinski, E.A., Metz, C.E., Myers, K.J., Obuchowski, N.A., Sahiner, B., Toledano, A.Y., & Zuley, M.L. (2012). Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Academic Radiology, 19, 463–477. https://doi.org/10.1016/j.acra.2011.12.016
  • Gallas, B.D., & Hillis, S.L. (2014). Generalized Roe and Metz ROC model: Analytic link between simulated decision scores and empirical AUC variances and covariances. Journal of Medical Imaging, 1(3), 031006. https://doi.org/10.1117/1.JMI.1.3.031006

[email protected]

Tool Reference 

In addition to citing relevant publications, please reference the use of this tool using DOI: 10.5281/zenodo.8383591

For more information

  • Catalog of Regulatory Science Tools to Help Assess New Medical Devices


Multi-Reader Multi-Case Studies Using the Area under the Receiver Operator Characteristic Curve as a Measure of Diagnostic Accuracy: Systematic Review with a Focus on Quality of Data Reporting

  • Thaworn Dendumrongsup,
  • Andrew A. Plumb,
  • Steve Halligan,
  • Thomas R. Fanshawe,
  • Douglas G. Altman,
  • Susan Mallett

Affiliations: Department of Radiology, Prince of Songkla University, Hat Yai, Thailand; Centre for Medical Imaging, University College London, London, United Kingdom; Nuffield Department of Primary Care Health Sciences, Oxford University, Oxford, United Kingdom; Centre for Statistics in Medicine, Wolfson College, Oxford University, Oxford, United Kingdom

* E-mail: [email protected]

PLOS ONE

  • Published: December 26, 2014
  • https://doi.org/10.1371/journal.pone.0116018

Introduction

We examined the design, analysis and reporting of multi-reader multi-case (MRMC) research studies using the area under the receiver operating characteristic curve (ROC AUC) as a measure of diagnostic performance.

We performed a systematic literature review from 2005 to 2013 inclusive to identify a minimum of 50 studies. Articles of diagnostic test accuracy in humans were identified via their citation of key methodological articles dealing with MRMC ROC AUC. Two researchers in consensus then extracted information from the primary articles relating to study characteristics and design, methods for reporting study outcomes, model fitting, model assumptions, presentation of results, and interpretation of findings. Results were summarized and presented with a descriptive analysis.

Sixty-four full papers were retrieved from 475 identified citations, and ultimately 49 articles describing 51 studies were reviewed and extracted. Radiological imaging was the index test in all. Most studies focused on lesion detection rather than characterization, and most used fewer than 10 readers. Only 6 (12%) studies trained readers in advance to use the confidence scale used to build the ROC curve. Overall, description of confidence scores, the ROC curve and its analysis was often incomplete. For example, 21 (41%) studies presented no ROC curve and only 3 (6%) described the distribution of confidence scores. Of 30 studies presenting curves, only 4 (13%) presented the data points underlying the curve, thereby allowing assessment of extrapolation. The mean change in AUC was 0.05 (range −0.05 to 0.28). Non-significant changes in AUC were attributed to underpowering rather than to the diagnostic test failing to improve diagnostic accuracy.

Conclusions

Data reporting in MRMC studies using ROC AUC as an outcome measure is frequently incomplete, hampering understanding of methods and the reliability of results and study conclusions. Authors using this analysis should be encouraged to provide a full description of their methods and results.

Citation: Dendumrongsup T, Plumb AA, Halligan S, Fanshawe TR, Altman DG, Mallett S (2014) Multi-Reader Multi-Case Studies Using the Area under the Receiver Operator Characteristic Curve as a Measure of Diagnostic Accuracy: Systematic Review with a Focus on Quality of Data Reporting. PLoS ONE 9(12): e116018. https://doi.org/10.1371/journal.pone.0116018

Editor: Delphine Sophie Courvoisier, University of Geneva, Switzerland

Received: September 23, 2014; Accepted: December 2, 2014; Published: December 26, 2014

Copyright: © 2014 Dendumrongsup et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files.

Funding: This work was supported by the UK National Institute for Health Research (NIHR) under its Programme Grants for Applied Research funding scheme (RP-PG-0407-10338). The funder had no role in the design, execution, analysis, reporting, or decision to submit for publication.

Competing interests: The authors have declared that no competing interests exist.

The receiver operator characteristic (ROC) curve describes a plot of sensitivity versus 1-specificity for a diagnostic test, across the whole range of possible diagnostic thresholds [1]. The area under the ROC curve (ROC AUC) is a well-recognised single measure that combines elements of both sensitivity and specificity, sometimes replacing these two measures. ROC AUC is often used to describe the diagnostic performance of radiological tests, either to compare the performance of different tests or the same test under different circumstances [2], [3]. Radiological tests must be interpreted by human observers, and a common study design uses multiple readers to interpret multiple image cases: the multi-reader multi-case (MRMC) design [4]. The MRMC design is popular because once a radiologist has viewed 20 cases there is less information to be gained by asking them to view a further 20 than by asking a different radiologist to view the same 20. This procedure enhances the generalisability of study results, and having multiple readers interpret multiple cases enhances statistical power. Because multiple radiologists view the same cases, “clustering” occurs. For example, small lesions are generally seen less frequently than larger lesions, i.e. reader observations are clustered within cases. Similarly, more experienced readers are likely to perform better across a series of cases than less experienced readers, i.e. results are correlated within readers. Bootstrap resampling and multilevel modeling can account for clustering, linking results from the same observers and cases, so that 95% confidence intervals are not too narrow. MRMC studies using ROC AUC as the primary outcome are often required by regulatory bodies for the licensing of new radiological devices [5].
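To make the ROC AUC concrete: given confidence scores, the empirical AUC equals the Wilcoxon-Mann-Whitney probability that a randomly chosen diseased case scores higher than a randomly chosen non-diseased case, with ties counting one half. A base-R sketch with hypothetical scores:

    # Empirical (trapezoidal/Wilcoxon) AUC from confidence scores.
    empiricalAUC <- function(neg, pos) {
      mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
    }

    neg <- c(1, 2, 2, 3, 4)  # hypothetical scores, disease-negative cases
    pos <- c(3, 4, 4, 5, 5)  # hypothetical scores, disease-positive cases
    empiricalAUC(neg, pos)   # 0.9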

We attempted to use ROC AUC as the primary outcome measure in a prior MRMC study of computer-assisted detection (CAD) for CT colonography [6] . However, we encountered several difficulties when trying to implement this approach, described in detail elsewhere [7] . Many of these difficulties were related to issues implementing confidence scores in a transparent and reliable fashion, which led ultimately to a flawed analysis. We considered, therefore, that for ROC AUC to be a valid measure there are methodological components that need addressing in study design, data collection and analysis, and interpretation. Based on our attempts to implement the MRMC ROC AUC analysis, we were interested in whether other researchers have encountered similar hurdles and, if so, how these issues were tackled.

In order to investigate how often other studies have addressed and reported on methodological issues with implementing ROC AUC, we performed a systematic review of MRMC studies using ROC AUC as an outcome measure. We searched and investigated the available literature with the objectives of describing the statistical methods used and the completeness of data presentation, and of investigating whether any problems with analysis were encountered and reported.

Ethics statement

Ethical approval is not required by our institutions for research studies of published data.

Search strategy, inclusion and exclusion criteria

This systematic review was performed guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), an evidence-based minimum set of items for reporting in systematic reviews and meta-analyses [8]. We developed an extraction sheet for the systematic review, broken down into different sections (used as subheadings for the Results section of this report), with notes relating to each individual item extracted (S1 File). In consensus, we considered that approximately 50 articles would provide a sufficiently representative overview of current reporting practice. Based on our prior experience of performing systematic reviews, we believed that searching for additional articles beyond 50 would be unlikely to yield valuable additional data (i.e. we believed we would reach “saturation” by 50 articles) yet would present a very considerable extraction burden.

In order to achieve this, potentially eligible primary articles published between 2005 and February 2013 inclusive were identified by a radiologist researcher (TD) using PubMed, via their citation of one or more of 8 key methodological articles relating to MRMC ROC AUC analysis [9]–[16]. To achieve this, the authors' names (combined using “AND”) were entered in the PubMed search field and the specific article identified and clicked in the results list. The abstract was then accessed and the “Cited By # PubMed Central Articles” link and “Related Citations” link used to identify those articles in the PubMed Central database that had cited the original article. There was no language restriction. Online abstracts were examined in reverse chronological order, the full text of potentially eligible papers then retrieved, and selection stopped once the threshold of 50 studies fulfilling inclusion criteria had been passed.

To be eligible, primary studies had to be diagnostic test accuracy studies of human observers interpreting medical image data from real patients, attempting to use an MRMC ROC AUC analysis as a study outcome based on the cited methodological approaches [9]–[16]. Reviews, solely methodological papers, and those using simulated imaging data were excluded.

Data extraction

An initial pilot sample of 5 full-paper articles was extracted and the data checked by a subgroup of investigators in consensus, both to confirm the process was feasible and to identify potential problems. These papers were extracted by TD using the search strategy described in the previous section. A further 10 full papers were extracted by two radiologist researchers, again using the same search strategy and working independently (TD, AP), to check agreement further. The remaining articles included in the review were extracted predominantly by TD, who discussed any concerns/uncertainty with AP. Any disagreement following their discussion was arbitrated by SH and/or SM where necessary. These discussions took place during two meetings when the authors met to discuss progress of the review; multiple papers and issues were discussed on each occasion.

The extraction covered the following broad topics: Study characteristics, methods to record study outcomes, model assumptions, model fitting, data presentation ( S1 File ).

We extracted data relating to the organ and disease studied, the nature of the diagnostic task (e.g. characterization vs. localization vs. presence/absence), test methods, patient source and characteristics, study design (e.g. prospective/retrospective, secondary analysis, single/multicenter) and reference standard. We extracted the number of readers, their prior experience, specific interpretation training for the study (e.g. use of CAD software), blinding to clinical data and/or reference results, the number of times they read each case and the presence of any washout period to diminish recall bias, case ordering, and whether all readers read all cases (i.e. a fully-crossed design). We extracted the unit of analysis (e.g. patient vs. organ vs. segment), and sample size for patients with and without pathology.

We noted whether study imaging reflected normal daily clinical practice or was modified for study purposes (e.g. restricted to limited images). We noted the confidence scores used for the ROC curve and their scale, and whether training was provided for scoring. We noted if there were multiple lesions per unit of analysis. We noted if scoring differed for positive and negative patient cases, whether score distribution was reported, and whether transformation to a normal distribution was performed.

We extracted whether ROC curves were presented in the published article and, if so, whether they were shown for individual readers, whether the curve was smoothed, and whether the underlying data points were shown. We defined unreasonable extrapolation as an absence of data in the right-hand 25% of the plot space. We noted the method for curve fitting and whether any problems with fitting were reported, and the method used to compare AUC or pAUC. We extracted the primary outcome, the accuracy measures reported, and whether these were overall or for individual readers. We noted the size of any change in AUC, whether this was significant, and made a subjective assessment of whether significance could be attributed to a single reader or case. We noted how the study authors interpreted any change in AUC, and whether any change was reported in terms of its effect on individual patients. We also noted if a ROC researcher was named as an author or acknowledged, defined as an individual who had published indexed research papers dealing with ROC methodology.

Data were summarized in an Excel worksheet (Excel For Mac 14.3.9, Microsoft Corporation) with additional cells for explanatory free text. A radiologist researcher (SH) then compiled the data and extracted frequencies, consulting the two radiologists who performed the extraction for clarification when necessary. The investigator group discussed the implication of the data subsequently, to guide interpretation.

Four hundred and seventy-five citations of the 8 key methodological papers were identified and 64 full papers subsequently retrieved. Fifteen [17]–[31] of these were rejected after reading the full text (the papers and reasons for rejection are shown in Table 1), leaving 49 [32]–[80] for extraction and analysis, published between 2010 and 2012 inclusive; these are detailed in Table 1. Two papers [61], [75] contributed two separate studies each, meaning that 51 studies were extracted in total. The PRISMA checklist [8] is detailed in Fig. 1. The raw extracted data are available in S2 File.

Fig. 1: PRISMA checklist. https://doi.org/10.1371/journal.pone.0116018.g001

Table 1: Details of included studies and papers rejected after full-text review. https://doi.org/10.1371/journal.pone.0116018.t001

Study characteristics

The index test was imaging in all studies. Breast was the commonest organ studied (20 studies), followed by lung (11 studies) and brain (7 studies). Mammography (15 studies) was the commonest individual modality investigated, followed by plain film (12 studies), CT and MRI (11 studies each), tomosynthesis (6 studies), ultrasound (2 studies) and PET (1 study); 9 studies investigated multiple modalities. In most studies (28) the prime interpretation task was lesion detection; 11 studies focused on lesion characterization and 12 combined detection and characterization. Forty-one studies compared 2 tests/conditions (i.e. a single test used in different ways) to a reference standard, while 2 studies compared 1 test/condition, 7 compared 3 tests/conditions, and 1 compared 4 tests/conditions. Twenty-five studies combined data to create a reference standard, while the reference was a single finding in 24 (14 imaging, 5 histology, 5 other, e.g. endoscopy). The reference method was unclear in 2 studies [54], [55].

Twenty-four studies were single center, 12 multicenter, with the number of centers unclear in 15 (29%) studies. Nine studies recruited symptomatic patients, 8 asymptomatic, and 7 a combination, but the majority (53%; 27 studies) did not state whether patients were symptomatic or not. 42 (82%) studies described the origin of patients with half of these stating a precise geographical region or hospital name. However, 9 (18%) studies did not sufficiently describe the source of patients and 21 (41%) did not describe patients' age and/or gender distribution.

Study design

Extracted data relating to study design and readers are presented graphically in Fig. 2 . Most studies (29; 57%) used patient data collected retrospectively. Fourteen (28%) were prospective while 2 used an existing database. Whether prospective/retrospective data was used was unstated/unclear in a further 6 (12%). While 13 studies (26%) used cases unselected other than for the disease in question, the majority (34; 67%) applied further criteria, for example to preselect “difficult” cases (11 studies), or to enrich disease prevalence (4 studies). How this selection bias was applied was stated explicitly in 18 (53%) of these 34. Whether selection bias was used was unclear in 4 studies.

Fig. 2: Study design and reader characteristics. https://doi.org/10.1371/journal.pone.0116018.g002

The number of readers per study ranged from 2 [56] to 258 [76]; the mean was 13 and the median 6. The large majority of studies (35; 69%) used fewer than 10 readers. Reader experience was described in 40 (78%) studies but not in 11. Specific reader training for image interpretation was described in 31 (61%) studies. Readers were not trained specifically in 14 studies, and in 6 it was unclear whether readers were trained specifically or not. Readers were blind to clinical information for individual patients in 37 (73%) studies, unblind in 3, and this information was unrecorded or uncertain in 11 (22%). Readers were blind to the prevalence of disease in the dataset in 21 (41%) studies, unblind in 2, and this information was unrecorded or uncertain in the majority (28; 55%).

Observers read the same patient case on more than one occasion in 50 studies; this information was unclear in the single further study [70] . A fully crossed design (i.e. all readers read all patients with all modalities) was used in 47 (92%) studies, but not stated explicitly in 23 of these. A single study [72] did not use a fully crossed design and the design was unclear or unrecorded in 3 [34] , [70] , [76] . Case ordering was randomised (either a different random order across all readers or a different random order for each individual reader) between consecutive readings in 31 (61%) studies, unchanged in 6, and unclear/unrecorded in 14 (27%). The ordering of the index test being compared varied between consecutive readings in 20 (39%) studies, was unchanged in 17 (33%), and was unclear/unrecorded in 14 (27%). 26 (51%) studies employed a time interval between readings that ranged from 3 hours [50] to 2 months [63] , with a median of 4 weeks. There was no interval (i.e. reading of cases in all conditions occurred at the same sitting) in 17 (33%) studies, and time interval was unclear/unrecorded in 8 (16%).

Methods of reporting study outcomes

The unit of analysis for the ROC AUC analysis was the patient in 23 (45%) studies, an organ in 5, an organ segment in 5, a lesion in 11 (22%), other in 2, and unclear or unrecorded in 6 (12%); one study [34] examined both organ and lesion so there were 52 extractions for this item. Analysis was based on multiple images in 33 (65%) studies, a single image in 16 (31%), multiple modalities in a single study [40] , and unclear in a single study [57] ; no study used videos.

The number of disease positive patients per study ranged between 10 [79] and 100 [53] (mean 42, median 48) in 46 studies, and was unclear/unrecorded in 5 studies. The number of disease positive units of outcome for the primary ROC AUC analysis ranged between 10 [79] and 240 [41] (mean 59, median 50) in 43 studies, and was unclear/unrecorded in 8 studies. The number of disease negative patients per study ranged between 3 [69] and 352 [34] (mean 66, median 38) in 44 studies, was zero in 1 study [80] , and was unclear/unrecorded in 6 studies. The number of disease negative units of analysis for the primary outcome for the ROC AUC analysis ranged between 10 [51] and 535 [39] (mean 99, median 68) in 42 studies, and was unclear/unrecorded in the remaining 9 studies. The large majority of studies (41, 80%) presented readers with an image or set of images reflecting normal clinical practice whereas 10 presented specific lesions or regions of interest to readers.

Calculation of ROC AUC requires the use of confidence scores, where readers rate their confidence in the presence of a lesion or its characterization. In our previous study [6] we identified the assignment of confidence scores to be potentially on separate scales for disease positive and negative cases [7] . For rating scores used to calculate ROC AUC, 25 (49%) studies used a relatively small number of categories (defined as up to 10) and 25 (49%) used larger scales or a continuous measurement (e.g. visual analogue scale). One study did not specify the scale used [76] . Only 6 (12%) studies stated explicitly that readers were trained in advance to use the scoring system, for example being encouraged to use the full range available. In 15 (29%) studies there was the potential for multiple abnormalities in each unit of analysis (stated explicitly by 12 of these). This situation was dealt with by asking readers to assess the most advanced or largest lesion (e.g. [43] ), by an analysis using the highest score attributed (e.g. [42] ), or by adopting a per-lesion analysis (e.g. [52] ). For 23 studies only a single abnormality per unit of analysis was possible, whereas this issue was unclear in 13 studies.
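For example, the “highest score attributed” approach collapses lesion-level ratings to a single rating per unit of analysis before the ROC analysis; a base-R sketch with hypothetical data:

    # Hypothetical lesion-level scores reduced to one score per
    # patient by taking the maximum ("highest score" rule).
    lesionScores <- data.frame(
      patientID = c("p1", "p1", "p2", "p3", "p3"),
      score = c(2, 4, 3, 1, 5)
    )
    patientScores <- aggregate(score ~ patientID, lesionScores, max)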

Model assumptions

The majority of studies (41, 80%) asked readers to ascribe the same scoring system to both disease-positive and disease-negative patients. Another 9 studies asked that different scoring systems be used, depending on whether the case was perceived as positive or negative (e.g. [61] ), or depending on the nature of the lesion perceived (e.g. [66] ). Scoring was unclear in a single study [76] . No study stated that two types of true-negative classifications were possible (i.e. where a lesion was seen but misclassified vs. not being seen at all), a situation that potentially applied to 22 (43%) of the 51 studies. Another concern occurs when more than one observation for each patient is included in the analysis, violating the assumption that data are independent. This could occur if multiple diseased segments were analysed for each patient without using a statistical method that treats these as clustered data. An even more flawed approach occurs when analysis includes one segment for patients without disease but multiple segments for patients with disease.

When the publicly available DBM MRMC software [81] is used for ROC AUC modeling, it requires assumptions of normality for confidence scores or their transformations if the standard parametric ROC curve fitting methods are used. When scores are not normally distributed, even if non-parametric approaches are used to estimate ROC AUC, this lack of normality may indicate additional problems with obtaining reliable estimates of ROC AUC [82]–[86]. While 17 studies stated explicitly that the data fulfilled the assumptions necessary for modeling, none described whether confidence scores were transformed to a normal distribution for analysis. Indeed, only 3 studies [54], [73], [76] described the distribution of confidence scores, which was non-normal in each case.
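A quick base-R sketch of the kind of distributional check the review finds lacking (the score vector is hypothetical; this illustrates the check, not the DBM MRMC software itself):

    scores <- c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5)  # one reader's ratings

    hist(scores, breaks = 5)        # inspect the shape directly
    qqnorm(scores); qqline(scores)  # visual check against normality
    shapiro.test(scores)            # formal test, suited to small samples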

Model fitting

Thirty (59%) studies presented ROC curves based on confidence scores; i.e. 21 (41%) studies showed no ROC curve. Of the 30 with curves, only 5 presented a curve for each reader whereas 24 presented curves averaged over all readers; a further study presented both. Of the 30 studies presenting ROC curves, 26 (87%) showed only smoothed curves, with the data points underlying the ROC curve presented in only 4 (13%) [43] , [51] , [63] , [78] . Thus, a ROC curve with underlying data points was presented in only 4 of 51 (8%) studies overall. The degree of extrapolation is critical in understanding the reliability of the ROC AUC result [7] . However, extrapolation could only be assessed in these four articles, with unreasonable extrapolation, by our definition, occurring in two [43] , [63] .

The majority of studies (31; 61%) did not specify the method used for curve fitting. Of the 20 that did, 7 used non-parametric methods (trapezoidal/Wilcoxon), 8 used parametric methods (7 of which used PROPROC), 3 used other methods, and 2 used a combination. Previous research [7], [84] has demonstrated considerable problems fitting ROC curves due to degenerate data, where the fitted ROC curve corresponds to vertical and horizontal lines, e.g. where there are no false-positive data. Only 2 articles described problems with curve fitting [55], [61]. Two studies stated that data were degenerate: Subhas and co-workers [66] stated that “data were not well dispersed over the five confidence level scores”, and Moin and co-workers [53] stated that, “If we were to recode categories 1 and 2, and discard BI-RADS 0 in the ROC analysis, it would yield degenerative results because the total number of cases collected would not be adequate”. While all studies used MRMC AUC methods to compare AUC outcomes, 5 studies also used other methods (e.g. t-testing) [37], [52], [60], [67], [77]. Only 3 studies described using a partial AUC [42], [55], [77]. Forty-four studies additionally reported non-AUC outcomes (e.g. McNemar's test to compare test performance at a specified diagnostic threshold [58], the Wilcoxon signed rank test to compare changes in patient management decisions [64]). Eight (16%) of the studies included a ROC researcher as an author [39], [47], [48], [54], [60], [65], [66], [72].

Presentation of results

Extracted data relating to the presentation of individual study results are presented graphically in Fig. 3. All studies presented ROC AUC as an accuracy measure, with 49 (96%) presenting the change in AUC for the conditions tested. Thirty-five (69%) studies presented additional measures such as change in sensitivity/specificity (24 studies), positive/negative predictive values (5 studies), or other measures (e.g. changes in clinical management decisions [64], intraobserver agreement [36]). Change in AUC was the primary outcome in 45 (88%) studies. Others used sensitivity [34], [40], accuracy [35], [69], the absolute AUC [44] or the JAFROC figure of merit [68]. All studies presented an average of the primary outcome over all readers, with individual reader results presented in 38 (75%) studies but not in 13 (25%). The mean change/difference in AUC was 0.051 (range −0.052 to 0.280) across the extracted studies and was stated as “significant” in 31 and “non-significant” in the remaining 20. No study failed to comment on the significance of the stated change/difference in AUC. In 22 studies we considered that a significant change in AUC was unlikely to be due to results from a single reader/patient; we could not determine whether this was possible in 11 studies, and judged this not applicable in a further 18 studies. One study appeared to report an advantage for a test when the AUC increased, but not significantly [65]. There were 5 (10%) studies where there appeared to be discrepancies between the data presented in the abstract/text/ROC curve [36], [38], [69], [77], [80].

Fig. 3: Presentation of individual study results. https://doi.org/10.1371/journal.pone.0116018.g003

While the majority of studies (42, 82%) did not present an interpretation of their data framed in terms of changes to individual patient diagnoses, 9 (18%) did so, using outcomes in addition to ROC AUC: For example, as a false-positive to true-positive ratio [35] or the proportion of additional biopsies precipitated and disease detected [64] , or effect on callback rate [43] . The change in AUC was non-significant in 22 studies and in 12 of these the authors speculated why, for example stating that the number of cases was likely to be inadequate [65] , [70] , that the observer task was insufficiently taxing [36] , or that the difference was too subtle to be resolved [45] . For studies where a non-significant change in AUC was observed, authors sometimes framed this as demonstrating equivalence (16 studies, e.g. [55] , [74] ), stated that there were other benefits (3 studies), or adopted other interpretations. For example, one study stated that there were “beneficial” effects on many cases despite a non-significant change in AUC [54] and one study stated that the intervention “improved visibility” of microcalcifications noting that the lack of any statistically significant difference warranted further investigation [65] .

While many studies have used ROC AUC as an outcome measure, very little research has investigated how these studies are conducted, analysed and presented. We could find only a single existing systematic review that has investigated this question [87]. The authors stated in their introduction, “we are not aware of any attempt to provide an overview of the kinds of ROC analyses that have been most commonly published in radiologic research.” They investigated articles published in the journal Radiology between 1997 and 2006, identifying 295 studies [87]. The authors concluded that “ROC analysis is widely used in radiologic research, confirming its fundamental role in assessing diagnostic performance”. For the present review, we wished to focus on MRMC studies specifically, since these are the most complex and are often used as the basis for technology licensing. We also wished to broaden our search criteria beyond a single journal. Our systematic review found that the quality of data reporting in MRMC studies using ROC AUC as an outcome measure was frequently incomplete; we would therefore agree with the conclusion of Shiraishi et al., who stated that studies “were not always adequate to support clear and clinically relevant conclusions” [87].

Many omissions we identified were those related to general study design and execution, and are well-covered by the STARD initiative [88] as factors that should be reported in studies of diagnostic test accuracy in general. For example, we found that the number of participating research centres was unclear in approximately one-third of studies, that most studies did not describe whether patients were symptomatic or asymptomatic, that criteria applied to case selection were sometimes unclear, and that observer blinding was not mentioned in one-fifth of studies. Regarding statistical methods, STARD states that studies should, “describe methods for calculating or comparing measures of diagnostic accuracy” [88] ; this systematic review aimed to focus on description of methods for MRMC studies using ROC AUC as an outcome measure.

The large majority of studies used fewer than 10 observers, some did not describe reader experience, and the majority did not mention whether observers were aware of the prevalence of abnormality, a factor that may influence diagnostic vigilance. Most studies required readers to detect lesions while a minority asked for characterization, and others combined the two. We believe it is important for readers to understand the precise nature of the interpretative task, since this will influence the rating scale used to build the ROC curve. A variety of units of analysis were adopted, with just under half being the patient case. We were surprised that some studies failed to record the number of disease-positive and disease-negative patients in their dataset. Concerning the confidence scales used to construct the ROC curve, only a small minority (12%) of studies stated that readers were trained to use these in advance of scoring. We believe such training is important so that readers can appreciate exactly how the interpretative task relates to the scale; there is evidence that radiologists score in different ways when asked to perform the same scoring task because of differences in how they interpret the task [89]. For example, readers should appreciate how the scale reflects lesion detection and/or characterization, especially if both are required, and how multiple abnormalities per unit of analysis are handled. Encouragement to use the full range of the scale is required for normal rating distributions. Whether readers must use the same scale for patients with and without pathology is also important to know.

Despite their importance for understanding the validity of study results, we found that description of the confidence scores, the ROC curve and its analysis was often incomplete. Strikingly, only three studies described the distribution of confidence scores and none stated whether transformation to a normal distribution was needed. When the publicly available DBM MRMC software [81] is used for ROC AUC modeling, it requires assumptions of normality for confidence scores or their transformations when parametric ROC curve fitting methods are used. Where confidence scores are not normally distributed, these software methods are not recommended [84]–[86], [90]. Although Hanley shows that ROC curves can be reasonable under some distributions of non-normal data [91], concerns have been raised particularly in imaging detection studies measuring clinically useful tests with good performance to distinguish well-defined abnormalities. In tests with good performance, two factors make estimation of ROC AUC unreliable. First, readers' scores are by definition often at the ends of the confidence scale, so that the confidence score distributions for normal and abnormal cases have very little overlap [82]–[86]. Second, tests with good performance also have few false positives, making ROC AUC estimation highly dependent on the confidence scores assigned to possibly fewer than 5% or 10% of cases in the study [86].

Most studies did not describe the method used for curve fitting. Over 40% of studies presented no ROC curve in the published article. When present, the large majority were smoothed and averaged over all readers. Only four articles presented data points underlying the curve meaning that the degree of any extrapolation could not be assessed despite this being an important factor regarding interpretation of results [92] . While, by definition, all studies used MRMC AUC methods, most reported additional non-AUC outcomes. Approximately one-quarter of studies did not present AUC data for individual readers. Because of this, variability between readers and/or the effect of individual readers on the ultimate statistical analysis could not be assessed.

Interpretation of study results was variable. Notably, when no significant change in AUC was demonstrated, authors stated that the number of cases was either insufficient or that the difference could not be resolved by the study, appearing to claim that their studies were underpowered rather than that the intervention was ineffective when required to improve diagnostic accuracy. Indeed some studies claimed an advantage for a new test in the face of a non-significant increase in AUC, or turned to other outcomes as proof of benefit. Some interpreted no significant difference in AUC as implying equivalence.

Our review does have limitations. Indexing of the statistical methods used to analyse studies is not common, so we used a proxy to identify studies: their citation of “key” references related to MRMC ROC methodology. While it is possible we missed some studies, our aim was not to identify all studies using such analyses. Rather, we aimed to gather a representative sample that would provide a generalizable picture of how such studies are reported. It is also possible that, by their citation of methodological papers (and on occasion inclusion of a ROC researcher as an author), our review was biased towards papers likely to be of higher methodological quality than average. This systematic review was cross-disciplinary, and two radiological researchers performed the bulk of the extraction rather than statisticians. This proved challenging, since the depth of statistical knowledge required was demanding, especially when details of the analysis were being considered. We anticipated this and piloted extraction on a sample of five papers to determine if the process was feasible, deciding that it was. Advice from experienced statisticians was also available when uncertainty arose.

In summary, via systematic review we found that MRMC studies using ROC AUC as the primary outcome measure often omit important information from both the study design and analysis, and presentation of results is frequently not comprehensive. Authors using MRMC ROC analyses should be encouraged to provide a full description of their methods and results so as to increase interpretability.

Supporting Information

S1 File. Extraction sheet used for the systematic review.
https://doi.org/10.1371/journal.pone.0116018.s001

S2 File. Raw data extracted for the systematic review.
https://doi.org/10.1371/journal.pone.0116018.s002

S1 PRISMA Checklist.
https://doi.org/10.1371/journal.pone.0116018.s003

Author Contributions

Conceived and designed the experiments: TD AAP SH TRF DGA SM. Performed the experiments: TD AAP SH TRF DGA SM. Analyzed the data: TD AAP SH TRF DGA SM. Contributed reagents/materials/analysis tools: TD AAP SH TRF DGA SM. Wrote the paper: TD AAP SH TRF DGA SM.

  • 7. Mallett S, Halligan S, Collins GS, Altman DG (2014) Exploration of analysis methods for diagnostic imaging tests: Problems with ROC AUC and confidence scores in CT colonography. PLoS One (in press).
  • 81. DBM MRMC software, v2.1. Available: http://www-radiology.uchicago.edu/krl/KRL_ROC/software_index6.htm
  • 85. Zhou XH, Obuchowski N, McClish DK (2002) Statistical Methods in Diagnostic Medicine. New York, NY: Wiley.

iMRMC: Multi-Reader, Multi-Case Analysis Methods (ROC, Agreement, and Other Metrics)

  • convertDF: Convert MRMC data frames
  • convertDFtoDesignMatrix: Convert an MRMC data frame to a design matrix
  • convertDFtoScoreMatrix: Convert an MRMC data frame to a score matrix
  • createGroups: Assign a group label to items in a vector
  • createIMRMCdf: Convert a data frame with all needed factors to doIMRMC...
  • doIMRMC: MRMC analysis of the area under the ROC curve
  • extractPairedComparisonsBRBM: Extract between-reader between-modality pairs of scores
  • extractPairedComparisonsWRBM: Extract within-reader between-modality pairs of scores
  • getBRBM: Get between-reader, between-modality paired data from an MRMC...
  • getMRMCscore: Get a score from an MRMC data frame
  • getWRBM: Get within-reader, between-modality paired data from an MRMC...
  • init.lecuyerRNG: Initialize the l'Ecuyer random number generator
  • laBRBM: MRMC analysis of between-reader between-modality limits of...
  • laWRBM: MRMC analysis of within-reader between-modality limits of...
  • renameCol: Rename a data frame column name or a list object name
  • roc2binary: Convert ROC data formatted for doIMRMC to TPF and FPF data...
  • roeMetzConfigs: roeMetzConfigs
  • sim.gRoeMetz: Simulate an MRMC data set of an ROC experiment comparing two...
  • sim.gRoeMetz.config: Create a configuration object for the sim.gRoeMetz program
  • simMRMC: Simulate an MRMC data set
  • simRoeMetz.example: Simulates a sample MRMC ROC experiment
  • successDFtoROCdf: Convert an MRMC data frame of successes to one formatted for...
  • undoIMRMCdf: Convert a doIMRMC formatted data frame to a standard data...
  • uStat11: Analysis of U-statistics degree 1,1
  • uStat11.diff: Create the kernel and design matrices for uStat11
  • uStat11.identity: Create the kernel and design matrices for uStat11

iMRMC: Multi-Reader, Multi-Case Analysis Methods (ROC, Agreement, and Other Metrics)

Do Multi-Reader, Multi-Case (MRMC) analyses of data from imaging studies where clinicians (readers) evaluate patient images (cases). What does this mean? ... Many imaging studies are designed so that every reader reads every case in all modalities, a fully-crossed study. In this case, the data is cross-correlated, and we consider the readers and cases to be cross-correlated random effects. An MRMC analysis accounts for the variability and correlations from the readers and cases when estimating variances, confidence intervals, and p-values. The functions in this package can treat arbitrary study designs and studies with missing data, not just fully-crossed study designs. The initial package analyzes the reader-average area under the receiver operating characteristic (ROC) curve with U-statistics according to Gallas, Bandos, Samuelson, and Wagner 2009 <doi:10.1080/03610920802610084>. Additional functions analyze other endpoints with U-statistics (binary performance and score differences) following the work by Gallas, Pennello, and Myers 2007 <doi:10.1364/JOSAA.24.000B70>. Package development and documentation is at <https://github.com/DIDSR/iMRMC/tree/master>.


Chest radiograph classification and severity of suspected COVID-19 by different radiologist groups and attending clinicians: multi-reader, multi-case study

  • Open access
  • Published: 25 October 2022
  • Volume 33, pages 2096–2104 (2023)


  • Arjun Nair   ORCID: orcid.org/0000-0001-9270-3771 1 ,
  • Alexander Procter 1 ,
  • Steve Halligan 2 ,
  • Thomas Parry 2 ,
  • Asia Ahmed 1 ,
  • Mark Duncan 1 ,
  • Magali Taylor 1 ,
  • Manil Chouhan 1 ,
  • Trevor Gaunt 1 ,
  • James Roberts 1 ,
  • Niels van Vucht 1 ,
  • Alan Campbell 1 ,
  • Laura May Davis 1 ,
  • Joseph Jacob 3 ,
  • Rachel Hubbard 1 ,
  • Shankar Kumar 1 ,
  • Ammaarah Said 1 ,
  • Xinhui Chan 4 ,
  • Tim Cutfield 4 ,
  • Akish Luintel 4 ,
  • Michael Marks 4 ,
  • Neil Stone 4 &
  • Sue Mallett 2


To quantify reader agreement for the British Society of Thoracic Imaging (BSTI) diagnostic and severity classification for COVID-19 on chest radiographs (CXR), in particular agreement for an indeterminate CXR that could instigate CT imaging, from single and paired images.

Twenty readers (four groups of five individuals), comprising consultant chest radiologists (CCR), general consultant radiologists (GCR), specialist registrar radiologists (RSR), and infectious diseases clinicians (IDR), assigned BSTI categories and severity, in addition to the modified Covid-Radiographic Assessment of Lung Edema Score (Covid-RALES), to 305 CXRs (129 paired; 2 time points) from 176 guideline-defined COVID-19 patients. Percentage agreement with a consensus of two chest radiologists was calculated for (1) categorisation into those needing CT (indeterminate) versus those that did not (classic/probable, non-COVID-19); (2) severity; and (3) severity change on paired CXRs using the two scoring systems.

Agreement with consensus for the indeterminate category was low across all groups (28–37%). Agreement for other BSTI categories was highest for classic/probable for the other three reader groups (66–76%) compared to GCR (49%). Agreement for normal was similar across all radiologists (54–61%) but lower for IDR (31%). Agreement for a severe CXR was lower for GCR (65%), compared to the other three reader groups (84–95%). For all groups, agreement for changes across paired CXRs was modest.

Agreement for the indeterminate BSTI COVID-19 CXR category is low, and generally moderate for the other BSTI categories and for severity change, suggesting that the test, rather than readers, is limited in utility for both deciding disposition and serial monitoring.

• Across different reader groups, agreement for COVID-19 diagnostic categorisation on CXR varies widely.

• Agreement varies to a degree that may render CXR alone ineffective for triage, especially for indeterminate cases.

• Agreement for serial CXR change is moderate, limiting utility in guiding management.


Introduction

Coronavirus disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 virus (SARS-CoV-2), became a global pandemic. In the UK, the pandemic caused record deaths and exerted unprecedented strain on the National Health Service (NHS). Facing such overwhelming demand, clinicians must rapidly and accurately categorise patients with suspected COVID-19 into high and low probability and severity. In March 2020, the British Society of Thoracic Imaging (BSTI) and NHS England produced a decision support algorithm to triage suspected COVID-19 patients [ 1 ]. This assumed that laboratory diagnosis might not be rapidly or widely available, emphasising clinical assessment and chest radiography (CXR).

CXR therefore assumes a pivotal role, not only in diagnosis but also in the classification and monitoring of severity, which directs clinical decision-making. This includes whether intensive treatment is required (those with “classic severe” disease), along with subsequent chest computed tomography (CT) in those with uncertain diagnosis [ 2 , 3 , 4 ] or whose CXR is deteriorating.

Clearly, this requires that CXR interpretation reflects both diagnosis and severity accurately. While immediate interpretation by specialist chest radiologists is desirable, this is unrealistic given demands, and interpretation falls frequently to non-chest radiologists, radiologists in training, or attending clinicians. However, we are unaware of any study that compares agreement and variation between these groups for CXR diagnosis and severity of COVID-19. We aimed to rectify this by performing a multi-case, multi-reader study comparing the interpretation of radiologists (including specialists, non-specialists, and trainees) and non-radiologists to a consensus reference standard, for the CXR diagnosis, severity, and temporal change of COVID-19.

Due to the continued admission of patients to hospital for COVID-19 as the virus becomes another seasonal coronavirus infection, this study has important ongoing relevance to clinical practice.

Materials and methods

Study design and ethical approval

We used a multi-reader, multi-case design in this single-centre study. Our institution granted ethical approval for COVID-19-related imaging studies (Integrated Research Application Service reference IRAS 282063). Informed consent was waived as part of the approval.

Study population and image acquisition

A list of patients aged ≥ 18 years who presented consecutively to our emergency department with suspected COVID-19 infection, as per contemporary national and international definitions [5], between 25th February 2020 and 22nd April 2020 and who had undergone at least one CXR was supplied by our infectious diseases clinical team. All CXRs were acquired as computed or digital radiographs in the anteroposterior (AP) projection, using portable X-ray units as per institutional protocol.

We recruited four groups of readers (each consisting of five individuals), required to interpret suspected COVID-19 CXR in daily practice, as follows:

Group 1: Consultant chest radiologists (CCR) (with 7 to 19 years of radiology experience)

Group 2: Consultant radiologists not specialising in chest radiology (GCR) (with 8–30 years of radiology experience)

Group 3: Radiology specialist registrars in training (RSR) (with 2–5 years of radiology experience)

Group 4: Infectious diseases consultants and senior trainees (IDR) (with no prior radiology experience)

ID clinicians were chosen as a non-radiologist group because, at our institution and others, their daily practice necessitated both triage and subsequent management of COVID-19 patients via their own interpretation of CXR without radiological assistance.

Case identification, allocation, and consensus standard

Two subspecialist chest radiologists (with 16 and 7 years of experience, respectively) first independently assigned BSTI classifications (Table 1) to the CXRs of 266 consecutive eligible patients, unaware of the ultimate diagnosis and all clinical information. Of these, 129 had paired CXRs; that is, they had a second CXR at least 24 h after their presentation CXR. The remaining 137 patients had a single presentation CXR. We included patients with unpaired as well as paired CXRs to enable us to enrich the study cohort with potential CVCX2 (indeterminate) cases, because the high institutional prevalence of COVID-19 during the study period meant that few consecutive cases would be designated “indeterminate” or “normal”; evaluating this category is central to understanding downstream management implications for patients. There were 47/137 unpaired CXRs where at least one of the two subspecialist chest radiologists classified the CXR as CVCX2, so we used these 47 CXRs to enrich the cohort. The final study cohort comprised 176 patients with 305 CXRs: 129 patients with paired CXRs (258 images) and 47 patients with single CXRs.

From this cohort of 305 CXRs, five random reading sets were generated, each containing approximately equal numbers of paired and unpaired CXRs (Table 2); minor variations in set size were due to randomisation. Each CXR was interpreted by two readers from each group, and the same reader interpreted both time points of paired CXRs. Accordingly, the individuals designated Reader 1 in each group (CCR1, GCR1, RSR1, and IDR1) read the same cases, the Readers 2 read the same cases, and so on. In this way, 610 reads (305 CXRs × 2 readers) were generated per reader group, resulting in 2440 reads overall (Fig. 1). The distribution of the total number of these cases paralleled cumulative COVID-19 referrals to London hospitals over the study period (Fig. S1).
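To make the allocation concrete, here is a minimal sketch in Python under stated assumptions: patient records are represented as hypothetical dicts with a `paired` flag, patients (not individual CXRs) are shuffled within the paired and unpaired strata and dealt round-robin into five sets so that both time points of a paired examination stay with one reader, and the further step of assigning each set to two readers per group (which yields the 610 reads) is omitted for brevity.

```python
import random

def allocate_reading_sets(patients, n_sets=5, seed=42):
    """Randomise patients into n_sets reading sets, balancing paired and
    unpaired cases separately; both CXRs of a paired patient travel
    together, so one reader interprets both time points."""
    rng = random.Random(seed)
    sets = [[] for _ in range(n_sets)]
    for flag in (True, False):                       # one stratum at a time
        stratum = [p for p in patients if p["paired"] == flag]
        rng.shuffle(stratum)
        for i, patient in enumerate(stratum):
            sets[i % n_sets].append(patient)         # round-robin after shuffle
    return sets

# 129 paired and 47 unpaired patients, as in the study cohort
patients = [{"id": i, "paired": i < 129} for i in range(176)]
print([len(s) for s in allocate_reading_sets(patients)])  # [36, 36, 35, 35, 34]
```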

Figure 1: STARD flowchart showing the derivation of the CXR reading dataset per reading group

The same two subspecialist chest radiologists assigned an “expert consensus” score to all 305 CXRs at a separate sitting two months after their original reading (to avoid recall bias), blinded to all reader interpretations, including their own original reads.

Image interpretation

Readers were provided with a refresher presentation explaining BSTI categorisation and severity scoring, with examples. Readers were asked to assume they were reading in a high prevalence “pandemic” clinical scenario, with high pre-test probability, and to categorise incidental findings (e.g. cardiomegaly or minor atelectasis) as CVCX0, and any non-COVID-19 process (e.g. cardiac failure) as CVCX3.

Irrespective of the diagnostic category, we asked readers to classify severity using two scoring systems: the subjective BSTI severity scale (normal, mild, moderate, or severe), and a semiquantitative score (“Covid-RALES”) modified for COVID-19 CXR interpretation by Wong et al. from the Radiographic Assessment of Lung Edema (RALE) score [3]. This score grades the extent of airspace opacification in each lung (0 = no involvement; 1 = < 25%; 2 = 25–50%; 3 = 50–75%; 4 = > 75% involvement) and sums the two grades, so the minimum possible score is 0 and the maximum is 8. We evaluated this score because it has been assessed by others and is used to assess severity in clinical trials at our institution.
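As an illustration of the arithmetic, a minimal sketch of the Covid-RALES calculation follows. The function and its percentage inputs are hypothetical conveniences (readers assigned the 0–4 grade per lung directly), and the handling of the published band boundaries (25–50%, 50–75%) is one consistent reading of the scale.

```python
def covid_rales(left_pct, right_pct):
    """Total Covid-RALES score (0-8): grade each lung 0-4 by the extent
    of airspace opacification, then sum the two grades."""
    def grade(pct):
        if pct == 0:
            return 0      # no involvement
        if pct < 25:
            return 1      # <25%
        if pct <= 50:
            return 2      # 25-50%
        if pct <= 75:
            return 3      # 50-75%
        return 4          # >75%
    return grade(left_pct) + grade(right_pct)

print(covid_rales(30, 80))  # 2 + 4 = 6
```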

All cases were assigned a unique anonymised identifier on our institutional Picture Archiving and Communications System (PACS). Readers viewed each CXR unaware of clinical information and of any prior or subsequent imaging; paired CXRs were therefore read as individual studies, without direct comparison between pairs. Observers evaluated CXRs on displays replicating their normal practice: radiologists used displays conforming to standards set by the Royal College of Radiologists, while ID clinicians used the high-definition flat-panel liquid crystal display (LCD) monitors used for ward-based clinical image review at our institution.

Sample size and power calculation

The study was powered to detect a 10% difference between experts and the other reader groups in correct identification of CXRs warranting CT referral on the basis of indeterminate findings (defined as CVCX2). We estimated that the most experienced group (CCR) would correctly refer 90% of patients to CT. At 80% power, 86 indeterminate CXRs would be required to detect a 10% difference in CT referral using paired proportions; this equated to 305 CXRs (176 patients), based on the prevalence of indeterminate findings in pre-study reads performed by the 2 expert readers more than 1 month before the study reads.
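For illustration, below is a minimal sketch of a paired-proportions sample-size calculation using the Connor (1987) normal approximation for McNemar's test. The discordance split is a hypothetical assumption (the paper reports only the expected referral rates of 90% versus 80%), so the sketch shows the method rather than reproducing the figure of 86 indeterminates exactly.

```python
from math import sqrt, ceil
from scipy.stats import norm

def mcnemar_pairs(p10, p01, alpha=0.05, power=0.80):
    """Pairs needed to detect a difference delta = p10 - p01 between the
    two discordant-cell probabilities of a paired 2x2 table."""
    delta = p10 - p01                       # detectable difference (0.10 here)
    psi = p10 + p01                         # total discordant proportion (assumed)
    z_a = norm.ppf(1 - alpha / 2)           # two-sided significance
    z_b = norm.ppf(power)
    return ceil((z_a * sqrt(psi) + z_b * sqrt(psi - delta**2)) ** 2 / delta**2)

# Hypothetical discordance split consistent with 90% vs 80% correct referral:
print(mcnemar_pairs(p10=0.11, p01=0.01))    # 92 pairs under these assumptions
```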

Statistical analysis

The primary outcome was reader group agreement with expert consensus for an indeterminate CXR, which under the BSTI guidance is the surrogate for CT referral. An alternative clinical triage categorisation for CT referral would be to combine the “indeterminate” and “normal” BSTI categories (CVCX0 and CVCX2). We therefore first calculated the percentage agreement between each reader and the consensus reading for each BSTI diagnostic categorisation, and then assessed percentage agreement with the BSTI categorisation dichotomised into (1) CVCX0 and CVCX2 (i.e. the categories that might still warrant CT given sufficiently high clinical suspicion) versus (2) CVCX1 and CVCX3 (i.e. the categories that would probably not warrant CT). We also assessed agreement for BSTI severity scoring. All percentage agreements are described as means with 95% confidence intervals per reader group.
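A minimal sketch of this agreement summary follows, with hypothetical inputs: `reader_labels` maps each reader in a group to their assigned BSTI categories, and `consensus` holds the expert categories in the same case order. The paper does not state how the 95% confidence intervals were computed, so the t-interval across readers below is an assumption.

```python
import numpy as np
from scipy import stats

def group_agreement(reader_labels, consensus):
    """Mean percentage agreement with consensus across a reader group,
    with a 95% t-based confidence interval over readers."""
    consensus = np.asarray(consensus)
    per_reader = np.array([
        100 * np.mean(np.asarray(labels) == consensus)
        for labels in reader_labels.values()
    ])
    mean = per_reader.mean()
    lo, hi = stats.t.interval(0.95, df=per_reader.size - 1,
                              loc=mean, scale=stats.sem(per_reader))
    return mean, (lo, hi)

# For the dichotomised triage analysis, recode categories before calling:
TRIAGE = {"CVCX0": "CT", "CVCX2": "CT", "CVCX1": "no CT", "CVCX3": "no CT"}
```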

Finally, for paired CXR reads, we calculated the number and percentage agreement between each group and the consensus standard for no change, decrease, or increase in (1) the BSTI severity classification and (2) the Covid-RALES score.
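A companion sketch for the paired-change comparison, again with illustrative names; the severity values may be BSTI grades or Covid-RALES scores.

```python
def change_category(score_t0, score_t1):
    """Classify a pair of severity scores as increase, decrease, or no change."""
    if score_t1 > score_t0:
        return "increase"
    if score_t1 < score_t0:
        return "decrease"
    return "no change"

def change_agreement(reader_pairs, consensus_pairs):
    """Percentage of paired cases where the reader's change category
    matches the consensus change category."""
    matches = [change_category(*r) == change_category(*c)
               for r, c in zip(reader_pairs, consensus_pairs)]
    return 100 * sum(matches) / len(matches)
```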

Results

Baseline characteristics

The 176 patients had a median age of 70 years (range 18–99 years); 118 (67%) were male. Due to image processing errors, a CXR was unreadable in one patient with a single CXR and in three patients with paired CXRs, leaving 301 readable CXRs.

The expert consensus assigned the following BSTI categories: CVCX0 in 97 (32%), CVCX1 in 119 (40%), CVCX2 in 58 (19%), and CVCX3 in 27 (9.0%). Consensus BSTI severity was normal, mild, moderate, or severe in 97 (32%), 93 (31%), 68 (23%), and 43 (14%) respectively. The median consensus Covid-RALES was 2 (IQR 0–4, range 0–8).

Agreement for indeterminate category (Fig. 2)

Our primary outcome was reader group agreement with expert consensus for indeterminate COVID-19 (CVCX2), reflecting potential triage to CT. The mean agreement for CVCX2 was generally low (28 to 37%). For all reader groups, the main alternative classification for CVCX2 was CVCX1 (“classic” COVID-19), followed by CVCX3 (not COVID-19) (Fig. S2). Even CCR1 and CCR2, the two subspecialist readers who composed the expert consensus, demonstrated low agreement with their own consensus for CVCX2 (Fig. S3). These data suggest that basing CT referral on CXR interpretation is unreliable, even when the CXR is interpreted by subspecialist chest radiologists.

Figure 2: Percentage agreement with consensus for individual BSTI categories for reader groups

An alternative clinical triage categorisation for CT referral would be to combine the “indeterminate” and “normal” BSTI categories (CVCX0 and CVCX2), which resulted in higher agreement (CCR 73% (95% CI 68%, 77%), RSR 75% (71%, 79%), GCR 58% (53%, 62%), and IDR 61% (56%, 65%)).

Agreement for BSTI categorisation (Table 3 and Fig. 2)

Agreement was highest for CVCX1 (“classic/probable”) for the CCR (75% (69%, 80%)), RSR (76% (71%, 81%)), and IDR (66% (60%, 72%)) groups, but interestingly not for GCR (49% (43%, 55%)), whose agreement for CVCX1 was comparable to their agreement for CVCX0 and CVCX3 (“non-COVID-19”) (although still higher than their agreement for CVCX2 (“indeterminate”)). When disagreeing with a consensus of CVCX1, GCR were most likely to assign CVCX2 (Fig. S1).

Agreement with consensus for CVCX0 (“normal”) was similar for radiologists of all types (mean agreement for CCR, GCR, and RSR of 59%, 54%, and 61% respectively), but lower for IDR (31%). For CVCX3 (not COVID-19), CCR and GCR were generally more likely than RSR and IDR readers to agree with the consensus.

Agreement for BSTI severity classification (Table 4 and Fig. 3)

Agreement that classification was “severe” was highest for all groups, but lower for GCR (65% (54%, 74%)) than for the other groups (means of 95% (89%, 98%), 84% (74%, 90%), and 84% (75%, 90%) for CCR, RSR, and IDR respectively). When readers departed from a consensus grade of normal, they most often designated the CXR “mild” (Fig. S4).

Figure 3: Percentage agreement with consensus for BSTI severity classification for reader groups

Agreement for change on CXRs (Table 5 and Fig. 4)

The expert consensus reference found that the majority of BSTI severity scores did not change where paired CXR examinations were separated by just one or two days. Using the BSTI severity classification, the highest agreement with consensus across all groups was for “no change”, with percentage agreement of 66%, 61%, 44%, and 48% for CCR, GCR, RSR, and IDR respectively.

Figure 4: Frequency charts showing agreement with consensus for score change using the BSTI severity classification (a) and the Covid-RALES (b) for reader groups

In contrast, when using Covid-RALES, the highest agreement with consensus across all groups was for an “increased score”, with percentage agreement of 57%, 59%, 59%, and 47% for CCR, GCR, RSR, and IDR respectively. This most likely reflects the larger number of individual categories assigned by Covid-RALES.

Discussion

Thus far, studies of CXR for COVID-19 have reported its diagnostic accuracy [6, 7], the implications of CXR severity assessment using various scores [4, 8, 9, 10], or quantification using computer vision techniques [11, 12, 13]. Inter-observer agreement for categorisation of COVID-19 CXRs, including for the BSTI classification (but not BSTI severity), has been assessed amongst consultant radiologists [14], and inter-observer differences according to radiologist experience have been described [15, 16]. Notably, in a case-control study, Hare et al compared agreement for the BSTI classification amongst seven consultant radiologists, including two fellowship-trained chest radiologists (the latter providing the reference standard). They found only fair agreement for the CVCX2 (κ = 0.23) and “non-COVID-19” (κ = 0.37) categories, but that combining the CVCX2 and CVCX3 categories improved inter-observer agreement (κ = 0.58) [14]. A recent study compared the sensitivity and specificity (but not agreement) of using the “classic/probable” BSTI category for COVID-19 diagnosis between Emergency Department clinicians and radiologists (both of various grades), based on a retrospective review of their classifications [17].

Our study differs in that it pivots around three potential clinical scenarios that use the CXR to manage suspected COVID-19. Using a prospective multi-reader, multi-case design, we determined reader agreement for four clinical groups who are tasked with CXR interpretation in daily practice and compared these to a consensus reference standard. Firstly, we evaluated reader agreement when using CXR to triage patients for CT when CXR imaging is insufficient to diagnose COVID-19. Secondly, we examined agreement for disease severity using two scores (BSTI and RALES). Thirdly, we investigated whether paired CXRs could monitor any change in severity.

When CXR was used to identify which patients need CT, based on our pre-specified BSTI category of an indeterminate interpretation, agreement with our consensus was low for the indeterminate category alone (28 to 37%) and moderate when the indeterminate and normal categories were combined (58 to 75%). All four reader groups had similar agreement with the consensus for identifying indeterminates, indicating that the level of specialism or radiologist expertise did not enhance agreement. When indeterminates were combined with normal, the GCR and IDR groups had lower agreement because the GCR group assigned more indeterminates as non-COVID-19, whereas the IDR group assigned more to classic/probable COVID-19.

Similar (albeit modest) agreement for the “normal” category amongst radiologists of all grades and types suggests that these factors are not influential when assigning this category. Radiologists seemed willing to consider many CXRs normal despite assuming a high prevalence setting. Reassuringly, this suggests that patient disposition, if based on a normal CXR interpretation, is unlikely to vary much depending on the category of radiologist. Conversely, the lower agreement of ID clinicians for a normal CXR suggests an inclination to read CXRs as abnormal overall, since they classified normal CXRs mostly as “indeterminate” but also as “classic/probable” COVID-19. We speculate that the contemporary pandemic clinical experience of ID clinicians made it difficult for them to consider a CXR normal, even when deprived of supporting clinical information.

In contrast, general consultant radiologists were less inclined to assign the “classic/probable” category, predominantly favouring the indeterminate category. Our results are somewhat at odds with those of Hare et al [14], who found substantial agreement for the CVCX1 category amongst seven consultant radiologists. The reasons underpinning this reluctance to assign the category (even in a high prevalence setting) are difficult to intuit, but may be partly attributable to a desire to adhere to strict definitions for the category and thus maintain specificity.

Severity scores can quantify disease fluctuations that influence patient management, have prognostic implications [8, 9, 10], and may also be employed in clinical trials. However, this is only possible if scores are reliable, which is reflected by reader agreement regarding both their value and their change. For our second and third clinical scenarios, we found that assessment of severity and of change, and therefore of CXR severity itself, varied between reader groups and readers with either severity scoring system, but in different ways. It is probably unsurprising that agreement for no change in BSTI severity was highest for all reader groups, given that the four-grade nature of that classification is less likely to detect subtle change. In contrast, the finer gradation of Covid-RALES allows smaller severity increments to be captured more readily. A higher number of categories also encourages disagreement; despite this, agreement was modest.

We wished to examine CXR utility in a real-world clinical setting using consecutive patients presenting to our emergency department with suspected COVID-19 infection. Our findings are important because they examine clear clinical roles for CXR beyond a purely binary diagnosis of COVID-19 versus non-COVID-19. Rather, we examined the CXR as an aid to clinical decision-making and as an adjunct to clinical and molecular testing. CXR has moderate pooled sensitivity and specificity for COVID-19 (81% and 72% respectively) [18] and, in the context of other clinical and diagnostic tests [19], such diagnostic accuracy could be considered favourable. Although thoracic CT has higher sensitivity for diagnosing COVID-19 [18], CXR has been used and investigated in this triage role both in the UK and internationally [20]. However, our results do have important implications when using CXR for diagnosis, because interpretation appears susceptible to substantial inter-reader variation. Investigating reader variability will also be crucial for the development, training, and evaluation of artificial intelligence algorithms to diagnose COVID-19, such as that now underway using the National COVID-19 Chest Imaging Database (NCCID) [21].

Our study has limitations. ID clinicians, as the first clinicians to assess potential COVID-19 cases, were the only group of non-radiologist clinicians evaluated. While we would have wished to evaluate emergency department and general medical colleagues as well, this proved impractical; however, we have no a priori expectation that these groups would perform differently. Our reference standard used two subspecialist chest radiologists; like any subjective standard, ours is imperfect, but with precedent [14]. We note that our data on the variability of reader classifications are robust regardless of the reference standard (see supplementary Figs. S2 and S4). Arguably, we disadvantaged ID clinicians by requiring them to interpret CXRs using LCD monitors, but this reflects normal clinical practice. It is possible that readers focussed on BSTI diagnostic categories in isolation, rather than considering how their categorisation would be used to decide patient management but, again, this reflects normal practice (since radiologists do not determine management). Readers did not compare serial CXRs directly, but read them in isolation; we note a potential role for monitoring disease progression when serial CXRs are viewed simultaneously, but this would require assessment in other studies.

In conclusion, across a diverse group of clinicians, agreement for BSTI diagnostic categorisation of COVID-19 CXRs varies widely for many categories, and to such a degree that it may render CXR ineffective for triage using these categories. Agreement for serial change over time was also only moderate, underscoring the need for cautious interpretation of changes in severity scores assigned to serial CXRs read in isolation, if these are used to guide management and predict outcome.

Abbreviations

BSTI: British Society of Thoracic Imaging

CCR: Consultant chest radiologists

Covid-RALES: Radiographic Assessment of Lung Edema (RALE) score, modified for COVID-19 interpretation

CVCX: BSTI COVID-19 chest radiograph category

CXR: Chest radiograph

GCR: General consultant radiologists

IDR: Infectious diseases consultants and senior trainees

NHS: National Health Service

RSR: Radiology specialist residents in training

References

1. Nair A, Rodrigues JCL, Hare S et al (2020) A British Society of Thoracic Imaging statement: considerations in designing local imaging diagnostic algorithms for the COVID-19 pandemic. Clin Radiol 75(5):329–334

2. Guan WJ, Zhong NS (2020) Clinical characteristics of Covid-19 in China. Reply. N Engl J Med 382(19):1861–1862

3. Wong HYF, Lam HYS, Fong AH et al (2020) Frequency and distribution of chest radiographic findings in patients positive for COVID-19. Radiology 296(2):E72–E78

4. Liang W, Liang H, Ou L et al (2020) Development and validation of a clinical risk score to predict the occurrence of critical illness in hospitalized patients with COVID-19. JAMA Intern Med 180(8):1081–1089

5. Public Health England (2020) COVID-19: investigation and initial clinical management of possible cases (updated 14 December 2020). Available from: https://www.gov.uk/government/publications/wuhan-novel-coronavirus-initial-investigation-of-possible-cases/investigation-and-initial-clinical-management-of-possible-cases-of-wuhan-novel-coronavirus-wn-cov-infection . Accessed 23 Dec 2021

6. Schiaffino S, Tritella S, Cozzi A et al (2020) Diagnostic performance of chest X-ray for COVID-19 pneumonia during the SARS-CoV-2 pandemic in Lombardy, Italy. J Thorac Imaging 35(4):W105–W106

7. Gatti M, Calandri M, Barba M et al (2020) Baseline chest X-ray in coronavirus disease 19 (COVID-19) patients: association with clinical and laboratory data. Radiol Med 125(12):1271–1279

8. Orsi MA, Oliva G, Toluian T, Valenti PC, Panzeri M, Cellina M (2020) Feasibility, reproducibility, and clinical validity of a quantitative chest X-ray assessment for COVID-19. Am J Trop Med Hyg 103(2):822–827

9. Toussie D, Voutsinas N, Finkelstein M et al (2020) Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19. Radiology 297(1):E197–E206

10. Balbi M, Caroli A, Corsi A et al (2021) Chest X-ray for predicting mortality and the need for ventilatory support in COVID-19 patients presenting to the emergency department. Eur Radiol 31(4):1999–2012

11. Murphy K, Smits H, Knoops AJG et al (2020) COVID-19 on chest radiographs: a multireader evaluation of an artificial intelligence system. Radiology 296(3):E166–E172

12. Ebrahimian S, Homayounieh F, Rockenbach MABC et al (2021) Artificial intelligence matches subjective severity assessment of pneumonia for prediction of patient outcome and need for mechanical ventilation: a cohort study. Sci Rep 11(1):858

13. Jang SB, Lee SH, Lee DE et al (2020) Deep-learning algorithms for the interpretation of chest radiographs to aid in the triage of COVID-19 patients: a multicenter retrospective study. PLoS One 15(11):e0242759

14. Hare SS, Tavare AN, Dattani V et al (2020) Validation of the British Society of Thoracic Imaging guidelines for COVID-19 chest radiograph reporting. Clin Radiol 75(9):710.e9–710.e14

15. Cozzi A, Schiaffino S, Arpaia F et al (2020) Chest x-ray in the COVID-19 pandemic: radiologists' real-world reader performance. Eur J Radiol 132:109272

16. Reeves RA, Pomeranz C, Gomella AA et al (2021) Performance of a severity score on admission chest radiography in predicting clinical outcomes in hospitalized patients with coronavirus disease (COVID-19). Am J Roentgenol 217(3):623–632. https://doi.org/10.2214/AJR.20.24801

17. Kemp OJ, Watson DJ, Swanson-Low CL, Cameron JA, Von Vopelius-Feldt J (2020) Comparison of chest X-ray interpretation by Emergency Department clinicians and radiologists in suspected COVID-19 infection: a retrospective cohort study. BJR Open 2(1):20200020

18. Islam N, Ebrahimzadeh S, Salameh J-P et al (2021) Thoracic imaging tests for the diagnosis of COVID-19. Cochrane Database Syst Rev 3(3):CD013639. https://doi.org/10.1002/14651858.CD013639.pub4

19. Mallett S, Allen AJ, Graziadio S et al (2020) At what times during infection is SARS-CoV-2 detectable and no longer detectable using RT-PCR-based tests? A systematic review of individual participant data. BMC Med 18(1):346

20. Çinkooğlu A, Bayraktaroğlu S, Ceylan N, Savaş R (2021) Efficacy of chest X-ray in the diagnosis of COVID-19 pneumonia: comparison with computed tomography through a simplified scoring system designed for triage. Egypt J Radiol Nucl Med 52(1):1–9

21. Jacob J, Alexander D, Baillie JK et al (2020) Using imaging to combat a pandemic: rationale for developing the UK National COVID-19 Chest Imaging Database. Eur Respir J 56(2):2001809


Funding

The authors state that this work has not received any funding.

Author information

Arjun Nair and Alexander Procter contributed equally to this work.

Authors and Affiliations

Department of Radiology, University College London Hospital, 235 Euston Road, London, NW1 2BU, UK

Arjun Nair, Alexander Procter, Asia Ahmed, Mark Duncan, Magali Taylor, Manil Chouhan, Trevor Gaunt, James Roberts, Niels van Vucht, Alan Campbell, Laura May Davis, Rachel Hubbard, Shankar Kumar & Ammaarah Said

Centre for Medical Imaging, University College London, UCL Centre for Medical Imaging, 2nd Floor Charles Bell House, 43-45 Foley Street, London, W1W 7TS, UK

Steve Halligan, Thomas Parry & Sue Mallet

Centre for Medical Image Computing, Department of Computer Science, University College London, 90 High Holborn, Floor 1, London, WC1V 6LJ, UK

Joseph Jacob

Department of Tropical and Infectious Diseases, University College London Hospital, 235 Euston Road, London, NW1 2BU, UK

Xinhui Chan, Tim Cutfield, Akish Luintel, Michael Marks & Neil Stone


Corresponding author

Correspondence to Arjun Nair .

Ethics declarations

The scientific guarantor of this publication is Arjun Nair ([email protected]).

Conflict of interest

The authors of this manuscript declare relationships with the following companies:

AN reports a medical advisory role with Aidence BV, an artificial intelligence company; AN reports a consultation role with Faculty Science Limited, an artificial intelligence company.

SH and SM report grants from the National Institute for Health Research (NIHR) outside the submitted work.

JJ reports Consultancy fees from Boehringer Ingelheim, F. Hoffmann-La Roche, GlaxoSmithKline, NHSX; is on the Advisory Boards of Boehringer Ingelheim, F. Hoffmann-La Roche; has received lecture fees from Boehringer Ingelheim, F. Hoffmann-La Roche, Takeda; receives grant funding from GlaxoSmithKline; holds a UK patent (application number 2113765.8).

The other authors of this manuscript declare no relationships with any companies, whose products or services may be related to the subject matter of the article.

Statistics and biometry

One of the authors has significant statistical expertise: Professor Sue Mallett is a Professor of diagnostic and prognostic medical statistics, specialising in clinical trial design, methodology, and systematic reviews. She is a senior author of the paper.

Informed consent

Written informed consent was waived by the Institutional Review Board, as per our ethics approval.

Ethical approval

Institutional Review Board approval was obtained: our institution granted ethical approval for COVID-19-related imaging studies (Integrated Research Application Service reference IRAS 282063).

Methodology

• retrospective

• observational

• performed at one institution

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

(DOCX 1216 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Nair, A., Procter, A., Halligan, S. et al. Chest radiograph classification and severity of suspected COVID-19 by different radiologist groups and attending clinicians: multi-reader, multi-case study. Eur Radiol 33, 2096–2104 (2023). https://doi.org/10.1007/s00330-022-09172-w


Received: 28 December 2021

Revised: 19 July 2022

Accepted: 24 August 2022

Published: 25 October 2022

Issue Date: March 2023

DOI: https://doi.org/10.1007/s00330-022-09172-w


Keywords: Coronavirus; Observer variation


multi reader multi case study

In multi-reader multi-case study designs, do AI-diagnostic tools improve accuracy outcomes for radiologists in cancer risk stratification?

  • Thomas Packer
  • Muhammad Ayan Shahid
  • Dr Annette Pluddemann

Date created: 2023-10-16 12:32 PM | Last Updated: 2024-02-09 11:58 PM

Category: Project

Description: This systematic review aims to assess MRMC studies using ROC AUC outcome measures to consider whether AI-assisted clinician diagnosis may be beneficial. The review will also examine design, analysis, and reporting of MRMC studies regarding AI to consider how these may influence the data that is ultimately presented.


The future of AI seems to be as a diagnostic support tool integrated within a radiologist’s workflow, allowing an increase in efficiency and a reduction in errors [1]. Studies that thus compare AI models with radiologist or clinician-mediated diagnosis may not hig…


Multi-Reader Multi-Case Study for Performance Evaluation of High-Risk Thyroid Ultrasound with Computer-Aided Detection

Affiliations.

  • 1 Department of Surgery, National Taiwan University Hospital, Taipei 10002, Taiwan.
  • 2 Department of Internal Medicine, National Taiwan University, Taipei 10002, Taiwan.
  • 3 Graduate Institute of Industrial Engineering, National Taiwan University, Taipei 10617, Taiwan.
  • PMID: 32041119
  • PMCID: PMC7072687
  • DOI: 10.3390/cancers12020373

Physicians use sonographic characteristics as a reference for the possible diagnosis of thyroid cancers. The purpose of this study was to investigate whether physicians were more effective in their tentative diagnosis based on the information provided by a computer-aided detection (CAD) system. A computer compared software-defined and physician-adjusted tumor loci. A multicenter, multireader, and multicase (MRMC) study was designed to compare clinician performance without and with the use of CAD. Interobserver variability was also analyzed. Excellent, satisfactory, and poor segmentations were observed in 25.3%, 58.9%, and 15.8% of nodules, respectively. There were 200 patients with 265 nodules in the study set. Nineteen physicians scored the malignancy potential of the nodules. The average area under the curve (AUC) of all readers was 0.728 without CAD and significantly increased to 0.792 with CAD. The average standard deviation of the malignant potential score significantly decreased from 18.97 to 16.29. The mean malignant potential score significantly decreased from 35.01 to 31.24 for benign cases. With the CAD system, an additional 7.6% of malignant nodules would be suggested for further evaluation, and biopsy would not be recommended for an additional 10.8% of benign nodules. The results demonstrated that applying a CAD system would improve clinicians' interpretations and lessen the variability in diagnosis. However, more studies are needed to explore the use of the CAD system in an actual ultrasound diagnostic situation, where many more benign thyroid nodules would be seen.

Keywords: computer-aided detection; thyroid cancer; thyroid nodule; ultrasonography.
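To make the headline comparison concrete, here is a minimal sketch of a reader-averaged AUC calculation on simulated data. All inputs below are hypothetical stand-ins for the study's data, and a full MRMC analysis (for example with the iMRMC software described at the top of this page) would additionally model reader and case variability rather than simply averaging AUCs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def reader_averaged_auc(scores, truth):
    """Mean AUC across readers; `scores` has shape (n_readers, n_cases)."""
    return float(np.mean([roc_auc_score(truth, s) for s in scores]))

# Simulated stand-in data: 19 readers score 265 nodules on a 0-100
# malignancy-potential scale, without and then with CAD assistance.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 265)                           # reference labels
without_cad = rng.normal(30 + 10 * truth, 19, (19, 265))  # wider spread, smaller shift
with_cad = rng.normal(30 + 14 * truth, 16, (19, 265))     # narrower spread, larger shift

gain = reader_averaged_auc(with_cad, truth) - reader_averaged_auc(without_cad, truth)
print(f"Reader-averaged AUC gain with CAD: {gain:.3f}")
```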

  • Research article
  • Open access
  • Published: 15 April 2024

What is quality in long covid care? Lessons from a national quality improvement collaborative and multi-site ethnography

  • Trisha Greenhalgh (ORCID: orcid.org/0000-0003-2369-8088) 1,
  • Julie L. Darbyshire 1,
  • Cassie Lee 2,
  • Emma Ladds 1 &
  • Jenny Ceolta-Smith 3

BMC Medicine volume 22, Article number: 159 (2024)


Long covid (post covid-19 condition) is a complex condition with diverse manifestations, uncertain prognosis and wide variation in current approaches to management. There have been calls for formal quality standards to reduce a so-called “postcode lottery” of care. The original aim of this study—to examine the nature of quality in long covid care and reduce unwarranted variation in services—evolved to focus on examining the reasons why standardizing care was so challenging in this condition.

In 2021–2023, we ran a quality improvement collaborative across 10 UK sites. The dataset reported here was mostly but not entirely qualitative. It included data on the origins and current context of each clinic, interviews with staff and patients, and ethnographic observations at 13 clinics (50 consultations) and 45 multidisciplinary team (MDT) meetings (244 patient cases). Data collection and analysis were informed by relevant lenses from clinical care (e.g. evidence-based guidelines), improvement science (e.g. quality improvement cycles) and philosophy of knowledge.

Participating clinics made progress towards standardizing assessment and management in some topics; some variation remained but this could usually be explained. Clinics had different histories and path dependencies, occupied a different place in their healthcare ecosystem and served a varied caseload including a high proportion of patients with comorbidities. A key mechanism for achieving high-quality long covid care was when local MDTs deliberated on unusual, complex or challenging cases for which evidence-based guidelines provided no easy answers. In such cases, collective learning occurred through idiographic (case-based) reasoning, in which practitioners build lessons from the particular to the general. This contrasts with the nomothetic reasoning implicit in evidence-based guidelines, in which reasoning is assumed to go from the general (e.g. findings of clinical trials) to the particular (management of individual patients).

Not all variation in long covid services is unwarranted. Largely because long covid’s manifestations are so varied and comorbidities common, generic “evidence-based” standards require much individual adaptation. In this complex condition, quality improvement resources may be productively spent supporting MDTs to optimise their case-based learning through interdisciplinary discussion. Quality assessment of a long covid service should include review of a sample of individual cases to assess how guidelines have been interpreted and personalized to meet patients’ unique needs.

Study registration

NCT05057260, ISRCTN15022307.


The term “long covid” [1] means prolonged symptoms following SARS-CoV-2 infection not explained by an alternative diagnosis [2]. It embraces the US term “post-covid conditions” (symptoms beyond 4 weeks) [3], the UK terms “ongoing symptomatic covid-19” (symptoms lasting 4–12 weeks) and “post covid-19 syndrome” (symptoms beyond 12 weeks) [4] and the World Health Organization’s “post covid-19 condition” (symptoms occurring beyond 3 months and persisting for at least 2 months) [5]. Long covid thus defined is extremely common. In the UK, for example, 1.8 million of a population of 67 million met the criteria for long covid in early 2023, and 41% of these had been unwell for more than 2 years [6].

Long covid is characterized by a constellation of symptoms which may include breathlessness, fatigue, muscle and joint pain, chest pain, memory loss and impaired concentration (“brain fog”), sleep disturbance, depression, anxiety, palpitations, dizziness, gastrointestinal problems such as diarrhea, skin rashes and allergy to food or drugs [2]. These lead to difficulties with essential daily activities such as washing and dressing, impaired exercise tolerance and ability to work, and reduced quality of life [2, 7, 8]. Symptoms typically cluster (e.g. in different patients, long covid may be dominated by fatigue, by breathlessness or by palpitations and dizziness) [9, 10]. Long covid may follow a fairly constant course or a relapsing and remitting one, perhaps with specific triggers [11]. Overlaps between fatigue-dominant subtypes of long covid, myalgic encephalomyelitis and chronic fatigue syndrome have been hypothesized [12] but at the time of writing remain unproven.

Long covid has been a contested condition from the outset. Whilst long-term sequelae following other coronavirus (SARS and MERS) infections were already well-documented [13], SARS-CoV-2 was originally thought to cause a short-lived respiratory illness from which the patient either died or recovered [14]. Some clinicians dismissed protracted or relapsing symptoms as due to anxiety or deconditioning, especially if the patient had not had laboratory-confirmed covid-19. People with long covid got together in online groups and shared accounts of their symptoms and experiences of such “gaslighting” in their healthcare encounters [15, 16]. Some groups conducted surveys on their members, documenting the wide range of symptoms listed in the previous paragraph and showing that whilst long covid is more commonly a sequel to severe acute covid-19, it can (rarely) follow a mild or even asymptomatic acute infection [17].

Early publications on long covid depicted a post-pneumonia syndrome which primarily affected patients who had been hospitalized (and sometimes ventilated) [18, 19]. Later, covid-19 was recognized to be a multi-organ inflammatory condition (the pneumonia, for example, was reclassified as pneumonitis) and its long-term sequelae attributed to a combination of viral persistence, dysregulated immune response (including auto-immunity), endothelial dysfunction and immuno-thrombosis, leading to damage to the lining of small blood vessels and (thence) interference with transfer of oxygen and nutrients to vital organs [20, 21, 22, 23, 24]. But most such studies were highly specialized, laboratory-based and written primarily for an audience of fellow laboratory researchers. Despite demonstrating mean differences in a number of metabolic variables, they failed to identify a reliable biomarker that could be used routinely in the clinic to rule a diagnosis of long covid in or out. Whilst the evidence base from laboratory studies grew rapidly, it had little influence on clinical management—partly because most long covid clinics had been set up with impressive speed by front-line clinical teams to address an immediate crisis, with little or no input from immunologists, virologists or metabolic specialists [25].

Studies of the patient experience revealed wide geographical variation in whether any long covid services were provided and (if they were) which patients were eligible for these and what tests and treatments were available [26]. An interim UK clinical guideline for long covid had been produced at speed and published in December 2020 [27], but it was uncertain about diagnostic criteria, investigations, treatments and prognosis. Early policy recommendations for long covid services in England, based on wide consultation across the UK, had proposed a tiered service with “tier 1” being supported self-management, “tier 2” generalist assessment and management in primary care, “tier 3” specialist rehabilitation or respiratory follow-up with oversight from a consultant physician and “tier 4” tertiary care for patients with complications or complex needs [28]. In 2021, ring-fenced funding was allocated to establish 90 multidisciplinary long covid clinics in England [29]; some clinics were also set up with local funding in Scotland and Wales. These clinics varied widely in eligibility criteria, referral pathways, staffing mix (some had no doctors at all) and investigations and treatments offered. A further policy document on improving long covid services was published in 2022 [30]; it recommended that specialist long covid clinics should continue, though the long-term funding of these services remains uncertain [31]. To build the evidence base for delivering long covid services, major programs of publicly funded research were commenced in both the UK [32] and the USA [33].

In short, at the time this study began (late 2021), there appeared to be much scope for a program of quality improvement which would capture fast-emerging research findings, establish evidence-based standards and ensure these were rapidly disseminated and consistently adopted across both specialist long covid services and in primary care.

Quality improvement collaboratives

The quality improvement movement in healthcare was born in the early 1980s when clinicians and policymakers in the US and UK [34, 35, 36, 37] began to draw on insights from outside the sector [38, 39, 40]. Adapting a total quality management approach that had previously transformed the Japanese car industry, they sought to improve efficiency, reduce waste, shift to treating the upstream causes of problems (hence preventing disease) and help all services approach the standards of excellence achieved by the best. They developed an approach based on (a) understanding healthcare as a complex system (especially its key interdependencies and workflows), (b) analysing and addressing variation within the system, (c) learning continuously from real-world data and (d) developing leaders who could motivate people and help them change structures and processes [41, 42, 43, 44].

Quality improvement collaboratives (originally termed “breakthrough collaboratives” [45]), in which representatives from different healthcare organizations come together to address a common problem, identify best practice, set goals, share data and initiate and evaluate improvement efforts [46], are one model used to deliver system-wide quality improvement. It is widely assumed that these collaboratives work because—and to the extent that—they identify, interpret and implement high-quality evidence (e.g. from randomized controlled trials).

Research on why quality improvement collaboratives succeed or fail has produced the following list of critical success factors: taking a whole-system approach, selecting a topic and goal that fits with organizations’ priorities, fostering a culture of quality improvement (e.g. that quality is everyone’s job), engagement of everyone (including the multidisciplinary clinical team, managers, patients and families) in the improvement effort, clearly defining people’s roles and contribution, engaging people in preliminary groundwork, providing organizational-level support (e.g. chief executive endorsement, protected staff time, training and support for teams, resources, quality-focused human resource practices, external facilitation if needed), training in specific quality improvement techniques (e.g. plan-do-study-act cycle), attending to the human dimension (including cultivating trust and working to ensure shared vision and buy-in), continuously generating reliable data on both processes (e.g. current practice) and outcomes (clinical, satisfaction) and a “learning system” infrastructure in which knowledge that is generated feeds into individual, team and organizational learning [47, 48, 49, 50, 51, 52, 53, 54].

The quality improvement collaborative approach has delivered many successes but it has been criticized at a theoretical level for over-simplifying the social science of human motivation and behaviour and for adopting a somewhat mechanical approach to the study of complex systems [55, 56]. Adaptations of the original quality improvement methodology (e.g. from Sweden [57, 58]) have placed greater emphasis on human values and meaning-making, on the grounds that reducing the complexities of a system-wide quality improvement effort to a set of abstract and generic “success factors” will miss unique aspects of the case such as historical path dependencies, personalities, framing and meaning-making and micropolitics [59].

Perhaps this explains why, when the abovementioned factors are met, a quality improvement collaborative’s success is more likely but is not guaranteed, as a systematic review demonstrated [60]. Some well-designed and well-resourced collaboratives addressing clear knowledge gaps produced few or no sustained changes in key outcome measures [49, 53, 60, 61, 62]. To identify why this might be, a detailed understanding of a service’s history, current challenges and contextual constraints is needed. This explains our decision, part-way through the study reported here, to collect rich contextual data on participating sites so as to better explain success or failure of our own collaborative.

Warranted and unwarranted variation in clinical practice

A generation ago, Wennberg described most variation in clinical practice as “unwarranted” (which he defined as variation in the utilization of health care services that cannot be explained by variation in patient illness or patient preferences) [63]. Others coined the term “postcode lottery” to depict how such variation allegedly impacted on health outcomes [64]. Wennberg and colleagues’ Atlas of Variation, introduced in 1999 [65], and its UK equivalent, introduced in 2010 [66], described wide regional differences in the rates of procedures from arthroscopy to hysterectomy, and were used to prompt services to identify and address examples of under-treatment, mis-treatment and over-treatment. Numerous similar initiatives, mostly based on hospital activity statistics, have been introduced around the world [66, 67, 68, 69]. Sutherland and Levesque’s proposed framework for analysing variation, for example, has three domains: capacity (broadly, whether sufficient resources are allocated at organizational level and whether individuals have the time and headspace to get involved), evidence (the extent to which evidence-based guidelines exist and are followed), and agency (e.g. whether clinicians are engaged with the issue and the effect of patient choice) [70].

Whilst it is clearly a good idea to identify unwarranted variation in practice, it is also important to acknowledge that variation can be warranted. The very act of measuring and describing variation carries great rhetorical power, since revealing geographical variation in any chosen metric effectively frames this as a problem with a conceptually simple solution (reducing variation) that will appeal to both politicians and the public [71]. The temptation to expose variation (e.g. via visualizations such as maps) and address it in mechanistic ways should be resisted until we have fully understood the reasons why it exists, which may include perverse incentives, insufficient opportunities to discuss cases with colleagues, weak or absent feedback on practice, unclear decision processes, contested definitions of appropriate care and professional challenges to guidelines [72].

Research question, aims and objectives

Research question.

What is quality in long covid care and how can it best be achieved?

To identify best practice and reduce unwarranted variation in UK long covid services.

To explain aspects of variation in long covid services that are or may be warranted.

Our original objectives were to:

Establish a quality improvement collaborative for 10 long covid clinics across UK.

Use quality improvement methods in collaboration with patients and clinic staff to prioritize aspects of care to improve. For each priority topic, identify best (evidence-informed) clinical practice, measure performance in each clinic, compare performance with a best practice benchmark and improve performance.

Produce organizational case studies of participating long covid clinics to explain their origins, evolution, leadership, ethos, population served, patient pathways and place in the wider healthcare ecosystem.

Examine these case studies to explain variation in practice, especially in topics where the quality improvement cycle proves difficult to follow or has limited impact.

The LOCOMOTION study

LOCOMOTION (LOng COvid Multidisciplinary consortium Optimising Treatments and services across the NHS) was a 30-month multi-site case study of 10 long covid clinics (8 in England, 1 in Wales and 1 in Scotland), beginning in 2021, which sought to optimise long covid care. Each clinic offered multidisciplinary care to patients referred from primary or secondary care (and, in some cases, self-referred), and held regular multidisciplinary team (MDT) meetings, mostly online via Microsoft Teams, to discuss cases. A study protocol for LOCOMOTION, with details of ethical approvals, management, governance and patient involvement, has been published [25]. The three main work packages addressed quality improvement, technology-supported patient self-management, and phenotyping and symptom clustering. This paper reports on the first work package, focusing mainly on qualitative findings.

Setting up the quality improvement collaborative

We broadly followed standard methodology for “breakthrough” quality improvement collaboratives [44, 45], with two exceptions. First, because of geographical distance, continuing pandemic precautions and developments in videoconferencing technology, meetings were held online. Second, unlike in the original breakthrough model, patients were included in the collaborative, reflecting the cultural change towards patient partnerships since the model was originally proposed 40 years ago.

Each site appointed a clinical research fellow (doctor, nurse or allied health professional) funded partly by the LOCOMOTION study and partly with clinical sessions; some were existing staff who were backfilled to take on a research role whilst others were new appointments. The quality improvement meetings were held approximately every 8 weeks on Microsoft Teams and lasted about 2 h; there was an agenda and a chair, and meetings were recorded with consent. The clinical research fellow from each clinic attended, sometimes joined by the clinical lead for that site. In the initial meeting, the group proposed and prioritized topics before merging their consensus with the list of priority topics generated separately by patients (there was much overlap but also some differences).

In subsequent meetings, participants attempted to reach consensus on how to define, measure and achieve quality for each priority topic in turn, implement this approach in their own clinic and monitor its impact. Clinical leads prepared illustrative clinical cases and summaries of the research evidence, which they presented using Microsoft PowerPoint; the group then worked towards consensus on the implications for practice through general discussion. Clinical research fellows assisted with literature searches, collected baseline data from their own clinic, prepared and presented anonymized case examples, and contributed to collaborative goal-setting for improvement. Progress on each topic was reviewed at a later meeting after an agreed interval.

An additional element of this work package was semi-structured interviews with 29 patients, recruited from 9 of the 10 participating sites, about their clinic experiences with a view to feeding into service improvement (in the other site, no patient volunteered).

Our patient advisory group initially met separately from the quality improvement collaborative. They designed a short survey of current practice and sent it to each clinic; the results of this informed a prioritization exercise for topics where they considered change was needed. The patient-generated list was tabled at the quality improvement collaborative discussions, but patients were understandably keen to join these discussions directly. After about 9 months, some patient advisory group members joined the regular collaborative meetings. This dynamic was not without its tensions, since sharing performance data requires trust and there were some concerns about confidentiality when real patient cases were discussed with other patients present.

How evidence-informed quality targets were set

At the time the study began, there were no published large-scale randomized controlled trials of any interventions for long covid. We therefore followed a model used successfully in other quality improvement efforts where research evidence was limited or absent, or did not translate unambiguously into models for current services. In such circumstances, the best evidence may be custom and practice in the best-performing units. The quality improvement effort becomes oriented to what one group of researchers called “potentially better practices”—that is, practices that are “developed through analysis of the processes of care, literature review, and site visits” (page 14) [73]. The idea was that facilitated discussion among clinical teams, drawing on published research where available but also incorporating clinical experience, established practice and systematic analysis of performance data across participating clinics, would surface these “potentially better practices”—an approach which, though not formally tested in controlled trials, appears to be associated with improved outcomes [46, 73].

Adding an ethnographic component

Following limited progress made on some topics that had been designated high priority, we interviewed all 10 clinical research fellows (either individually or, in two cases, with a senior clinician present) and 18 other clinic staff (five individually plus two groups of 5 and 8), along with additional informal discussions, to explore the challenges of implementing the changes that had been agreed. These interviews were not audiotaped but detailed notes were made and typed up immediately afterwards. It became evident that some aspects of what the collaborative had deemed “evidence-informed” care were contested by front-line clinic staff, perceived as irrelevant to the service they were delivering, or considered impossible to implement. To unpack these issues further, the research protocol was amended to include an ethnographic component.

TG and EL (academic general practitioners) and JLD (a qualitative researcher with a PhD in the patient experience) attended a total of 45 MDT meetings in participating clinics (mostly online or hybrid). Staff were informed in advance that there would be an observer present; nobody objected. We noted brief demographic and clinical details of cases discussed (but no identifying data), dilemmas and uncertainties on which discussions focused, and how different staff members contributed.

TG made 13 in-person visits to participating long covid clinics. Staff were notified in advance; all were happy to be observed. Visits lasted between 5 and 8 h (54 h in total). We observed support staff booking patients in and processing requests and referrals, and shadowed different clinical staff in turn as they saw patients. Patients were informed of our presence and its purpose beforehand and given the opportunity to decline (three of 53 patients approached did). We discussed aspects of each case with the clinician after the patient left. When invited, we took breaks with staff and used these as an opportunity to ask them informally what it was like working in the clinic.

Ethnographic observation, analysis and reporting were geared to generating a rich interpretive account of the clinical, operational and interpersonal features of each clinic—what Van Maanen calls an “impressionist tale” [74]. Our work was also guided by the principles set out by Golden-Biddle and Locke, namely authenticity (spending time in the field and basing interpretations on these direct observations), plausibility (creating a plausible account through rich persuasive description) and criticality (e.g. reflexively examining our own assumptions) [75]. Our collection and analysis of qualitative data were informed by our own professional backgrounds (two general practitioners, one physical therapist, two non-clinicians).

In both MDTs and clinics, we took contemporaneous notes by hand and typed these up immediately afterwards.

Data management and analysis

Typed interview notes and field notes from clinics were collated in a set of Word documents, one for each clinic attended. They were analysed thematically [76] with attention to the literature on quality improvement and variation (see “Background”). Interim summaries were prepared on each clinic, setting out the narrative of how it had been established, its ethos and leadership, setting and staffing, population served and key links with other parts of the local healthcare ecosystem.

Minutes and field notes from the quality improvement collaborative meetings were summarized topic by topic, including initial data collected by the researchers-in-residence, improvement actions taken (or attempted) in that clinic, and any follow-up data shared. Progress or lack of it was interpreted in relation to the contextual case summary for that clinic.

Patient cases seen in clinic, and those discussed by MDTs, were summarized as brief case narratives in Word documents. Using the constant comparative method [77], we produced an initial synthesis of the clinical picture and principles of management based on the first 10 patient cases seen, and refined this as each additional case was added. Demographic and brief clinical and social details were also logged on Excel spreadsheets. When writing up clinical cases, we used the technique of composite case construction (in which we drew on several actual cases to generate a fictitious one, thereby protecting anonymity whilst preserving key empirical findings [78]); any names reported in this paper are pseudonyms.

Member checking

A summary was prepared for each clinic, including a narrative of the clinic’s own history and a summary of key quality issues raised across the ten clinics. These summaries included examples from real cases in our dataset. These were shared with the clinical research fellow and a senior clinician from the clinic, and amended in response to feedback. We also shared these summaries with representatives from the patient advisory group.

Overview of dataset

This study generated three complementary datasets. First, the video recordings, minutes, and field notes of 12 quality improvement collaborative meetings, along with the evidence summaries prepared for these meetings and clinic summaries (e.g. descriptions of current practice, audits) submitted by the clinical research fellows. This dataset illustrated wide variation in practice, and (in many topics) gaps or ambiguities in the evidence base.

Second, interviews with staff ( n  = 30) and patients ( n  = 29) from the clinics, along with ethnographic field notes (approximately 100 pages) from 13 in-person clinic visits (54 h), including notes on 50 patient consultations (40 face-to-face, 6 telephone, 4 video). This dataset illustrated the heterogeneity among the ten participating clinics.

Third, field notes (approximately 100 pages), including discussions on 244 clinical cases from the 45 MDT meetings (49 h) that we observed. This dataset revealed further similarities and contrasts among clinics in how patients were managed. In particular, it illustrated how, for the complex patients whose cases were presented at these meetings, teams made sense of, and planned for, each case through multidisciplinary dialogue. This dialogue typically began with one staff member presenting a detailed clinical history along with a narrative of how it had affected the patient’s life and what was at stake for them (e.g. job loss), after which professionals from various backgrounds (nursing, physical therapy, occupational therapy, psychology, dietetics, and different medical specialties) joined in a discussion about what to do.

The ten participating sites are summarized in Table 1.

In the next two sections, we explore two issues—difficulty defining best practice and the heterogeneous nature of the clinics—that were key to explaining why quality, when pursued in a 10-site collaborative, proved elusive. We then briefly summarize patients’ accounts of their experience in the clinics and give three illustrative examples of the elusiveness of quality improvement using selected topics that were prioritized in our collaborative: outcome measures, investigation of palpitations and management of fatigue. In the final section of the results, we describe how MDT deliberations proved crucial for local quality improvement. Further detail on clinical priority topics will be presented in a separate paper.

“Best practice” in long covid: uncertainty and conflict

The study period (September 2021 to December 2023) corresponded with an exponential increase in published research on long covid. Despite this, the quality improvement collaborative found few unambiguous recommendations for practice. This gap between what the research literature offered and what clinical practice needed was partly ontological (relating to what long covid is). One major bone of contention between patients and clinicians (also evident in discussions with our patient advisory group), for example, was how far (and in whom) clinicians should look for and attempt to treat the various metabolic abnormalities that had been documented in laboratory research studies. The literature on this topic was extensive but conflicting [ 20 , 21 , 22 , 23 , 24 , 79 , 80 , 81 , 82 ]; it was heavy on biological detail but light on clinical application.

Patients were often aware of particular studies that appeared to offer plausible molecular or cellular explanations for symptom clusters along with a drug (often repurposed and off-label) whose mechanism of action appeared to be a good fit with the metabolic chain of causation. In one clinic, for example, we were shown an email exchange between a patient (not medically qualified) and a consultant, in which the patient asked them to reconsider their decision not to prescribe low-dose naltrexone, an opioid receptor antagonist with anti-inflammatory properties. The request included a copy of a peer-reviewed academic paper describing a small, uncontrolled pre-post study (i.e. a weak study design) in which this drug appeared to improve symptoms and functional performance in patients with long covid, as well as a mechanistic argument explaining why the patient felt this drug was a plausible choice in their own case.

This patient’s clinician, in common with most clinicians delivering front-line long covid services, considered that the evidence for such mechanism-based therapies was weak. Clinicians generally felt that this evidence, whilst promising, did not yet support routine measurement of clotting factors, antibodies, immune cells or other biomarkers, or the prescription of mechanism-based therapies such as antivirals, anti-inflammatories or anticoagulants. Low-dose naltrexone, for example, is currently being tested in at least one randomized controlled trial (ClinicalTrials.gov identifier NCT05430152), which had not reported at the time of our observations.

Another challenge to defining best practice lay in the oft-repeated phrase that long covid is a “diagnosis by exclusion”: the high prevalence of comorbidities meant that the “pure” long covid patient, untainted by other potential explanations for their symptoms, was a textbook ideal. In one MDT, for example, we observed a discussion about a patient who had had both swab-positive covid-19 and erythema migrans (a sign of Lyme disease) in the weeks before developing fatigue, yet local diagnostic criteria for each condition required the other to be excluded.

The logic of management in most participating clinics was pragmatic: prompt multidisciplinary assessment and treatment with an emphasis on obtaining a detailed clinical history (including premorbid health status), excluding serious complications (“red flags”), managing specific symptom clusters (for example, physical therapy for breathing pattern disorder), treating comorbidities (for example, anaemia, diabetes or menopause) and supporting whole-person rehabilitation [ 7 , 83 ]. The evidentiary questions raised in MDT discussions (which did not include patients) addressed the practicalities of the rehabilitation model (for example, whether cognitive therapy for neurocognitive complications is as effective when delivered online as it is when delivered in-person) rather than the molecular or cellular mechanisms of disease. For example, the question of whether patients with neurocognitive impairment should be tested for micro-clots or treated with anticoagulants never came up in the MDTs we observed, though we did visit a tertiary referral clinic (the tier 4 clinic in site H), whose lead clinician had a research interest in inflammatory coagulopathies and offered such tests to selected patients.

Because long covid typically produces dozens of symptoms that tend to be uniquely patterned in each patient, the uncertainties on which MDT discussions turned were rarely about general evidence of the kind that might be found in a guideline (e.g. how should fatigue be managed?). Rather they concerned particular case-based clinical decisions (e.g. how should this patient’s fatigue be managed, given the specifics of this case?). An example from our field notes illustrates this:

Physical therapist presents the case of a 39-year-old woman who works as a cleaner on an overnight ferry. Has had long covid for 2 years. Main symptoms are shortness of breath and possible anxiety attacks, especially when at work. She has had a course of physical therapy to teach diaphragmatic breathing but has found that focusing on her breathing makes her more anxious. Patient has to do a lot of bending in her job (e.g. cleaning toilets and under seats), which makes her dizzy, but Active Stand Test was normal. She also has very mild tricuspid incompetence [someone reads out a cardiology report—not hemodynamically significant].
Rehabilitation guidelines (e.g. WHO) recommend phased return to work (e.g. with reduced hours) and frequent breaks. “Tricky!” says someone. The job is intense and busy, and the patient can’t afford not to work. Discussion on whether all her symptoms can be attributed to tension and anxiety. Physical therapist who runs the breathing group says, “No, it’s long covid”, and describes severe initial covid-19 episode and results of serial chest X-rays which showed gradual clearing of ground glass shadows. Team discussion centers on how to negotiate reduced working hours in this particular job, given the overnight ferry shifts. —MDT discussion, Site D

This example raises important considerations about the nature of clinical knowledge in long covid. We return to it in the final section of the “Results” and in the “Discussion”.

Long covid clinics: a heterogeneous context for quality improvement

Most participating clinics had been established in mid-2020 to follow up patients who had been hospitalized (and perhaps ventilated) for severe acute covid-19. As mass vaccination reduced the severity of acute covid-19 for most people, the patient population in all clinics progressively shifted to include fewer “post-ICU [intensive care unit]” patients (in whom respiratory symptoms almost always dominated), and more people referred by their general practitioners or other secondary care specialties who had not been hospitalized for their acute covid-19 infection, and in whom fatigue, brain fog and palpitations were often the most troubling symptoms. Despite these similarities, the ten clinics had very different histories, geographical and material settings, staffing structures, patient pathways and case mix, as Table  1 illustrates. Below, we give more detail on three example sites.

Site C was established as a generalist “assessment-only” service by a general practitioner with an interest in infectious diseases. It is led jointly by that general practitioner and an occupational therapist, assisted by professionals from a wide range of disciplines including speech and language therapy, dietetics, clinical psychology and community-based physical therapy and occupational therapy. It has close links with a chronic fatigue service and a pain clinic that have been running in the locality for over 20 years. The clinic, which is entirely virtual (staff consult either from home or from a small side office in the community trust building), is physically based in a low-rise building on the industrial outskirts of a large town, sharing office space with various community-based health and social care services. Following a 1-h telephone consultation by one of the clinical leads, each patient is discussed at the MDT and then either discharged back to their general practitioner with a detailed management plan or referred on to one of the specialist services. This arrangement evolved to address a particular problem in this locality—that many patients with long covid were being referred by their general practitioner to multiple specialties (e.g. respiratory, neurology, fatigue), leading to a fragmented patient experience, unnecessary specialist assessments and wasteful duplication. The generalist assessment by telephone is oriented to documenting what is often a complex illness narrative (including pre-existing physical and mental comorbidities) and working with the patient to prioritize which symptoms or problems to pursue in which order.

Site E, in a well-regarded inner-city teaching hospital, had been set up in 2020 by a respiratory physician. Its initial ethos and rationale had been “respiratory follow-up”, with strong emphasis on monitoring lung damage via repeated imaging and lung function tests and on ensuring that patients received specialist physical therapy to “re-learn” efficient breathing techniques. Over time, this site has tried to accommodate a more multi-system assessment, with the introduction of a consultant-led infectious disease clinic for patients without a dominant respiratory component, reflecting the shift towards a more fatigue-predominant case mix. At the time of our fieldwork, each patient was seen in turn by a physician, psychologist, occupational therapist and respiratory physical therapist (half an hour each) before all four staff reconvened in a face-to-face MDT meeting to form a plan for each patient. But whilst a wide range of patients with diverse symptoms were discussed at these meetings, there remained a strong focus on respiratory pathology (e.g. tracking improvements in lung function and ensuring that coexisting asthma was optimally controlled).

Site F, one of the first long covid clinics in the UK, was set up by a rehabilitation consultant who had been drafted to work on the ICU during the first wave of covid-19 in early 2020. He had a longstanding research interest in whole-patient rehabilitation, especially the assessment and management of chronic fatigue and pain. From the outset, clinic F was more oriented to rehabilitation, including vocational rehabilitation to help patients return to work. There was less emphasis on monitoring lung function or pursuing respiratory comorbidities. At the time of our fieldwork, clinic F offered both a community-based service (“tier 2”) led by an occupational therapist, supported by a respiratory physical therapist and psychologist, and a hospital-based service (“tier 3”) led by the rehabilitation consultant, supported by a wider MDT. Staff in both tiers emphasized that each patient needs a full physical and mental assessment and help to set and work towards achievable goals, whilst staying within safe limits so as to avoid post-exertional symptom exacerbation. Because of the research interest of the lead physician, clinic F adapted well to the growing numbers of patients with fatigue and quickly set up research studies on this cohort [ 84 ].

Details of the other seven sites are shown in Table 1. Broadly speaking, sites B, E, G and H aligned with the “respiratory follow-up” model and sites F and I aligned with the “rehabilitation” model. Sites A and J each had a high-volume, multi-tiered service whose community tier aligned with the “holistic GP assessment” model (site C above) and which also offered a hospital-based, rehabilitation-focused tier. The small service in Scotland (site D) had evolved from an initial respiratory focus to become part of the infectious diseases (ME/CFS) service; Lyme disease (another infectious disease whose sequelae include chronic fatigue) was also prevalent in this region.

The patient experience

Whilst the ten participating clinics were very diverse in staffing, ethos and patient flows, the 29 patient interviews described remarkably consistent clinic experiences. Almost all identified the biggest problems to be the extended wait of several months before they were seen and the limited awareness (when initially referred) of what long covid clinics could provide. Some talked of how they cried with relief when they finally received an appointment. When the quality improvement collaborative was initially established, waiting times and bottlenecks were patients’ top priority for quality improvement, and this ranking was shared by clinic staff, who were very aware of how much delays and uncertainties in assessment and treatment compounded patients’ suffering. This issue resolved to a large extent over the study period in all clinics as the referral backlog cleared and the incidence of new cases of long covid fell [ 85 ]; it will be covered in more detail in a separate publication.

Most patients in our sample were satisfied with the care they received when they were finally seen in clinic, especially how they finally felt “heard” after a clinician took a full history. They were relieved to receive affirmation of their experience, a diagnosis of what was wrong and reassurance that they were believed. They were grateful for the input of different members of the multidisciplinary teams and commented on the attentiveness, compassion and skill of allied professionals in particular (“she was wonderful, she got me breathing again”—patient BIR145 talking about a physical therapist). One or two patient participants expressed confusion about who exactly they had seen and what advice they had been given, and some did not realize that a telephone assessment had been an actual clinical consultation. A minority expressed disappointment that an expected investigation had not been ordered (one commented that they had not had any blood tests at all). Several had assumed that the help and advice from the long covid clinic would continue to be offered until they were better and were disappointed that they had been discharged after completing the various courses on offer (since their clinic had been set up as an “assessment only” service).

In the next sections, we give examples of topics raised in the quality improvement collaborative and how they were addressed.

Example quality topic 1: Outcome measures

The first topic considered by the quality improvement collaborative was how (that is, using which measures and metrics) to assess and monitor patients with long covid. In the absence of a validated biomarker, various symptom scores and quality of life scales—both generic and disease-specific—were mooted. Site F had already developed and validated a patient-reported outcome measure (PROM), the C19-YRS (Covid-19 Yorkshire Rehabilitation Scale), and used it for both research and clinical purposes [ 86 ]. It was quickly agreed that, for the purposes of generating comparative research findings across the ten clinics, the C19-YRS should be used at all sites and completed by patients three-monthly. A commercial partner produced an electronic version of this instrument and an app for patient smartphones. The quality improvement collaborative also agreed that patients should be asked to complete the EuroQol EQ5D, a widely used generic health-related quality of life scale [ 87 ], in order to facilitate comparisons between long covid and other chronic conditions.

In retrospect, the discussions which led to the unopposed adoption of these two measures as a “quality” initiative in clinical care were somewhat aspirational. A review of progress at a subsequent quality improvement meeting revealed considerable variation among clinics, with a wide variety of measures used in different clinics to different degrees. Reasons for this variation were multiple. First, although our patient advisory group were keen that we should gather as much data as possible on the patient experience of this new condition, many clinic patients found the long questionnaires exhausting to complete due to cognitive impairment and fatigue. In addition, whilst patients were keen to answer questions on symptoms that troubled them, many had limited patience to fill out repeated surveys on symptoms that did not trouble them (“it almost felt as if I’ve not got long covid because I didn’t feel like I fit the criteria as they were laying it out”—patient SAL001). Staff assisted patients in completing the measures when needed, but this was time-consuming (up to 45 min per instrument) and burdensome for both staff and patients. In clinics where a high proportion of patients required assistance, staff time was the rate-limiting factor for how many instruments were completed. For some patients, one short instrument was the most that could be asked of them, and the clinician made a judgement on which one would be in their best interests on the day.

The second reason for variation was that the clinical diagnosis and management of particular features, complications and comorbidities of long covid required more nuance than was provided by these relatively generic instruments, and the level of detail sought varied with the specialist interest of the clinic (and the clinician). The modified C19-YRS [ 88 ], for example, contained 19 items, of which one asked about sleep quality. But if a patient had sleep difficulties, many clinicians felt that these needed to be documented in more detail—for example using the 8-item Epworth Sleepiness Scale, originally developed for conditions such as narcolepsy and obstructive sleep apnea [ 89 ]. The “Epworth score” was essential currency for referrals to some but not all specialist sleep services. Similarly, the C19-YRS had three items relating to anxiety, depression and post-traumatic stress disorder, but in clinics where there was a strong focus on mental health (e.g. when there was a resident psychologist), patients were usually invited to complete more specific tools (e.g. the Patient Health Questionnaire 9 [ 90 ], a 9-item questionnaire originally designed to assess severity of depression).

The third reason for variation was custom and practice. Ethnographic visits revealed that paper copies of certain instruments were routinely stacked on clinicians’ desks in outpatient departments and also (in some cases) handed out by administrative staff in waiting areas so that patients could complete them before seeing the clinician. These familiar clinic artefacts tended to be short (one-page) instruments that had a long tradition of use in clinical practice. They were not always fit for purpose. For example, the Nijmegen questionnaire was developed in the 1980s to assess hyperventilation; it was validated against a longer, “gold standard” instrument for that condition [ 91 ]. It subsequently became popular in respiratory clinics to diagnose or exclude breathing pattern disorder (a condition in which the normal physiological pattern of breathing becomes replaced with less efficient, shallower breathing [ 92 ]), so much so that the researchers who developed the instrument published a paper to warn fellow researchers that it had not been validated for this purpose [ 93 ]. Whilst a validated 17-item instrument for breathing pattern disorder (the Self-Evaluation of Breathing Questionnaire [ 94 ]) does exist, it is not in widespread clinical use. Most clinics in LOCOMOTION used the Nijmegen questionnaire either on all patients (e.g. as part of a comprehensive initial assessment, especially if the service had begun as a respiratory follow-up clinic) or when breathing pattern disorder was suspected.

In sum, the use of outcome measures in long covid clinics was a compromise between standardization and contingency. On the one hand, all clinics accepted the need to use “validated” instruments consistently. On the other hand, there were sometimes good reasons why they deviated from agreed practice, including mismatch between the clinic’s priorities as a research site, its priorities as a clinical service, and the particular clinical needs of a patient; the clinic’s—and the clinician’s—specialist focus; and long-held traditions of using particular instruments with which staff and patients were familiar.

Example quality topic 2: Postural orthostatic tachycardia syndrome (POTS)

Palpitations (common in long covid) and postural orthostatic tachycardia syndrome (POTS, a disproportionate acceleration in heart rate on standing, the assumed cause of palpitations in many long covid patients) were the top priority for quality improvement identified by our patient advisory group. Reflecting discussions and evidence (of various kinds) shared in online patient communities, the group were confident that POTS is common in long covid patients and that many cases remain undetected (perhaps misdiagnosed as anxiety). Their request that all long covid patients should be “screened” for POTS prompted a search for, and synthesis of, evidence (which we published in the BMJ [ 95 ]). In sum, that evidence was sparse and contested but, combined with standard practice in specialist clinics, broadly supported the judicious use of the NASA Lean Test [ 96 ]. This test involves repeated measurements of pulse and blood pressure with the patient first lying and then standing (with shoulders resting against a wall).
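For readers unfamiliar with orthostatic testing, the sketch below illustrates the decision logic that a lying-then-standing protocol of this kind feeds into. It is an illustrative simplification only: the heart-rate and blood-pressure thresholds reflect widely cited consensus criteria for POTS (a sustained heart-rate rise of at least 30 beats per minute within 10 min of standing, in the absence of a marked fall in blood pressure), but the “borderline” band, the Reading structure and the classify_lean_test function are our own assumptions for illustration and are not taken from the LOCOMOTION protocol or the published study [ 98 ].

# A minimal, illustrative sketch (not clinical software). The Reading
# dataclass and classify_lean_test function are hypothetical helpers.
from dataclasses import dataclass
from typing import List

@dataclass
class Reading:
    minutes_standing: float  # time since standing (use 0 for the supine baseline)
    heart_rate: int          # beats per minute
    systolic_bp: int         # mmHg
    diastolic_bp: int        # mmHg

def classify_lean_test(supine: Reading, standing: List[Reading],
                       hr_rise_threshold: int = 30) -> str:
    """Classify lying-then-standing observations against simplified
    POTS-pattern criteria. Uses the maximum rise within 10 min rather
    than a sustained rise - a deliberate simplification."""
    within_10 = [r for r in standing if r.minutes_standing <= 10]
    if not within_10:
        raise ValueError("no standing readings within 10 minutes")
    # A marked fall in blood pressure on standing suggests orthostatic
    # hypotension, which points away from POTS.
    if any(supine.systolic_bp - r.systolic_bp >= 20 or
           supine.diastolic_bp - r.diastolic_bp >= 10 for r in within_10):
        return "blood pressure fall: consider orthostatic hypotension"
    max_rise = max(r.heart_rate - supine.heart_rate for r in within_10)
    if max_rise >= hr_rise_threshold:
        return "positive"
    if max_rise >= hr_rise_threshold - 10:  # assumed borderline band
        return "borderline"
    return "negative"

# Example: supine baseline, then readings at 2, 5 and 10 min of standing.
baseline = Reading(0, 68, 118, 76)
upright = [Reading(2, 92, 116, 78), Reading(5, 101, 114, 77),
           Reading(10, 104, 115, 78)]
print(classify_lean_test(baseline, upright))  # prints "positive" (rise of 36 bpm)

In practice, as described below, clinics adapted the test to their own settings (including patient self-assessment at home), and the interpretation of any result remained a matter of clinical judgement.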

The patient advisory group’s request that the NASA Lean Test should be conducted on all patients met with mixed responses from the clinics. In site F, the lead physician had an interest in autonomic dysfunction in chronic fatigue and was keen; he had already published a paper on how to adapt the NASA Lean Test for self-assessment at home [ 97 ]. Several other sites were initially opposed. Staff at site E, for example, offered various arguments:

The test is time-consuming, labor-intensive, and takes up space in the clinic which has an opportunity cost in terms of other potential uses;

The test is unvalidated and potentially misleading (there is a high incidence of both false negative and false positive results);

There is no proven treatment for POTS, so there is no point in testing for it;

It is a specialist test for a specialist condition, so it should be done in a specialist clinic where its benefits and limitations are better understood;

Objective testing does not change clinical management since what we treat is the patient’s symptoms (e.g. by a pragmatic trial of lifestyle measures and medication);

People with symptoms suggestive of dysautonomia have already been “triaged out” of this clinic (that is, identified in the initial telephone consultation and referred directly to neurology or cardiology);

POTS is a manifestation of the systemic nature of long covid; it does not need specific treatment but will improve spontaneously as the patient goes through standard interventions such as active pacing, respiratory physical therapy and sleep hygiene;

Testing everyone, even when asymptomatic, runs counter to the ethos of rehabilitation, which is to “de-medicalize” patients so as to better orient them to their recovery journey.

When clinics were invited to implement the NASA Lean Test on a consecutive sample of patients to resolve a dispute about the incidence of POTS (from “we’ve only seen a handful of people with it since the clinic began” to “POTS is common and often missed”), all but one site agreed to participate. The tertiary POTS centre linked to site H was already running the NASA Lean Test as standard on all patients. Site C, which operated entirely virtually, passed the work to the referring general practitioner by making this test a precondition for seeing the patient; site D, which was largely virtual, sent instructions for patients to self-administer the test at home.

The NASA Lean Test study has been published separately [ 98 ]. In sum, of 277 consecutive patients tested across the eight clinics, 20 (7%) had a positive NASA Lean Test for POTS and a further 28 (10%) a borderline result. Six of the 20 patients who met the criteria for POTS on testing had no prior history of orthostatic intolerance. The question of whether this test should be used to “screen” all patients was not answered definitively. But the experience of participating in the study persuaded some sceptics that postural changes in heart rate could be severe in some long covid patients, did not appear to be fully explained by their previously held theories (e.g. “functional”, anxiety, deconditioning), and had likely been missed in some patients. The outcome of this particular quality improvement cycle was thus not a wholesale change in practice (for which the evidence base was weak) but a more subtle increase in clinical awareness, a greater willingness to consider testing for POTS and a greater commitment to contribute to research into this contested condition.

More generally, the POTS audit prompted some clinicians to recognize the value of quality improvement in novel clinical areas. One physician who had initially commented that POTS was not seen in their clinic, for example, reflected:

“Our clinic population is changing. […] Overall there’s far fewer post-ICU patients with ECMO [extra-corporeal membrane oxygenation] issues and far more long covid from the community, and this is the bit our clinic isn’t doing so well on. We’re doing great on breathing pattern disorder; neuro[logists] are helping us with the brain fogs; our fatigue and occupational advice is ok but some of the dysautonomia symptoms that are more prevalent in the people who were not hospitalized – that’s where we need to improve.” —Respiratory physician, site G (from field visit 6.6.23)

Example quality topic 3: Management of fatigue

Fatigue was the commonest symptom overall and a high priority among both patients and clinicians for quality improvement. It often coexisted with the cluster of neurocognitive symptoms known as brain fog, with both conditions relapsing and remitting in step. Clinicians were keen to systematize fatigue management using a familiar clinical framework oriented around documenting a full clinical history, identifying associated symptoms, excluding or exploring comorbidities and alternative explanations (e.g. poor sleep patterns, depression, menopause, deconditioning), assessing how fatigue affects physical and mental function, implementing a program of physical and cognitive therapy that was sensitive to the patient’s condition and confidence level, and monitoring progress using validated patient-reported outcome measures and symptom diaries.

The underpinning logic of this approach, which broadly reflected World Health Organization guidance [ 99 ], was that fatigue and linked cognitive impairment could be a manifestation of many—perhaps interacting—conditions but that a whole-patient (body and mind) rehabilitation program was the cornerstone of management in most cases. Discussion in the quality improvement collaborative focused on issues such as whether fatigue was so severe that it produced safety concerns (e.g. in a person’s job or with childcare), the pros and cons of particular online courses such as yoga, relaxation and mindfulness (many were viewed positively, though the evidence base was considered weak), and the extent to which respiratory physical therapy had a crossover impact on fatigue (systematic reviews suggested that it may do, but these reviews also cautioned that primary studies were sparse, methodologically flawed, and heterogeneous [ 100 , 101 ]). They also debated the strengths and limitations of different fatigue-specific outcome measures, each of which had been developed and validated in a different condition, with varying emphasis on cognitive fatigue, physical fatigue, effect on daily life, and motivation. These instruments included the Modified Fatigue Impact Scale; Fatigue Severity Scale [ 102 ]; Fatigue Assessment Scale; Functional Assessment of Chronic Illness Therapy—Fatigue (FACIT-F) [ 103 ]; Work and Social Adjustment Scale [ 104 ]; Chalder Fatigue Scale [ 105 ]; Visual Analogue Scale—Fatigue [ 106 ]; and the EQ5D [ 87 ]. In one clinic (site F), three of these scales were used in combination for reasons discussed below.

Some clinicians advocated melatonin or nutritional supplements (such as vitamin D or folic acid) for fatigue on the grounds that many patients found them helpful and formal placebo-controlled trials were unlikely ever to be conducted. But neurostimulants used in other fatigue-predominant conditions (e.g. brain injury, stroke), which also lacked clinical trial evidence in long covid, were viewed as inappropriate in most patients because of lack of evidence of clear benefit and hypothetical risk of harm (e.g. adverse drug reactions, polypharmacy).

Whilst the patient advisory group were broadly supportive of a whole-patient rehabilitative approach to fatigue, their primary concern was fatiguability, especially post-exertional symptom exacerbation (PESE, also known as “crashes”). In these episodes, the patient becomes profoundly fatigued some hours or days after physical or mental exertion, and this state can last for days or even weeks [ 107 ]. Patients viewed PESE as a “red flag” symptom which they felt clinicians often missed and sometimes caused. They wanted the quality improvement effort to focus on ensuring that all clinicians were aware of the risks of PESE and acted accordingly. A discussion among patients and clinicians at a quality improvement collaborative meeting raised a new research hypothesis—that reducing the number of repeated episodes of PESE may improve the natural history of long covid.

These tensions around fatigue management played out differently in different clinics. In site C (the GP-led virtual clinic run from a community hub), fatigue was viewed as one manifestation of a whole-patient condition. The lead general practitioner used the metaphor of untangling a skein of wool: “you have to find the end and then gently pull it”. The underlying problem in a fatigued patient, for example, might be inadequate pacing, an undiagnosed physical condition such as anaemia, disturbed sleep, or an unexplored mental health issue. These required (respectively) the chronic fatigue service (comprising an occupational therapist and specialist psychologist and oriented mainly to teaching the techniques of goal-setting and pacing), a “tiredness” work-up (e.g. to exclude anaemia or menopause), investigation of poor sleep (which, not uncommonly, was due to obstructive sleep apnea), and exploration of mental health issues.

In site G (a hospital clinic which had evolved from a respiratory service), patients with fatigue went through a fatigue management program led by the occupational therapist with emphasis on pacing, energy conservation, avoidance of PESE and sleep hygiene. Those without ongoing respiratory symptoms were often discharged back to their general practitioner once they had completed this; there was no consultant follow-up of unresolved fatigue.

In site F (a rehabilitation clinic which had a longstanding interest in chronic fatigue even before the pandemic), active interdisciplinary management of fatigue was commenced at or near the patient’s first visit, on the grounds that the earlier this began, the more successful it would be. In this clinic, patients were offered a more intensive package: an occupational therapy-led fatigue course similar to that in site G, plus input from a dietician to advise on regular balanced meals and caffeine avoidance and a group-based facilitated peer support program which centred on fatigue management. The dietician spoke enthusiastically about how improving diet in longstanding long covid patients often improved fatigue (e.g. because they had often lost muscle mass and tended to snack on convenience food rather than make meals from scratch), though she agreed there was no evidence base from trials to support this approach.

Pursuing local quality improvement through MDTs

Whilst some long covid patients had “textbook” symptoms and clinical findings, many cases were unique and some were fiendishly complex. One clinician commented that, somewhat paradoxically, “easy cases” were often the post-ICU follow-ups who had resolving chest complications; they tended to do well with a course of respiratory physical therapy and a return-to-work program. Such cases were rarely brought to MDT meetings. “Difficult cases” were patients who had not been hospitalized for their acute illness but presented with a months- or years-long history of multiple symptoms with fatigue typically predominant. Each one was different, as the following example (some details of which have been fictionalized to protect anonymity) illustrates.

The MDT is discussing Mrs Fermah, a 65-year-old homemaker who had covid-19 a year ago. She has had multiple symptoms since, including fluctuating fatigue, brain fog, breathlessness, retrosternal chest pain of burning character, dry cough, croaky voice, intermittent rashes (sometimes on eating), lips going blue, ankle swelling, orthopnoea, dizziness with the room spinning which can be triggered by stress, low back pain, aches and pains in the arms and legs and pins and needles in the fingertips, loss of taste and smell, palpitations and dizziness (unclear if postural, but clear association with nausea), headaches on waking, and dry mouth. She is somewhat overweight (body mass index 29) and admits to low mood. Functionally, she is mostly confined to the house and can no longer manage the stairs so has begun to sleep downstairs. She has stumbled once or twice but not fallen. Her social life has ceased and she rarely has the energy to see her grandchildren. Her 70-year-old husband is retired and generally supportive, though he spends most evenings at his club. Comorbidities include glaucoma which is well controlled and overseen by an ophthalmologist, mild club foot (congenital) and stage 1 breast cancer 20 years ago. Various tests, including a chest X-ray, resting and exercise oximetry and a blood panel, were normal except for borderline vitamin D level. Her breathing questionnaire score suggests she does not have breathing pattern disorder. ECG showed first-degree atrioventricular block and left axis deviation. No clinician has witnessed the blue lips. Her current treatment is online group respiratory physical therapy; a home visit is being arranged to assess her climbing stairs. She has declined a psychologist assessment. The consultant asks the nurse who assessed her: “Did you get a feel if this is a POTS-type dizziness or an ENT-type?” She sighs. “Honestly it was hard to tell, bless her.”—Site A MDT

This patient’s debilitating symptoms and functional impairments could all be due to long covid, yet “evidence-based” guidance for how to manage her complex suffering does not exist and likely never will. The question of which (if any) additional blood or imaging tests to do, in what order of priority, and what interventions to offer the patient will not be definitively answered by consulting clinical trials involving hundreds of patients, since (even if these existed) the decision involves weighing this patient’s history and the multiple factors and uncertainties that are relevant in her case. The knowledge that will help the MDT provide quality care to Mrs Fermah is case-based knowledge—accumulated clinical experience and wisdom from managing and deliberating on multiple similar cases. We consider case-based knowledge further in the “Discussion”.

Summary of key findings

This study has shown that a quality improvement collaborative of UK long covid clinics made some progress towards standardizing assessment and management in some topics, though variation remained. This could be explained in part by the fact that different clinics had different histories and path dependencies, occupied a different place in the local healthcare ecosystem, served different populations, were differently staffed, and had different clinical interests. Our patient advisory group and clinicians in the quality improvement collaborative broadly prioritized the same topics for improvement but interpreted them somewhat differently. “Quality” long covid care had multiple dimensions, relating to (among other things) service set-up and accessibility, clinical provision appropriate to the patient’s need (including options for referral to other services locally), the human qualities of clinical and support staff, how knowledge was distributed across (and accessible within) the system, and the accumulated collective wisdom of local MDTs in dealing with complex cases (including multiple kinds of specialist expertise as well as relational knowledge of what was at stake for the patient). Whilst both staff and patients were keen to contribute to the quality improvement effort, the burden of measurement was evident: multiple outcome measures, used repeatedly, were resource-intensive for staff and exhausting for patients.

Strengths and limitations of this study

To our knowledge, we are the first to report both a quality improvement collaborative and an in-depth qualitative study of clinical work in long covid. Key strengths of this work include the diverse sampling frame (with sites from three UK jurisdictions and serving widely differing geographies and demographics); the use of documents, interviews and reflexive interpretive ethnography to produce meaningful accounts of how clinics emerged and how they were currently organized; the use of philosophical concepts to analyse data on how MDTs produced quality care on a patient-by-patient basis; and the close involvement of patient co-researchers and coauthors during the research and writing up.

Limitations of the study include its exclusive UK focus (the external validity of findings to other healthcare systems is unknown); the self-selecting nature of participants in a quality improvement collaborative (our patient advisory group suggested that the MDTs observed in this study may have represented the higher end of a quality spectrum, hence would be more likely than other MDTs to adhere to guidelines); and the particular perspective brought by the researchers (two GPs, a physical therapist and one non-clinical person) in ethnographic observations. Hospital specialists or organizational scholars, for example, may have noticed different things or framed what they observed differently.

Explaining variation in long covid care

Sutherland and Levesque’s framework mentioned in the “Background” section does not explain much of the variation found in our study [ 70 ]. In terms of capacity, at the time of this study most participating clinics benefited from ring-fenced resources. In terms of evidence, guidelines existed and were not greatly contested, but as illustrated by the case of Mrs Fermah above, many patients were exceptions to the guideline because of complex symptomatology and relevant comorbidities. In terms of agency, clinicians in most clinics were passionately engaged with long covid (they were pioneers who had set up their local clinic and successfully bid for national ring-fenced resources) and were generally keen to support patient choice (though not if the patient requested tests which were unavailable or deemed not indicated).

Atsma et al.’s list of factors that may explain variation in practice (see “Background”) includes several that may be relevant to long covid, especially that the definition of appropriate care in this condition remains somewhat contested. But lack of opportunity to discuss cases was not a problem in the clinics in our sample. On the contrary, MDT meetings in each locality gave clinicians multiple opportunities to discuss cases with colleagues and reflect collectively on whether and how to apply particular guidelines.

The key problem was not that clinicians disputed the guidelines for managing long covid or were unaware of them; it was that the guidelines were not self-interpreting. Rather, MDTs had to deliberate on the balance of benefits and harms in different aspects of individual cases. In patients whose symptoms suggested a possible diagnosis of POTS (or who suspected themselves of having POTS), for example, these deliberations were sometimes lengthy and nuanced. Should a test result that is not technically in the abnormal range but close to it be treated as diagnostic, given that symptoms point to this diagnosis? If not, should the patient be told that the test excludes POTS or that it is equivocal? If a cardiology opinion has stated firmly that the patient does not have POTS but the cardiologist is not known for their interest in this condition, should a second specialist opinion be sought? If the gold standard “tilt test” [ 108 ] for POTS (usually available only in tertiary centres) is not available locally, does this patient merit a costly out-of-locality referral? Should the patient’s request for a trial of off-label medication, reflecting discussions in an online support group, be honoured? These are the kinds of questions on which MDTs deliberated at length.

The fact that many cases required extensive deliberation does not necessarily justify variation in practice among clinics. But taking into account the clinics’ very different histories, set-up, and local referral pathways, the variation begins to make sense. A patient who is being assessed in a clinic that functions as a specialist chronic fatigue centre and attracts referrals which reflect this interest (e.g. site F in our sample) will receive different management advice from one that functions as a telephone-only generalist assessment centre and refers on to other specialties (site C in our sample). The wide variation in case mix, coupled with the fact that a different proportion of these cases were highly complex in each clinic (and in different ways), suggests that variation in practice may reflect appropriate rather than inappropriate care.

Our patient advisory group affirmed that many of the findings reported here resonated with their own experience, but they raised several concerns. These included questions about patient groups who may have been missed in our sample because they were rarely discussed in MDTs. The decision to take a case to MDT discussion is taken largely by a clinician, and there was evidence from online support groups that some patients’ requests for their case to be taken to an MDT had been declined (though not, to our knowledge, in the clinics participating in the LOCOMOTION study).

We began this study by asking “what is quality in long covid care?”. We initially assumed that this question referred to a generalizable evidence base, which we felt we could identify, and we believed that we could then determine whether long covid clinics were following the evidence base through conventional audits of structure, process, and outcome. In retrospect, these assumptions were somewhat naïve. On the basis of our findings, we suggest that a better (and more individualized) research question might be “to what extent does each patient with long covid receive evidence-based care appropriate to their needs?”. This question would require individual case review on a sample of cases, tracking each patient longitudinally including cross-referrals, and also interviewing the patient.

Nomothetic versus idiographic knowledge

In a series of lectures first delivered in the 1950s and recently republished [ 109 ], psychiatrist Dr Maurice O’Connor Drury drew on the later philosophy of his friend and mentor Ludwig Wittgenstein to challenge what he felt was a concerning trend: that the nomothetic (generalizable, abstract) knowledge from randomized controlled trials (RCTs) was coming to override the idiographic (personal, situated) knowledge about particular patients. Based on Wittgenstein’s writings on the importance of the particular, Drury predicted, presciently, that if implemented uncritically, RCTs would result in worse, not better, care for patients, since they would go hand in hand with a downgrading of experience, intuition, subjective judgement, personal reflection, and collective deliberation.

Much conventional quality improvement methodology is built on an assumption that nomothetic knowledge (for example, findings from RCTs and systematic reviews) is a higher form of knowing than idiographic knowledge. But idiographic, case-based reasoning—despite its position at the very bottom of evidence-based medicine’s hierarchy of evidence [ 110 ]—is a legitimate and important element of medical practice. Bioethicist Kathryn Montgomery, drawing on Aristotle’s notion of praxis, considers clinical practice to be an example of case-based reasoning [ 111 ]. Medicine is governed not by hard and fast laws but by competing maxims or rules of thumb; the essence of judgement is deciding which (if any) rule should be applied in a particular circumstance. Clinical judgement incorporates science (especially the results of well-conducted research) and makes use of available tools and technologies (including guidelines and decision-support algorithms that incorporate research findings). But rather than being determined solely by these elements, clinical judgement is guided both by the scientific evidence and by the practical and ethical question “what is it best to do, for this individual, given these circumstances?”.

In this study, we observed clinical management of, and MDT deliberations on, hundreds of clinical cases. In the more straightforward ones (for example, recovering pneumonitis), guideline-driven care was not difficult to implement and such cases were rarely brought to the MDT. But cases like Mrs Fermah (see last section of “Results”) required much discussion on which aspects of which guideline it was in the patient’s best interests to bring into play at any particular stage in their illness journey.

Conclusions

One systematic review of quality improvement collaboratives concluded that “[those] reporting success generally addressed relatively straightforward aspects of care, had a strong evidence base and noted a clear evidence-practice gap in an accepted clinical pathway or guideline” (page 226) [ 60 ]. The findings from this study suggest that to the extent that such collaboratives address clinical cases that are not straightforward, conventional quality improvement methods may be less useful and even counterproductive.

The question “what is quality in long covid care?” is partly a philosophical one. Our findings support an approach that recognizes and values idiographic knowledge—including establishing and protecting safe and supportive spaces in which individual cases can be deliberated, and valuing and drawing upon the collective learning that occurs in these spaces. It is through such deliberation that evidence-based guidelines can be appropriately interpreted and applied to the unique needs and circumstances of individual patients. We suggest that Drury’s warning about the limitations of nomothetic knowledge should prompt a reassessment of policies that rely too heavily on such knowledge and result in one-size-fits-all protocols. We also cautiously hypothesize that the need to centre the quality improvement effort on idiographic rather than nomothetic knowledge is unlikely to be unique to long covid. Indeed, such an approach may be particularly important in any condition that is complex, unpredictable, variable in presentation and clinical course, and associated with comorbidities.

Availability of data and materials

Selected qualitative data (with no identifiable information included) will be made available to formal research teams on reasonable request to Professor Greenhalgh at the University of Oxford, on condition that they have research ethics approval and relevant expertise. The quantitative data on the NASA Lean Test have been published in full in a separate paper [ 98 ].

Abbreviations

CFS: Chronic fatigue syndrome

ICU: Intensive care unit

JCS: Jenny Ceolta-Smith

JD: Julie Darbyshire

LOCOMOTION: LOng COvid Multidisciplinary consortium Optimising Treatments and services across the NHS

MDT: Multidisciplinary team

ME: Myalgic encephalomyelitis

MERS: Middle East Respiratory Syndrome

NASA: National Aeronautics and Space Administration

OT: Occupational therapy/ist

PESE: Post-exertional symptom exacerbation

POTS: Postural orthostatic tachycardia syndrome

SALT: Speech and language therapy

SARS: Severe Acute Respiratory Syndrome

TG: Trisha Greenhalgh

UK: United Kingdom

US: United States

WHO: World Health Organization

Perego E, Callard F, Stras L, Melville-Johannesson B, Pope R, Alwan N. Why the Patient-Made Term “Long Covid” is needed. Wellcome Open Res. 2020;5:224.

Greenhalgh T, Sivan M, Delaney B, Evans R, Milne R. Long covid—an update for primary care. BMJ. 2022;378:e072117.

Centers for Disease Control and Prevention (US). Long COVID or Post-COVID Conditions (updated 16th December 2022). Atlanta: CDC; 2022. Accessed 2nd June 2023 at https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html.

National Institute for Health and Care Excellence (NICE), Scottish Intercollegiate Guidelines Network (SIGN) and Royal College of General Practitioners (RCGP). COVID-19 rapid guideline: managing the long-term effects of COVID-19. London: NICE; 2022. Accessed 30th January 2022 at https://www.nice.org.uk/guidance/ng188/resources/covid19-rapid-guideline-managing-the-longterm-effects-of-covid19-pdf-51035515742.

World Health Organization. Post Covid-19 Condition (updated 7th December 2022). Geneva: WHO; 2022. Accessed 2nd June 2023 at https://www.who.int/europe/news-room/fact-sheets/item/post-covid-19-condition#:~:text=It%20is%20defined%20as%20the,months%20with%20no%20other%20explanation.

Office for National Statistics. Prevalence of ongoing symptoms following coronavirus (COVID-19) infection in the UK: 31st March 2023. London: ONS; 2023. Accessed 30th May 2023 at https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/datasets/alldatarelatingtoprevalenceofongoingsymptomsfollowingcoronaviruscovid19infectionintheuk.

Crook H, Raza S, Nowell J, Young M, Edison P. Long covid—mechanisms, risk factors, and management. BMJ. 2021;374.

Sudre CH, Murray B, Varsavsky T, Graham MS, Penfold RS, Bowyer RC, Pujol JC, Klaser K, Antonelli M, Canas LS. Attributes and predictors of long COVID. Nat Med. 2021;27(4):626–31.

Reese JT, Blau H, Casiraghi E, Bergquist T, Loomba JJ, Callahan TJ, Laraway B, Antonescu C, Coleman B, Gargano M. Generalisable long COVID subtypes: findings from the NIH N3C and RECOVER programmes. EBioMedicine. 2023;87.

Thaweethai T, Jolley SE, Karlson EW, Levitan EB, Levy B, McComsey GA, McCorkell L, Nadkarni GN, Parthasarathy S, Singh U. Development of a definition of postacute sequelae of SARS-CoV-2 infection. JAMA. 2023;329(22):1934–46.

Brown DA, O’Brien KK. Conceptualising Long COVID as an episodic health condition. BMJ Glob Health. 2021;6(9): e007004.

Tate WP, Walker MO, Peppercorn K, Blair AL, Edgar CD. Towards a Better Understanding of the Complexities of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome and Long COVID. Int J Mol Sci. 2023;24(6):5124.

Ahmed H, Patel K, Greenwood DC, Halpin S, Lewthwaite P, Salawu A, Eyre L, Breen A, Connor RO, Jones A. Long-term clinical outcomes in survivors of severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome coronavirus (MERS) outbreaks after hospitalisation or ICU admission: a systematic review and meta-analysis. J Rehabil Med. 2020;52(5):1–11.

World Health Organization. Clinical management of severe acute respiratory infection (SARI) when COVID-19 disease is suspected: interim guidance (13th March 2020). Geneva: WHO; 2020. Accessed 3rd January 2023 at https://t.co/JpNdP8LcV8?amp=1.

Rushforth A, Ladds E, Wieringa S, Taylor S, Husain L, Greenhalgh T. Long Covid – the illness narratives. Under review for Sociology of Health and Illness; 2021.

Russell D, Spence NJ, Chase J-AD, Schwartz T, Tumminello CM, Bouldin E. Support amid uncertainty: Long COVID illness experiences and the role of online communities. SSM-Qual Res Health. 2022;2:100177.

Ziauddeen N, Gurdasani D, O’Hara ME, Hastie C, Roderick P, Yao G, Alwan NA. Characteristics and impact of Long Covid: Findings from an online survey. PLoS ONE. 2022;17(3): e0264331.

Evans RA, McAuley H, Harrison EM, Shikotra A, Singapuri A, Sereno M, Elneima O, Docherty AB, Lone NI, Leavy OC. Physical, cognitive, and mental health impacts of COVID-19 after hospitalisation (PHOSP-COVID): a UK multicentre, prospective cohort study. Lancet Respir Med. 2021;9(11):1275–87.

Sykes DL, Holdsworth L, Jawad N, Gunasekera P, Morice AH, Crooks MG. Post-COVID-19 symptom burden: what is long-COVID and how should we manage it? Lung. 2021;199(2):113–9.

Altmann DM, Whettlock EM, Liu S, Arachchillage DJ, Boyton RJ. The immunology of long COVID. Nat Rev Immunol. 2023:1–17.

Klein J, Wood J, Jaycox J, Dhodapkar RM, Lu P, Gehlhausen JR, Tabachnikova A, Greene K, Tabacof L, Malik AA, et al. Distinguishing features of Long COVID identified through immune profiling. Nature. 2023.

Chen B, Julg B, Mohandas S, Bradfute SB. Viral persistence, reactivation, and mechanisms of long COVID. Elife. 2023;12: e86015.

Wang C, Ramasamy A, Verduzco-Gutierrez M, Brode WM, Melamed E. Acute and post-acute sequelae of SARS-CoV-2 infection: a review of risk factors and social determinants. Virol J. 2023;20(1):124.

Cervia-Hasler C, Brüningk SC, Hoch T, Fan B, Muzio G, Thompson RC, Ceglarek L, Meledin R, Westermann P, Emmenegger M, et al. Persistent complement dysregulation with signs of thromboinflammation in active Long Covid. Science. 2024;383(6680):eadg7942.

Sivan M, Greenhalgh T, Darbyshire JL, Mir G, O’Connor RJ, Dawes H, Greenwood D, O’Connor D, Horton M, Petrou S. LOng COvid Multidisciplinary consortium Optimising Treatments and servIces acrOss the NHS (LOCOMOTION): protocol for a mixed-methods study in the UK. BMJ Open. 2022;12(5): e063505.

Rushforth A, Ladds E, Wieringa S, Taylor S, Husain L, Greenhalgh T. Long covid–the illness narratives. Soc Sci Med. 2021;286: 114326.

National Institute for Health and Care Excellence. COVID-19 rapid guideline: managing the long-term effects of COVID-19. London: NICE; 2020. Accessed 4th October 2023 at https://www.nice.org.uk/guidance/ng188/resources/covid19-rapid-guideline-managing-the-longterm-effects-of-covid19-pdf-51035515742.

NHS England: Long COVID: the NHS plan for 2021/22. London: NHS England. Accessed 2nd August 2022 at https://www.england.nhs.uk/coronavirus/documents/long-covid-the-nhs-plan-for-2021-22/ ; 2021.

NHS England: NHS to offer ‘long covid’ sufferers help at specialist centres. London: NHS England. Accessed 10th October 2020 at https://www.england.nhs.uk/2020/10/nhs-to-offer-long-covid-help/ ; 2020 (7th October).

NHS England: The NHS plan for improving long COVID services, vol. Acessed 4th February 2024 at https://www.england.nhs.uk/publication/the-nhs-plan-for-improving-long-covid-services/ .London: Gov.uk; 2022.

NHS England: Commissioning guidance for post-COVID services for adults, children and young people, vol. Accessed 6th February 2024 at https://www.england.nhs.uk/long-read/commissioning-guidance-for-post-covid-services-for-adults-children-and-young-people/ . London: gov.uk; 2023.

National Institute for Health Research: Researching Long Covid: Adressing a new global health challenge, vol. Accessed 9.8.23 at https://evidence.nihr.ac.uk/collection/researching-long-covid-addressing-a-new-global-health-challenge/ . London: NIHR; 2022.

Subbaraman N. NIH will invest $1 billion to study long COVID. Nature. 2021;591(7850):356–356.

Article   CAS   PubMed   Google Scholar  

Donabedian A. The definition of quality and approaches to its assessment and monitoring. Ann Arbor: Michigan; 1980.

Laffel G, Blumenthal D. The case for using industrial quality management science in health care organizations. JAMA. 1989;262(20):2869–73.

Maxwell RJ. Quality assessment in health. BMJ. 1984;288(6428):1470.

Berwick DM, Godfrey BA, Roessner J. Curing health care: New strategies for quality improvement. The Journal for Healthcare Quality (JHQ). 1991;13(5):65–6.

Deming WE. Out of the Crisis. Cambridge, MA: MIT Press; 1986.

Argyris C: Increasing leadership effectiveness: New York: J. Wiley; 1976.

Juran JM: A history of managing for quality: The evolution, trends, and future directions of managing for quality: Asq Press; 1995.

Institute of Medicine (US): Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy Press; 2001.

McNab D, McKay J, Shorrock S, Luty S, Bowie P. Development and application of ‘systems thinking’ principles for quality improvement. BMJ Open Qual. 2020;9(1): e000714.

Sampath B, Rakover J, Baldoza K, Mate K, Lenoci-Edwards J, Barker P. ​Whole-System Quality: A Unified Approach to Building Responsive, Resilient Health Care Systems. Boston: Institute for Healthcare Immprovement; 2021.

Batalden PB, Davidoff F: What is “quality improvement” and how can it transform healthcare? In . , vol. 16: BMJ Publishing Group Ltd; 2007: 2–3.

Baker G. Collaborating for improvement: the Institute for Healthcare Improvement’s breakthrough series. New Med. 1997;1:5–8.

Plsek PE. Collaborating across organizational boundaries to improve the quality of care. Am J Infect Control. 1997;25(2):85–95.

Ayers LR, Beyea SC, Godfrey MM, Harper DC, Nelson EC, Batalden PB. Quality improvement learning collaboratives. Qual Manage Healthcare. 2005;14(4):234–47.

Brandrud AS, Schreiner A, Hjortdahl P, Helljesen GS, Nyen B, Nelson EC. Three success factors for continual improvement in healthcare: an analysis of the reports of improvement team members. BMJ Qual Saf. 2011;20(3):251–9.

Dückers ML, Spreeuwenberg P, Wagner C, Groenewegen PP. Exploring the black box of quality improvement collaboratives: modelling relations between conditions, applied changes and outcomes. Implement Sci. 2009;4(1):1–12.

Nadeem E, Olin SS, Hill LC, Hoagwood KE, Horwitz SM. Understanding the components of quality improvement collaboratives: a systematic literature review. Milbank Q. 2013;91(2):354–94.

Shortell SM, Marsteller JA, Lin M, Pearson ML, Wu S-Y, Mendel P, Cretin S, Rosen M: The role of perceived team effectiveness in improving chronic illness care. Medical Care 2004:1040–1048.

Wilson T, Berwick DM, Cleary PD. What do collaborative improvement projects do? Experience from seven countries. Joint Commission J Qual Safety. 2004;30:25–33.

Schouten LM, Hulscher ME, van Everdingen JJ, Huijsman R, Grol RP. Evidence for the impact of quality improvement collaboratives: systematic review. BMJ. 2008;336(7659):1491–4.

Hulscher ME, Schouten LM, Grol RP, Buchan H. Determinants of success of quality improvement collaboratives: what does the literature show? BMJ Qual Saf. 2013;22(1):19–31.

Dixon-Woods M, Bosk CL, Aveling EL, Goeschel CA, Pronovost PJ. Explaining Michigan: developing an ex post theory of a quality improvement program. Milbank Q. 2011;89(2):167–205.

Bate P, Mendel P, Robert G: Organizing for quality: the improvement journeys of leading hospitals in Europe and the United States: CRC Press; 2007.

Andersson-Gäre B, Neuhauser D. The health care quality journey of Jönköping County Council. Sweden Qual Manag Health Care. 2007;16(1):2–9.

Törnblom O, Stålne K, Kjellström S. Analyzing roles and leadership in organizations from cognitive complexity and meaning-making perspectives. Behav Dev. 2018;23(1):63.

Greenhalgh T, Russell J. Why Do Evaluations of eHealth Programs Fail? An Alternative Set of Guiding Principles. PLoS Med. 2010;7(11): e1000360.

Wells S, Tamir O, Gray J, Naidoo D, Bekhit M, Goldmann D. Are quality improvement collaboratives effective? A systematic review. BMJ Qual Saf. 2018;27(3):226–40.

Landon BE, Wilson IB, McInnes K, Landrum MB, Hirschhorn L, Marsden PV, Gustafson D, Cleary PD. Effects of a quality improvement collaborative on the outcome of care of patients with HIV infection: the EQHIV study. Ann Intern Med. 2004;140(11):887–96.

Mittman BS. Creating the evidence base for quality improvement collaboratives. Ann Intern Med. 2004;140(11):897–901.

Wennberg JE. Unwarranted variations in healthcare delivery: implications for academic medical centres. BMJ. 2002;325(7370):961–4.

Bungay H. Cancer and health policy: the postcode lottery of care. Soc Policy Admin. 2005;39(1):35–48.

Wennberg JE, Cooper MM: The Quality of Medical Care in the United States: A Report on the Medicare Program: The Dartmouth Atlas of Health Care 1999: The Center for the Evaluative Clinical Sciences [Internet]. 1999.

DaSilva P, Gray JM. English lessons: can publishing an atlas of variation stimulate the discussion on appropriateness of care? Med J Aust. 2016;205(S10):S5–7.

Gray WK, Day J, Briggs TW, Harrison S. Identifying unwarranted variation in clinical practice between healthcare providers in England: Analysis of administrative data over time for the Getting It Right First Time programme. J Eval Clin Pract. 2021;27(4):743–50.

Wabe N, Thomas J, Scowen C, Eigenstetter A, Lindeman R, Georgiou A. The NSW Pathology Atlas of Variation: Part I—Identifying Emergency Departments With Outlying Laboratory Test-Ordering Practices. Ann Emerg Med. 2021;78(1):150–62.

Jamal A, Babazono A, Li Y, Fujita T, Yoshida S, Kim SA. Elucidating variations in outcomes among older end-stage renal disease patients on hemodialysis in Fukuoka Prefecture, Japan. PLoS ONE. 2021;16(5): e0252196.

Sutherland K, Levesque JF. Unwarranted clinical variation in health care: definitions and proposal of an analytic framework. J Eval Clin Pract. 2020;26(3):687–96.

Tanenbaum SJ. Reducing variation in health care: The rhetorical politics of a policy idea. J Health Polit Policy Law. 2013;38(1):5–26.

Atsma F, Elwyn G, Westert G. Understanding unwarranted variation in clinical practice: a focus on network effects, reflective medicine and learning health systems. Int J Qual Health Care. 2020;32(4):271–4.

Horbar JD, Rogowski J, Plsek PE, Delmore P, Edwards WH, Hocker J, Kantak AD, Lewallen P, Lewis W, Lewit E. Collaborative quality improvement for neonatal intensive care. Pediatrics. 2001;107(1):14–22.

Van Maanen J: Tales of the field: On writing ethnography: University of Chicago Press; 2011.

Golden-Biddle K, Locke K. Appealing work: An investigation of how ethnographic texts convince. Organ Sci. 1993;4(4):595–616.

Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. 2006;3(2):77–101.

Glaser BG. The constant comparative method of qualitative analysis. Soc Probl. 1965;12:436–45.

Willis R. The use of composite narratives to present interview findings. Qual Res. 2019;19(4):471–80.

Vojdani A, Vojdani E, Saidara E, Maes M. Persistent SARS-CoV-2 Infection, EBV, HHV-6 and other factors may contribute to inflammation and autoimmunity in long COVID. Viruses. 2023;15(2):400.

Choutka J, Jansari V, Hornig M, Iwasaki A. Unexplained post-acute infection syndromes. Nat Med. 2022;28(5):911–23.

Connors JM, Ariëns RAS. Uncertainties about the roles of anticoagulation and microclots in postacute sequelae of severe acute respiratory syndrome coronavirus 2 infection. J Thromb Haemost. 2023;21(10):2697–701.

Patel MA, Knauer MJ, Nicholson M, Daley M, Van Nynatten LR, Martin C, Patterson EK, Cepinskas G, Seney SL, Dobretzberger V. Elevated vascular transformation blood biomarkers in Long-COVID indicate angiogenesis as a key pathophysiological mechanism. Mol Med. 2022;28(1):122.

Greenhalgh T, Sivan M, Delaney B, Evans R, Milne R: Long covid—an update for primary care. bmj 2022, 378.

Parkin A, Davison J, Tarrant R, Ross D, Halpin S, Simms A, Salman R, Sivan M. A multidisciplinary NHS COVID-19 service to manage post-COVID-19 syndrome in the community. J Prim Care Commun Health. 2021;12:21501327211010990.

NHS England: COVID-19 Post-Covid Assessment Service, vol. Accessed 5th March 2024 at https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-post-covid-assessment-service/ . London: NHS England; 2024.

Sivan M, Halpin S, Gee J, Makower S, Parkin A, Ross D, Horton M, O'Connor R: The self-report version and digital format of the COVID-19 Yorkshire Rehabilitation Scale (C19-YRS) for Long Covid or Post-COVID syndrome assessment and monitoring. Adv Clin Neurosci Rehabil 2021;20(3).

The EuroQol Group. EuroQol-a new facility for the measurement of health-related quality of life. Health Policy. 1990;16(3):199–208.

Sivan M, Preston NJ, Parkin A, Makower S, Gee J, Ross D, Tarrant R, Davison J, Halpin S, O’Connor RJ, et al. The modified COVID-19 Yorkshire Rehabilitation Scale (C19-YRSm) patient-reported outcome measure for Long Covid or Post-COVID syndrome. J Med Virol. 2022;94(9):4253–64.

Johns MW. A new method for measuring daytime sleepiness: the Epworth sleepiness scale. Sleep. 1991;14(6):540–5.

Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13.

Van Dixhoorn J, Duivenvoorden H. Efficacy of Nijmegen Questionnaire in recognition of the hyperventilation syndrome. J Psychosom Res. 1985;29(2):199–206.

Evans R, Pick A, Lardner R, Masey V, Smith N, Greenhalgh T: Breathing difficulties after covid-19: a guide for primary care. BMJ 2023;381.

Van Dixhoorn J, Folgering H: The Nijmegen Questionnaire and dysfunctional breathing. In . , vol. 1: Eur Respiratory Soc; 2015.

Courtney R, Greenwood KM. Preliminary investigation of a measure of dysfunctional breathing symptoms: The Self Evaluation of Breathing Questionnaire (SEBQ). Int J Osteopathic Med. 2009;12(4):121–7.

Espinosa-Gonzalez A, Master H, Gall N, Halpin S, Rogers N, Greenhalgh T. Orthostatic tachycardia after covid-19. BMJ (Clinical Research ed). 2023;380:e073488–e073488.

PubMed   Google Scholar  

Bungo M, Charles J, Johnson P Jr. Cardiovascular deconditioning during space flight and the use of saline as a countermeasure to orthostatic intolerance. Aviat Space Environ Med. 1985;56(10):985–90.

CAS   PubMed   Google Scholar  

Sivan M, Corrado J, Mathias C. The Adapted Autonomic Profile (Aap) Home-Based Test for the Evaluation of Neuro-Cardiovascular Autonomic Dysfunction. Adv Clin Neurosci Rehabil. 2022;3:10–13. https://doi.org/10.47795/QKBU46715 .

Lee C, Greenwood DC, Master H, Balasundaram K, Williams P, Scott JT, Wood C, Cooper R, Darbyshire JL, Gonzalez AE. Prevalence of orthostatic intolerance in long covid clinic patients and healthy volunteers: A multicenter study. J Med Virol. 2024;96(3): e29486.

World Health Organization: Clinical management of covid-19 - living guideline. Geneva: WHO. Accessed 4th October 2023 at https://www.who.int/publications/i/item/WHO-2019-nCoV-clinical-2021-2 ; 2023.

Ahmed I, Mustafaoglu R, Yeldan I, Yasaci Z, Erhan B: Effect of pulmonary rehabilitation approaches on dyspnea, exercise capacity, fatigue, lung functions and quality of life in patients with COVID-19: A Systematic Review and Meta-Analysis. Arch Phys Med Rehabil 2022.

Dillen H, Bekkering G, Gijsbers S, Vande Weygaerde Y, Van Herck M, Haesevoets S, Bos DAG, Li A, Janssens W, Gosselink R, et al. Clinical effectiveness of rehabilitation in ambulatory care for patients with persisting symptoms after COVID-19: a systematic review. BMC Infect Dis. 2023;23(1):419.

Learmonth Y, Dlugonski D, Pilutti L, Sandroff B, Klaren R, Motl R. Psychometric properties of the fatigue severity scale and the modified fatigue impact scale. J Neurol Sci. 2013;331(1–2):102–7.

Webster K, Cella D, Yost K. The Functional Assessment of Chronic Illness T herapy (FACIT) Measurement System: properties, applications, and interpretation. Health Qual Life Outcomes. 2003;1(1):1–7.

Mundt JC, Marks IM, Shear MK, Greist JM. The Work and Social Adjustment Scale: a simple measure of impairment in functioning. Br J Psychiatry. 2002;180(5):461–4.

Chalder T, Berelowitz G, Pawlikowska T, Watts L, Wessely S, Wright D, Wallace E. Development of a fatigue scale. J Psychosom Res. 1993;37(2):147–53.

Shahid A, Wilkinson K, Marcu S, Shapiro CM: Visual analogue scale to evaluate fatigue severity (VAS-F). In: STOP, THAT and one hundred other sleep scales . edn.: Springer; 2011:399–402.

Parker M, Sawant HB, Flannery T, Tarrant R, Shardha J, Bannister R, Ross D, Halpin S, Greenwood DC, Sivan M. Effect of using a structured pacing protocol on post-exertional symptom exacerbation and health status in a longitudinal cohort with the post-COVID-19 syndrome. J Med Virol. 2023;95(1): e28373.

Kenny RA, Bayliss J, Ingram A, Sutton R. Head-up tilt: a useful test for investigating unexplained syncope. The Lancet. 1986;327(8494):1352–5.

Drury MOC: Science and Psychology. In: The selected writings of Maurice O’Connor Drury: On Wittgenstein, philosophy, religion and psychiatry. edn.: Bloomsbury Publishing; 2017.

Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med. 2000;342(25):1887–92.

Mongtomery K: How doctors think: Clinical judgment and the practice of medicine: Oxford University Press; 2005.

Download references

Acknowledgements

We are grateful to clinic staff for allowing us to study their work and to patients for allowing us to sit in on their consultations. We also thank the funder of LOCOMOTION (National Institute for Health Research) and the patient advisory group for lived experience input.

This research is supported by the National Institute for Health Research (NIHR) Long Covid Research Scheme grant (Ref COV-LT-0016).

Author information

Authors and Affiliations

Nuffield Department of Primary Care Health Sciences, University of Oxford, Woodstock Rd, Oxford, OX2 6GG, UK

Trisha Greenhalgh, Julie L. Darbyshire & Emma Ladds

Imperial College Healthcare NHS Trust, London, UK

LOCOMOTION Patient Advisory Group and Lived Experience Representative, London, UK


Contributions

TG conceptualized the overall study, led the empirical work, supported the quality improvement meetings, conducted the ethnographic visits, led the data analysis, developed the theorization and wrote the first draft of the paper. JLD organized and led the quality improvement meetings, supported site-based researchers to collect and analyse data on their clinic, collated and summarized data on quality topics, and liaised with the patient advisory group. CL conceptualized and led the quality topic on POTS, including exploring reasons for some clinics’ reluctance to conduct testing and collating and analysing the NASA Lean Test data across all sites. EL assisted with ethnographic visits, data analysis, and theorization. JCS contributed lived experience of long covid and also clinical experience as an occupational therapist; she liaised with the wider patient advisory group, whose independent (patient-led) audit of long covid clinics informed the quality improvement prioritization exercise. All authors provided extensive feedback on drafts and contributed to discussions and refinements. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Trisha Greenhalgh.

Ethics declarations

Ethics approval and consent to participate

The LOng COvid Multidisciplinary consortium Optimising Treatments and servIces acrOss the NHS (LOCOMOTION) study is sponsored by the University of Leeds and was approved by the Yorkshire & The Humber—Bradford Leeds Research Ethics Committee (ref: 21/YH/0276), with subsequent amendments.

Patient participants in clinic were approached by the clinician (without the researcher present) and gave verbal informed consent for a clinically qualified researcher to observe the consultation. If they consented, the researcher was then invited to sit in. A written record of this verbal consent was made in field notes. It was impractical to seek consent from patients whose cases were discussed (usually with very brief clinical details) in online MDTs. Therefore, clinical case examples from MDTs presented in the paper are fictionalized cases constructed from multiple real cases and with key clinical details changed (for example, comorbidities were replaced with different conditions which would produce similar symptoms). All fictionalized cases were reviewed by our patient advisory group to confirm that they were plausible to lived experience experts.

Consent for publication

No direct patient cases are reported in this manuscript. For details of how the fictionalized cases were constructed and validated, see “Consent to participate” above.

Competing interests

TG was a member of the UK National Long Covid Task Force 2021–2023 and on the Oversight Group for the NICE Guideline on Long Covid 2021–2022. She is a member of Independent SAGE.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Greenhalgh, T., Darbyshire, J.L., Lee, C. et al. What is quality in long covid care? Lessons from a national quality improvement collaborative and multi-site ethnography. BMC Med 22, 159 (2024). https://doi.org/10.1186/s12916-024-03371-6


Received: 04 December 2023

Accepted: 26 March 2024

Published: 15 April 2024

DOI: https://doi.org/10.1186/s12916-024-03371-6


Keywords

  • Post-covid-19 syndrome
  • Quality improvement
  • Breakthrough collaboratives
  • Warranted variation
  • Unwarranted variation
  • Improvement science
  • Ethnography
  • Idiographic reasoning
  • Nomothetic reasoning



Evaluation of pseudo-reader study designs to estimate observer performance results as an alternative to fully crossed, multi-reader, multi-case studies

Rickey E. Carter

a Department of Health Sciences Research, Mayo Clinic, 4500 San Pablo Road South, Jacksonville, FL 32224

David R. Holmes, III

b Department of Physiology and Biomedical Engineering, Mayo Clinic, 200 First Street SW, Rochester, MN 55905

Joel G. Fletcher

c Department of Radiology, Mayo Clinic, 200 First Street SW, Rochester, MN 55905

Cynthia H. McCollough

Rationale and Objectives:

To examine the ability of a pseudo-reader study design to estimate the observer performance obtained using a traditional fully crossed, multi-reader, multi-case (MRMC) study.

Materials and Methods:

A ten-reader MRMC study with 20 computed tomography datasets was designed to measure observer performance on four novel noise reduction methods. This study served as the foundation for an empirical evaluation of three different pseudo-reader designs, each of which used a similar bootstrap approach to generate 2000 realizations from the fully crossed study. One randomly selected realization served as a "mock study" representing a hypothetical, prospective implementation of the design. Our three approaches to generating a pseudo-reader varied in the degree to which reader performance was matched and integrated into the pseudo-reader design.

Results:

Using the traditional fully crossed design, the figures of merit (FOMs) (95% CIs) for the four noise reduction methods were 68.2 (55.5–81.0), 69.6 (58.4–80.8), 70.8 (60.2–81.4), and 70.9 (60.4–81.3), respectively. When radiologists' performances on the fourth noise reduction method were used to pair readers for the mock study, there was strong agreement in the estimated FOMs, with estimates from the pseudo-reader design falling within ±3% of the fully crossed design.

Conclusion:

Fully crossed MRMC studies require significant investment in resources and time, often resulting in delayed implementation or minimal human testing before dissemination. The pseudo-reader approach accelerates study conduct by combining readers judiciously and was found to provide results comparable to the traditional fully crossed design, at the cost of strong assumptions about the exchangeability of the readers.

Introduction

With the ongoing development of new medical image acquisition and reconstruction methods comes the need to objectively measure how human (e.g., radiologist) performance changes with the new or altered appearance of medical images (1, 2). This is an essential step in confirming the diagnostic accuracy of an imaging technique prior to its adoption in clinical practice. To date, the standard approach is to utilize a multi-reader, multi-case (MRMC) study design wherein a large number of readers evaluate a large number of cases examining different imaging alternatives. MRMC designs are categorized as fully crossed when each reader reviews every patient case using each imaging strategy. The resources required to conduct such a study are considerable. For example, if there are 10 radiologists reviewing 100 cases for five different imaging strategies (e.g., radiation dose levels), each radiologist must review 500 datasets, for a total reading burden of 5000 reader interpretations. Frequently, these reading interpretations need to be spread out over time, sometimes months apart, in order to minimize bias related to recall of conspicuous attributes in the patient/dataset, so fully crossed MRMC studies do not provide an expeditious path to evaluating new imaging alternatives. The evidence base for new imaging strategies may lag considerably behind technology development, and in some cases the technologies under study may no longer be considered state-of-the-art by the time the results of MRMC studies are disseminated.

To address these limitations, alternative designs to help accelerate the conduct of human observer performance studies have been proposed (3–5). A pseudo-reader study design is an emerging alternative to a fully crossed MRMC study, based on the concept of a Latin square experimental design (3). This design distributes the reading assignments over a large number of readers using an a priori plan that recombines individual readers into one or more virtual ("pseudo-") readers for analysis. While the design is hypothesized to allow much greater flexibility with reading schedules and to reduce the time required to quantify human observer performance, its operating characteristics, limitations, requirements, and performance are not fully described or understood.

In 2016, our team, in collaboration with the American Association of Physicists in Medicine (AAPM) and the National Institute of Biomedical Imaging and Bioengineering, conducted the 2016 Low Dose CT Grand Challenge (3). In this challenge, multiple institutions received low-dose CT images or projection data from contrast-enhanced abdominal CT examinations and returned denoised images, with the winner declared based on the correct visual detection of hepatic metastases. Because of the relatively short time frame over which the challenge was conducted, a pseudo-reader approach was used to assess the submitted image denoising and iterative reconstruction approaches. The purpose of the present study is to empirically determine the level of agreement between the pseudo-reader design and a fully crossed MRMC study design by conducting a follow-up study on selected submissions to the AAPM CT Grand Challenge.

Materials and Methods

Institutional review board approval was obtained for this HIPAA-compliant study. Radiologists participating in this study provided oral consent according to the instructions related to the institutional review board approval.

Validation Study Design

We selected images returned by four of the sites participating in the AAPM CT Low Dose Grand Challenge for inclusion in this study; 2 sites performed projection space iterative reconstruction to reduce noise and 2 sites used image space noise reduction techniques. This reuse of the AAPM CT Low Dose Grand Challenge data was consistent with the signed data sharing agreement associated with the challenge (6). To blind this study to the original competition results, the noise reduction method (NRM) used by each of the sites is simply referred to by the designation NRM A–D (Figure 1). As part of the original grand challenge, these four NRMs were applied to 20 simulated low-dose CT patient datasets, which had been prepared using a validated technique to insert noise into the measured projection data (7). All radiologist reader interpretations reported herein are unique to the current work and were not reported as results in the Grand Challenge (3). Details relating to reference standards for the contrast-enhanced CT data in the study have been previously published (3).

Figure 1. Representative slices from the four noise reduction methods along with the reference slice. The number of detections out of the 10 readers is noted for each noise reduction method.

A total of 10 radiology trainees (senior residents and fellows) volunteered to participate as readers in this study and provided oral informed consent prior to participating. Prior to initiation of reads for the study, the readers received standardized training on participation in MRMC studies by a radiologist co-investigator (J.G.F., with experience in reader training (8, 9) and 20 years as a staff abdominal radiologist), including training on the correct use of confidence scales and the operation of the custom reader interpretation workstation. The primary task for this study was the detection of an undisclosed number of hepatic metastases spread over the 20 patients. The MRMC design utilized the standard fully crossed design such that each reader would interpret a total of 80 datasets. Four reading sessions were required for each radiologist, using randomized reading worklists such that readers reviewed each patient's CT images only once in a given reading session. Total reading time for the study spanned 88 days (07-21-2017 – 10-17-2017), with a median of 21 days [interquartile range: 21 to 25 days] separating reading sessions. Trainees evaluated each case visually on a specially designed computer workstation, circling all liver lesions and assigning a score (from 0 to 100) for their confidence that the circled lesion represented a hepatic metastasis (9).

Pseudo-Reader Study Design Assumptions and Specification

The pseudo-reader approach (Figure 2) is built on the principle that readers are exchangeable. At least two conditions need to be tenable to achieve this: readers should be similarly trained, and they should demonstrate similar behavior during observer performance studies. For the second condition, which could be influenced by experience level, consistent use of the confidence rating is one of the most critical aspects. As a standard practice in our MRMC studies, all readers receive detailed instruction on the use of the confidence scale via a standardized set of written and oral instructions, with the intention of minimizing the reader-to-reader variation that may occur with inconsistent use of the scale. Both are strong assumptions that we considered necessary to validate. To do this, we used data derived from a fully crossed MRMC study to create pseudo-readers and simulated pseudo-reader-based MRMC studies using the following three strategies.

  • Single Pseudo-Reader. A single pseudo-reader can be derived from the data of a fully crossed MRMC study by randomly sampling one of the reader interpretations for each imaging strategy-by-patient combination. In the context of this study, there were 10 radiologist interpretations from which to randomly select one representative interpretation for each of the 80 (4 NRM × 20 patient) combinations. Thus, instead of a 10-reader study with a total of 800 reading interpretations, this pseudo-reader utilized only 80 reading interpretations (one reading for each NRM–patient combination). Note that there are 10^80 possible single pseudo-readers that can be constructed from this study (10 candidate readers for each of the 80 cells).
  • Performance Stratified Pseudo-Reader. This approach assumes that reader performance varies across readers and that accounting for this variation in the generation of the pseudo-reader will improve precision in the estimation process. To test this concept, a single NRM was selected as if it had been a pilot study designed to estimate reader performance figures of merit (FOMs). There are two commonly used approaches for estimating the FOM under a free-response paradigm, both generally denoted as jackknife alternative free-response receiver operating characteristic (JAFROC) analysis. The distinction between the two approaches is in how the FOM definitions treat non-localizations (i.e., false positive markings) in positive cases: JAFROC1 penalizes non-localizations in cases with a target lesion whereas JAFROC does not (10). Both FOMs measure subtle differences in reader performance that we desired to account for in our performance estimation process. Accordingly, we constructed a summary composite measure of reader performance, the mean of the JAFROC1 and JAFROC reader-specific FOMs. To stratify readers based on performance, strata were created with the following bounds on the mean composite score: [0, .6), [.6, .7), [.7, .8), [.8, .9), and [.9, 1.0). This performance binning placed 0, 5, 4, 1, and 0 readers into those categories based on the readers' composite performance on NRM D, respectively. The end result was three pseudo-readers: one sampling from 5 readers (i.e., those with FOMs in the range [.6, .7)), one sampling from 4 readers, and one equal to the data from a single reader.
  • Performance Matched Pseudo-Reader. Using the same composite score (i.e., the mean of the JAFROC1 and JAFROC FOMs for NRM D) as the performance stratified pseudo-reader, readers were matched on performance based on the rank order of the estimated performance. For example, Readers 8 and 6, whose mean FOMs of 0.61 and 0.62 were the two lowest in this study, were paired together. Similarly, readers 9 and 4, 1 and 7, 3 and 10, and 2 and 5 were also paired, resulting in 5 pseudo-readers for the study (a minimal code sketch of this pairing follows the list).
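
To make the pairing step concrete, here is a minimal R sketch (R being the language the authors report using) that rank-orders readers on a composite pilot FOM and pairs adjacent ranks. The object names and placeholder FOM values are hypothetical; in the study, the composite scores came from the NRM D pilot analysis described above.

    # Hypothetical composite FOMs (mean of JAFROC1 and JAFROC) from a pilot NRM.
    set.seed(1)
    fom <- data.frame(reader    = 1:10,
                      composite = round(runif(10, 0.58, 0.80), 2))

    # Rank readers by composite performance and pair adjacent ranks,
    # yielding 5 performance-matched pseudo-readers from 10 real readers.
    fom <- fom[order(fom$composite), ]
    fom$pseudo_reader <- rep(1:5, each = 2)
    fom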

Figure 2. Representation of how a fully crossed design can be translated into a pseudo-reader study design. Colors in the cells represent individual interpretations by up to 10 readers. In the fully crossed design, all 10 readers read the entire panel of 20 cases across the four imaging configurations. In contrast, a pseudo-reader design federates the complete reading coverage over multiple readers, resulting in fewer total reading interpretations.

For each of the three potential pseudo-reader specifications, a generalized bootstrap approach was utilized to randomly generate possible study results originating from the observed fully crossed study design. The random selection process was such that if a reader’s interpretation for a NRM and patient dataset was selected for incorporation into the pseudo-reader design, all reader marks (lesion localizations and non-lesion localizations), if any, for that case were included as a set. For each potential pseudo-reader design, 2000 replicates were created by randomly selecting a single interpretation for every stratum in the study design. This random selection process within stratum is the manifestation of the exchangeability assumption in the pseudo-reader design. These simulated study results were generated and archived for subsequent analyses.
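
To illustrate the replicate-generation step, the sketch below draws one constituent reader at random for every pseudo-reader × NRM × patient cell and keeps all of that reader's marks for the cell as a set. It assumes a hypothetical data frame, marks, with one row per mark and columns reader, pseudo_reader, nrm, and patient; it sketches the resampling logic only and is not the authors' code.

    # One bootstrap replicate: for each pseudo-reader/NRM/patient cell,
    # keep the full mark set of one randomly chosen constituent reader.
    one_replicate <- function(marks) {
      cells  <- unique(marks[, c("pseudo_reader", "nrm", "patient")])
      picked <- lapply(seq_len(nrow(cells)), function(i) {
        cell   <- merge(cells[i, ], marks)       # all candidate marks for this cell
        rs     <- unique(cell$reader)
        chosen <- rs[sample.int(length(rs), 1)]  # the exchangeability assumption
        cell[cell$reader == chosen, ]
      })
      do.call(rbind, picked)
    }

    # replicates <- replicate(2000, one_replicate(marks), simplify = FALSE)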

Mock Study Design

The three pseudo-reader approaches described above were conducive to a general, post-hoc examination of the operating characteristics of each pseudo-reader design; however, the resampling approach used to generate the distribution of FOMs did not directly mimic the use of any one design for one particular realization of a study's data in practice. If a design were implemented prospectively, the results would need to be based on a single set of reader interpretations. To simulate the results of an actual study, a mock study was created.

Of the three pseudo-reader designs presented above, the performance matched pseudo-reader approach was selected for use in the mock study. Of the designs considered, it yielded the largest number of pseudo-readers (five), allowing for a more precise estimate of the reader variance component in the JAFROC analysis.

As before, NRM D was arbitrarily selected as the pilot study used to estimate reader performance for matching. The composite index, defined as the mean of the JAFROC1 and JAFROC FOMs, was used to match readers into pairs. Reader performance on the remaining three NRMs was estimated in two ways. First, the fully crossed results on NRMs A–C were obtained using all of the original reader marks from the overall study; this fully crossed result was considered the reference. Then, a single performance matched pseudo-reader study was generated using the paired reader data in order to estimate the FOMs for NRMs A–C (i.e., by randomly selecting one of the two paired observations for each reader stratum × NRM × patient combination). The archived bootstrap replicates generated above were used as the sampling frame for the mock study. In particular, one of the 2000 randomly generated pseudo-reader results was to be selected at random to provide the data for the mock study. For transparency, the following commands were used in R to randomly select the bootstrap replicate that would serve as the study result for the mock study: set.seed(20180816) and head(sample(1:2000), n=1). The randomized selection process resulted in bootstrap replicate number 139 being selected. Note that the seed was set to the numeric date on which the simulation was run, as this is the standard process when randomized reading sets are produced for general MRMC study designs by our team. Accordingly, it was considered the approach that would have been used had the mock study been implemented prospectively. The selection process was also blind to the numeric results of all bootstrap replicates.
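
For reference, the two selection commands quoted above are reproduced here as they would appear in an R session; the comments restate what the text reports rather than new output.

    set.seed(20180816)           # seed = numeric date on which the simulation was run
    head(sample(1:2000), n = 1)  # returned replicate number 139 in the authors' run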

Statistical Analysis

To summarize human observer performance in the fully crossed study and in all bootstrap replicates generated through the simulation studies, we utilized JAFROC1 FOMs (10). While the utilization of non-localizations in positive cases has been debated (10), this approach was selected because all false positives were of interest and 14 (70%) of the cases had at least one hepatic metastasis. In the context of the bootstrap replicates, the 2000 individual study results were used to generate an empirical distribution of FOM estimates. The 95% bootstrap confidence interval was obtained by selecting the 2.5th and 97.5th percentiles of this empirical distribution.
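
In R, this percentile interval reduces to a single quantile() call; fom_boot below is a hypothetical numeric vector holding the 2000 replicate FOM estimates.

    # Percentile bootstrap: point estimate and 95% interval from the replicates.
    mean(fom_boot)
    quantile(fom_boot, probs = c(0.025, 0.975))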

These analyses were supported by additional examination of the performance metrics from the mock study. JAFROC1 FOMs were estimated for NRMs A–C for a derived fully crossed design that removed reader marks for NRM D, and for the randomly selected performance matched pseudo-reader study (simulation #139). In addition, lesion detection sensitivity was directly assessed for both study designs. Per-lesion sensitivity was estimated using generalized estimating equations (GEE). For these estimates, a minimal reader confidence threshold was considered helpful to establish a comparison to clinical practice. For this purpose, a reader confidence of >10 was felt to be reasonable (e.g., corresponding to the phrase "probable tiny cysts" that a radiologist might use in a clinical report). Correct localization and a confidence score >10 were required for a lesion to be considered detected in the sensitivity calculation.
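
A sketch of such a GEE model follows, using the geepack package as one standard R route to GEE (the text does not name the GEE software used). The data frame lesions, with one row per reader-lesion pair and columns localized, confidence, nrm, and patient, is hypothetical.

    library(geepack)

    # Detection indicator: correct localization with confidence > 10.
    lesions$detected <- as.integer(lesions$localized & lesions$confidence > 10)

    # Per-lesion sensitivity by NRM; an exchangeable working correlation
    # accounts for repeated interpretations of the same patient dataset.
    fit <- geeglm(detected ~ factor(nrm), id = patient, data = lesions,
                  family = binomial, corstr = "exchangeable")
    summary(fit)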

To understand the reproducibility of confidence scores among readers across all reader interpretations, the intra-class correlation was calculated by constructing datasets that listed each reader's confidence score for each true lesion. If a radiologist detected the lesion, the assigned confidence score was used; if a reader did not detect the lesion, a confidence score of 0 (i.e., the value utilized in a JAFROC1 analysis) was used. In addition, we examined the speed at which cases were reviewed over the course of the study using the internal time stamps recorded by the workstation. Data were grouped into reader-patient combinations, and the case reading times were modeled using a random effects model with a fixed effect for reading session number. Post hoc comparisons of the model-based mean times by reading session were estimated. P-values reported are two-sided and have not been adjusted for multiple comparisons. Statistical analysis was conducted using R version 3.4.2 (Vienna, Austria). JAFROC1 FOMs were calculated for every NRM and every reader using the Hillis improvement (11) to the Dorfman, Berbaum and Metz method (12) under the modeling assumption of random readers and random cases, using the RJafroc package v1.0.1. Lesion sensitivity was calculated using the mrmctools package.
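
The ICC construction can be sketched as follows, assuming a hypothetical 33 × 10 matrix, scores, of per-lesion confidence values (one column per reader, NA where a lesion was missed). The irr package is one common route to the ICC; the implementation actually used is not stated.

    library(irr)

    # Missed lesions receive a confidence of 0, the JAFROC1 convention noted above.
    scores[is.na(scores)] <- 0

    # Two-way agreement ICC for single ratings across the 10 readers.
    icc(scores, model = "twoway", type = "agreement", unit = "single")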

Fully Crossed Validation Study Results

The 10 radiologist trainees each read 80 study datasets (4 NRMs applied to 20 unique patient datasets). Figure 3 plots the estimated FOMs for the JAFROC1 analysis. None of the pairwise comparisons in FOMs among NRMs was statistically significant (p>0.32 for the six possible comparisons). The confidence intervals were noticeably, and expectedly, wide given the limited number of cases examined, and there was significant reader variation in the study. In particular, reader-specific FOMs ranged from 0.601 to 0.796. The time to read cases on the specialized computer workstation also improved over the course of the study (Figure 4). As with the FOMs, there was significant variability among the readers. The largest drop in reading time occurred between the first and second reading sessions.

Figure 3. Results of the fully crossed validation study. A total of 20 patient datasets reconstructed with 4 different noise reduction methods (NRMs) were read by 10 radiologist trainees. The JAFROC figure of merit was estimated using a random-reader, random-case analysis approach that penalized performance for non-localizations ("false positives") in cases with and without true lesions (JAFROC1 analysis).

Figure 4. Longitudinal analysis of the mean case reading times over the four reading sessions, stratified by reader. P-values are tests of model-based means between each time point.

To assess the reproducibility of the confidence scores assigned by the 10 readers to the 33 hepatic metastases, the intra-class correlation coefficient (ICC) was utilized. For NRMs A–D, the ICCs (95% CIs) were 0.622 (0.499–0.751), 0.604 (0.478–0.737), 0.657 (0.538–0.778), and 0.609 (0.484–0.741), respectively. Based on the common Landis and Koch interpretations (13), this implies that inter-reader utilization of the confidence scores was in the substantial agreement range, with confidence intervals indicating the potential for moderate agreement.

Pseudo-reader Results

Figure 5 plots the histograms and bootstrap estimates of the FOMs from the three general pseudo-reader strategies. There are two distinct trends in the figure. First, judicious blocking on reader performance provides dramatic improvements (reductions) in the variability of FOMs obtained from a pseudo-reader design. The first row in Figure 5 shows the least reproducibility in findings, even though these estimates do in fact align well with the estimate obtained using the fully crossed design. The second row shows the performance stratified results. While this provides a more precise estimate of the FOM, there appears to be bias in the estimated result (FOMs were overestimated using the stratification plan studied). This finding is likely a direct result of having too much heterogeneity in the [0.6, 0.7) stratum and heavily weighting a single, strong-performing reader in the [0.7, 0.8) stratum. The effect of performance matching is shown in the bottom row of Figure 5. Here, readers were paired into 5 pseudo-readers, and the estimated FOMs were nearly perfectly aligned with the results obtained using the fully crossed study design. It should be noted that the bootstrap confidence interval shown in red in this figure is fundamentally different from the confidence interval presented for the overall JAFROC1 method: here, the bootstrap interval is a measure of convergence of the pseudo-reader result to the fully crossed study result, not to the general population of all possible FOM results that could be obtained should the fully crossed study be replicated.

Figure 5. Histograms of the individual estimates obtained from each of the bootstrap replicates for each pseudo-reader strategy. The blue (dashed) lines represent the estimated FOM (95% CIs) from the fully crossed study design. The red (dashed) lines represent the mean (95% bootstrap confidence interval) for the pseudo-reader study design. The top row shows the results from a single pseudo-reader generated from the 10 readers. The remaining rows show the effect of stratification (middle row) and matching (bottom row) on estimated FOMs.

Mock Study Result

To directly compare the pseudo-reader and fully crossed results, the mock study was created and analyzed as if it had been conducted prospectively. Here, unlike in the pseudo-reader simulations described above, the JAFROC1 estimates of the FOM and associated 95% CIs under the two designs are directly comparable. Figure 6 presents the results. The point estimates from the two study designs agree within ±3%, and the confidence intervals are essentially the same width despite the pseudo-reader approach utilizing half the number of reading interpretations. Table 1 presents the estimates of lesion detection for the fully crossed and performance matched study designs. As with the FOMs, there is strong agreement in the estimated lesion sensitivity, with the point estimates matching numerically for two of the three NRMs and NRM B estimated within ±3%. The confidence intervals for the pseudo-reader approach, however, were wider, a direct result of utilizing only 5 instead of 10 readers to estimate the pooled sensitivity.

Figure 6. Comparison of the pseudo-reader result (bottom, red) vs. the fully crossed study (top, black) for the mock study, which used noise reduction method D to pair readers based on performance. The estimates that utilized 5 pseudo-readers (400 reading interpretations) are compared to those obtained using the fully crossed results utilizing all 10 readers (800 reading interpretations). For the pseudo-reader estimate, a single estimate was drawn at random from the pool of 2000 bootstrapped estimates.

Table 1. Lesion-specific sensitivity for the detection of 33 lesions among 20 patient datasets. The estimated sensitivities and confidence intervals are based on GEEs that account for the repeated interpretations of the patient datasets by either the 10 readers or the 5 performance-matched pseudo-readers derived from the fully crossed study design. The range presented is for the individual reader or pseudo-reader performances.

Discussion

This study quantified the operating characteristics of human observer performance studies under a new approach to assigning reading sets to human observers, by simulating three different pseudo-reader approaches from a fully crossed MRMC dataset. The approach is referred to as a "pseudo-reader," reflecting the fact that virtual readers are created by combining judiciously assigned datasets across a group of similarly trained readers. This parallelization of the reading list was found to provide comparable results to the traditional fully crossed design. Unlike the traditional design, however, parallelizing the reading across multiple readers has the potential to rapidly evaluate observer performance while directly addressing one of the key limitations of the standard MRMC approach: lesion conspicuity and reader recall. The simulation studies supported the concept that the more exchangeable the readers are, the more closely the results align with the fully crossed study.

The logistical and scientific limitation of a large number of reading interpretations has been discussed before in the context of a split-plot adaptation of the MRMC study design (4, 5). The work by Obuchowski et al formally developed a test statistic for the split-plot design (4). The goal of reducing reader interpretations and accelerating testing is common to the split-plot and pseudo-reader approaches, but there are fundamental differences in their conceptual designs. The split-plot design has a formal statistical foundation that builds upon strict nesting of reader pairs within each stratum. In enforcing this hierarchy in the study design, reader performance is estimated and affects the variance components. Such a study design is not readily implemented in standard software such as RJafroc, and the need for a hierarchical data structure limits flexibility of implementation. With the pseudo-reader approach, the reader variation is allowed to coalesce with the residual error. While, statistically speaking, this could result in a loss of efficiency, it also opens up more flexibility in study design. As shown in this study, matching on readers' performance appears important in the conduct of a pseudo-reader design. The pseudo-reader concept thus deviates from the split-plot design by measuring and matching readers around the assumption of exchangeability. The design is flexible in that many readers may be combined into a single pseudo-reader, potentially at the cost of being unable to estimate the variance components attributable to each individual reader. A scenario with many new imaging strategies or decision support tools to be evaluated is expected to be one in which a pseudo-reader design would be useful: the design could allow rapid preliminary examination of a wide range of configurations in order to provide empirical data for planning a confirmatory, standard MRMC study.

Limitations of the research are worth noting. First, the results of simulations show the inherent variability that one might encounter should studies be repeated numerous times. This is a general issue for all MRMC study designs, and confidence intervals help convey this variability. In the case of a pseudo-reader, the within-stratum variation attributable to differences in the performance of matched readers is not directly accounted for in the standard JAFROC analysis. This might suggest that the confidence intervals using pseudo-readers are too narrow (optimistic), but this may be a function of how well the readers are performance matched; future study is warranted on this topic. Another limitation is that our performance matching approaches are relatively rudimentary, although they showed promise. One could imagine conducting a much more stringent evaluation of reader performance in which detection across various lesion sizes was evaluated, confidence score utilization was examined closely, and overall detection ability was quantified through a comprehensive training and evaluation protocol. Our data did not provide this richness, so computer adaptive approaches to quantifying expected reader performance are a topic for further research. In the context of the mock trial, it would have been desirable to evaluate performance on cases external to the challenge to provide more purity with respect to the detection task (i.e., it is possible that recall of conspicuous lesions could confound the results). There is also an inherent limitation in the selection of readers for the study. The readers chosen were relatively homogeneous with respect to training and professional experience, something quite different from many MRMC studies. While there was some consistency in training, reading performance based on the FOMs did demonstrate a range of performance, potentially indicating the complexity of the low-dose detection task. These limitations notwithstanding, this was one of the first attempts at validating the pseudo-reader study design.

In conclusion, with attention to reader performance and matching, a multi-pseudo-reader, multi-case study design can yield substantial savings in time while providing a comparable quantification of observer performance. Such an approach has the potential to greatly accelerate human evaluation of altered imaging strategies, which is especially timely given the rapid development of artificial intelligence-based computer decision support tools. To achieve this gain, however, the strong assumption of exchangeability of readers must be made.

Acknowledgments

Funding: This work was supported by the National Institutes of Health under supplemental award number U01 EB017185. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Interest: Dr. McCollough received industry funding support from Siemens Healthcare, unrelated to this work. The other authors have nothing to declare.


  15. Impact of artificial intelligence support on accuracy and ...

    Methods: A multi-reader multi-case study was performed with 240 bilateral DBT exams (71 breasts with cancer lesions, 70 breasts with benign findings, 339 normal breasts). Exams were interpreted by 18 radiologists, with and without AI support, providing cancer suspicion scores per breast. Using AI support, radiologists were shown examination ...

  16. Multi-Reader Multi-Case Studies Using the Area under the Receiver

    Radiological tests must be interpreted by human observers and a common study design uses multiple readers to interpret multiple image cases; the multi-reader multi-case (MRMC) design . The MRMC design is popular because once a radiologist has viewed 20 cases there is less information to be gained by asking him to view a further 20 than by ...

  17. iMRMC: Multi-Reader, Multi-Case Analysis Methods (ROC, Agreement, and

    Do Multi-Reader, Multi-Case (MRMC) analyses of data from imaging studies where clinicians (readers) evaluate patient images (cases). What does this mean? ... Many imaging studies are designed so that every reader reads every case in all modalities, a fully-crossed study. In this case, the data is cross-correlated, and we consider the readers and cases to be cross-correlated random effects. An ...

  18. Chest radiograph classification and severity of suspected ...

    Our study differs in that it pivots around three potential clinical scenarios that use the CXR to manage suspected COVID-19. Using a prospective multi-reader, multi-case design, we determined reader agreement for four clinical groups who are tasked with CXR interpretation in daily practice and compared these to a consensus reference standard.

  19. Impact of artificial intelligence support on accuracy and reading time

    A multi-reader multi-case study was performed with 240 bilateral DBT exams (71 breasts with cancer lesions, 70 breasts with benign findings, 339 normal breasts). Exams were interpreted by 18 radiologists, with and without AI support, providing cancer suspicion scores per breast. Using AI support, radiologists were shown examination-based and ...

  20. OSF

    In multi-reader multi-case study designs, do AI-diagnostic tools improve accuracy outcomes for radiologists in cancer risk stratification? The future of AI seems to be as a diagnostic support tool integrated within a radiologist's workflow, allowing an increase in efficiency and a reduction in errors1. Studies that thus compare AI models with ...

  21. Multi-Reader Multi-Case Study for Performance Evaluation of High-Risk

    A multicenter, multireader, and multicase (MRMC) study was designed to compare clinician performance without and with the use of CAD. Interobserver variability was also analyzed. Excellent, satisfactory, and poor segmentations were observed in 25.3%, 58.9%, and 15.8% of nodules, respectively. There were 200 patients with 265 nodules in the ...

  22. What is quality in long covid care? Lessons from a national quality

    The LOCOMOTION study. LOCOMOTION (LOng COvid Multidisciplinary consortium Optimising Treatments and services across the NHS) was a 30-month multi-site case study of 10 long covid clinics (8 in England, 1 in Wales and 1 in Scotland), beginning in 2021, which sought to optimise long covid care.

  23. Computers

    Aspect-based sentiment analysis (ABSA) is a fine-grained type of sentiment analysis; it works on an aspect level. It mainly focuses on extracting aspect terms from text or reviews, categorizing the aspect terms, and classifying the sentiment polarities toward each aspect term and aspect category. Aspect term extraction (ATE) and aspect category detection (ACD) are interdependent and closely ...

  24. Systems

    In cloud manufacturing environments, the scheduling of multi-user manufacturing tasks often fails to consider the impact of service supply on resource allocation. This study addresses this gap by proposing a bi-objective multi-user multi-task scheduling model aimed at simultaneously minimising workload and maximising customer satisfaction. To accurately capture customer satisfaction, a novel ...

  25. Evaluation of pseudo-reader study designs to estimate observer

    To date, the standard approach is to utilize a multi-reader, multi-case (MRMC) study design wherein a large number of readers evaluate a large number of cases examining different imaging alternatives. MRMC designs are categorized as fully crossed when each reader reviews every patient case using each imaging strategy. The resources to conduct ...

  26. Nonprofit hospitals posting a profit should lose tax exempt status

    In a twist of irony, a 2021 study published in Health Affairs found that for-profit hospitals provided 65% more charity care than nonprofit ones. The bar for receiving tax-exempt status should not ...

  27. PDF RESEARCH ARTICLE Multi-Reader Multi-Case Studies Using the Area under

    and a common study design uses multiple readers to interpret multiple image cases; the multi-reader multi-case (MRMC) design [4]. The MRMC design is popular because once a radiologist has viewed 20 cases there is less information to be gained by asking him to view a further 20 than by asking a different radiologist to view the same 20.
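USAGE SKETCHES

The resources above reference two R packages for MRMC analysis. The following minimal sketches illustrate how each might be invoked. They assume the documented interfaces of the CRAN iMRMC and MRMCaov packages; function and field names may differ across package versions, so consult each package's help files before use.

First, a sketch of an iMRMC analysis of simulated data, assuming the package's Roe & Metz simulation helpers (sim.gRoeMetz.config, sim.gRoeMetz) and its doIMRMC analysis function:

    # Sketch only: assumes the CRAN iMRMC package's documented interface.
    library(iMRMC)

    # Simulate a fully-crossed, two-modality MRMC reader study from the
    # package's default Roe & Metz model configuration.
    config <- sim.gRoeMetz.config()
    dFrame <- sim.gRoeMetz(config)

    # MRMC variance analysis of the reader-averaged AUC for each modality
    # and of the AUC difference between the two modalities.
    result <- doIMRMC(dFrame)

    # U-statistic-based AUC estimates, variances, and confidence intervals
    # (field name assumed from the package documentation).
    result$Ustat

Second, a sketch of an Obuchowski-Rockette analysis with MRMCaov, assuming the mrmc() and empirical_auc() interface and the bundled VanDyke example data shown in the package README:

    # Sketch only: assumes the interface shown in the MRMCaov README.
    library(MRMCaov)

    # Obuchowski-Rockette ANOVA comparing two modalities (treatment) read
    # by multiple readers over the same cases, with the empirical AUC as
    # the per-reader performance metric.
    est <- mrmc(empirical_auc(truth, rating),
                treatment, reader, case, data = VanDyke)
    summary(est)

Both sketches report reader-averaged AUCs together with variance estimates that account for reader and case variability, which are the quantities an MRMC sizing or efficacy analysis requires.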