Improving the Rigor and Reproducibility of Flow Cytometry‐Based Clinical Research and Trials Through Automated Data Analysis
2019; Wiley; Volume 97, Issue 2; Language: English
DOI: 10.1002/cyto.a.23883
ISSN: 1552-4930
The steps that characterize data analysis for flow cytometry-based clinical trials can be grouped into quality assessment, compensation, normalization, transformation, cell population identification, cross-sample comparison (population mapping or matching), feature extraction, visualization, and interpretation. Of these steps, cell population identification (i.e., gating), which generates reportables such as cell population counts/percentages and MFI, is the focus of significant efforts to improve the rigor and reproducibility of data analysis. Here, rigor is the application of the scientific method to ensure unbiased and well-controlled analysis, interpretation, and reporting of results. Reproducibility is important because analyses are only validated when they can be duplicated by multiple scientists. It is especially important for clinical studies in light of what has been deemed a reproducibility crisis in medicine, as only 11% of a series of preclinical cancer studies could be replicated 1.

While manual gating is the gold standard and current practice for cell population identification, assessments of its reproducibility have identified it as a significant contributor of variation in flow cytometry studies, with interlaboratory CVs of up to 30% 2. While having a single operator analyze all the data can significantly reduce variability, that individual is still subject to the personal biases that produce inconsistencies within and across individuals, centers, and time. Relying on a single operator to improve reproducibility also does not scale for large studies, as manual gating can take from 45 to 90 min for one clinical sample 3. With clinical trials now involving thousands of patients assayed with 18 parameters, the challenges associated with the rigor and reproducibility of manual analysis of flow cytometry-based clinical data have only become more apparent and pressing.

The first algorithm for automated flow cytometry cell population identification was published in 1985, where it was noted that: "Unfortunately, the use of three or more independent fluorescent parameters complicates the analysis of the resulting data significantly" 4. While data complexity has increased significantly since then, automated analysis approaches for flow cytometry data have also matured. Results from automated data analysis algorithms have now reached a level of maturity that enables them to match, and in many cases exceed, the results produced by human experts. This maturity and acceptance have gone as far as inclusion in the documentation for the FDA approval of a first-in-class CAR-T therapy 5-7.

The overwhelming majority of automated gating methods are unsupervised. Unsupervised algorithms are approaches that work on data that do not come with predefined labels. With respect to FCM data, these algorithms find on their own commonalities in the features of cells that can be used to group them together into cell populations. As a class of approaches, they require no training and little or no parameterization, making them easy to use. An example of parameterization in the context of gating is choosing characteristics that can be used to classify cells into groups, such as predefining how many cell populations exist in the data set. These approaches use internal metrics to make decisions such as: how close events are to each other in multidimensional space based on the chosen distance measure, how many cell populations exist (e.g., the "k" in k-means clustering), or where to place a gate between groups of cells (where to make a cut between adjacent events). However, this is also a limitation of unsupervised approaches, as the varying size, shape, and distribution of cell populations within and across samples means that global parameterizations must settle for the best middle ground. Thus, the factors that make these algorithms easy to use also tend to limit their performance. Unsupervised algorithms also have no way of knowing how to label cell populations.

A metric commonly used to assess the performance of clustering approaches is the F1 measure. F1 is the harmonic mean of precision and recall, with a score of 1 indicating that every individual event was placed in the same cell population by both analysis methods. Here, manual gating is set as the gold standard, and any deviation from the manual results lowers the F1 measure from 1.0 (perfect agreement of every event across all gates measured). The mean F1 measure of the currently best-performing unsupervised algorithm is approximately 0.78, with a relatively large distribution of scores across cell populations 6.
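To make the F1 computation concrete, the following minimal R sketch clusters simulated (hypothetical) two-marker events with k-means and scores each population against reference manual labels. Note that k, the number of populations, must be predefined, illustrating the parameterization issue described above.

```r
# Minimal sketch (simulated data): cluster events with k-means and score
# agreement with manual labels using the per-population F1 measure.
set.seed(1)

# Simulate two markers for three cell populations of different sizes.
events <- rbind(
  matrix(rnorm(2000, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(1000, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(200,  mean = c(1, 3), sd = 0.3), ncol = 2, byrow = TRUE)
)
manual <- rep(c("A", "B", "C"), times = c(1000, 500, 100))

# Unsupervised step: the analyst must predefine k, the number of populations.
auto <- kmeans(events, centers = 3)$cluster

# F1 per population: harmonic mean of precision and recall, after matching
# each manual population to the cluster it overlaps most.
f1 <- sapply(unique(manual), function(pop) {
  cluster   <- as.integer(names(which.max(table(auto[manual == pop]))))
  tp        <- sum(manual == pop & auto == cluster)
  precision <- tp / sum(auto == cluster)
  recall    <- tp / sum(manual == pop)
  2 * precision * recall / (precision + recall)
})
round(f1, 3)
```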
However, a caveat of setting manual gating as the gold standard is that, even if an automated gating approach is "better," any differences will still be counted as mistakes on the part of the algorithm unless those differences are reviewed and the manual analysis is modified. In our experience, detailed review of discordant examples during training is extremely important, both to improve the parameterization and to avoid false conclusions about the seemingly poor rigor of automated methods. Regardless, the current level of performance may lead to a lack of adoption for clinical studies. However, unsupervised algorithms have been successfully used in some limited use cases 8. Similarly, dimensionality reduction techniques and visualization tools such as SPADE, t-SNE, Wanderlust, Citrus, and PhenoGraph are generally useful only for discovery, as they do not directly identify well-defined homogeneous cell populations in the traditional sense and cannot generate the required reportables 9.

As an alternative, supervised cell population identification methods are parameterized by users, based on expert knowledge, all the way down to individual cell populations and their unique characteristics observed in bivariate plots (e.g., their shape and distribution relative to other cell populations). A nonexhaustive overview of the main features of typical supervised and unsupervised analysis tools relative to manual gating was recently provided by Mair et al. (Fig. 1) 10. Supervised methods tend to outperform unsupervised methods when the goal is to replace manual gating for generating cell population statistics, as evaluated through open peer-reviewed comparative studies 3, 11-14. However, as with other automated methods, adoption has been limited by barriers such as the required bioinformatics expertise. Due to patient confidentiality and intellectual property concerns, flow cytometry data from clinical trials have not been published in the peer-reviewed literature. Two recent studies illustrate how automated approaches improved the rigor and reproducibility of the analysis.
In "Implementation and Validation of an Automated Flow Cytometry Analysis Pipeline for Human Immune Profiling," we described the practical implementation of an automated flow cytometry analysis pipeline for human immune profiling 13. Data were generated using two staining panels that identified effector and memory or helper and regulatory T cells. A core panel of 8–10 fluorochromes was used to identify immune subsets of interest to which three fluorochromes can be added to measure 6–15 activation markers per patient per time point, depending on the desired depth of analysis 13 After acquisition, FCS files were opened with the FlowJo software to adjust compensation and the resulting workspaces were read into R for all further data processing 15. While a significant focus has been rightly placed on automated gating, the adoption of automated approaches for cell population identification opens up additional avenues for improving the rigor and reproducibility of clinical trial data analysis within the analysis pipeline once data enter a computational stream. For example, rigor can also be improved through the automated identification of outliers. This starts at the event level, by analyzing various metrics of data acquisition such as flow rate and fluorescent measurement fluctuations within a sample and flagging or removing suspect data points 16, 17. Automated quality checking can also extend to postgating steps. The QUALIFIER algorithm uses the gating template to perform quality checks on gated populations and overall properties of a sample such as cell counts. It divides the data preprocessing from the actual outlier detection process so that the statistics are calculated all at once and the outlier detection and visualization can be done more efficiently and interactively 18. In the Conrad study, once pregating quality checking was competed using such approaches, gates were set by flowDensity. Unlike typical clustering algorithms that tend to identify populations by examining all dimensions simultaneously, flowDensity is based on a sequential bivariate gating approach that generates a set of predefined cell populations using a prespecified approach customized for each cell populations of interest. The algorithm mimics manual gating steps, but chooses the best cutoff for individual markers using characteristics of their density distribution such as the slope of density, or the minimum intersection point between the two peaks of the density (Fig. 2). For activation marker gating, a gate boundary set on the fluorescence minus one control was applied to matching samples with the fluorochrome present. Gates are set independently for each data file based on these rules. Once the parameterization of the gating steps is completed, gating steps run in sequence by a single script, only requiring manual input to specify the directory where the files are located and the desired location for the outputs. The approach tends to be robust, as long as new data files are generally similar to those used to set the gate boundaries. Conrad et al. tested how the analysis pipeline with flowDensity performed compared with manual gating conducted by domain experts on data from five patients. Manual and automated analyses were performed on the same 11 populations of interest over a total of 33 time points (2–3 tests per time point), yielding 1,077 matched populations. Comparison of these populations showed a significant correlation between the manual and automated analysis of the T cell panel (Fig. 3). 
Conrad et al. tested how the analysis pipeline with flowDensity performed compared with manual gating conducted by domain experts on data from five patients. Manual and automated analyses were performed on the same 11 populations of interest over a total of 33 time points (2–3 tests per time point), yielding 1,077 matched populations. Comparison of these populations showed a significant correlation between the manual and automated analysis of the T cell panel (Fig. 3). While the bulk of the populations showed strong agreement, manual gating of populations with few events was not matched as closely by automated analysis. The larger disagreement can be attributed primarily to the small number of events in those populations (the impact of small shifts in gate placement is magnified in smaller cell populations) and secondarily to the lack of resolution between populations (i.e., smears). When controls are not used for such rare populations, there is often no objective information available to guide either automated or manual gating. This makes the evaluation of automated analysis approaches for such cell populations especially challenging. Despite these differences, the automated analysis demonstrated the same trends as those obtained from manual gating, an essential criterion for using this method in longitudinal studies (Fig. 3).

We similarly demonstrated high reproducibility of a supervised analysis pipeline in peripheral blood from both healthy subjects and patients 10 days after hematopoietic stem-cell transplantation, using different instruments from different vendors and across centers 3. Like the Conrad et al. study, we developed an automated analysis workflow based on flowCore and flowDensity. Data were generated using DuraClone dry reagent technology (Beckman Coulter) with preformatted panel antibody cocktails. The performance of automated pipelines was assessed by their ability to match, on a per-event basis, the values obtained by an expert manual analyzer (i.e., the reference manual analysis), currently considered the "gold standard" approach. In addition, we compared the reference manual values with those obtained by two additional manual analyzers who followed an identical gating strategy. Automated analyses produced results that were highly correlated with those obtained in the reference manual analysis. For example, Spearman's rank correlation coefficients [rs] comparing the reference manual with automated analysis for the 14 basic panel cell populations were all >0.8. Most median F1 scores were >0.9, with an overall F1 average of 0.93.

To demonstrate robustness, we obtained an independent data set from the ONE Study, which used the same antibody panel and fluorescence intensity settings 19, and reanalyzed these data manually and with our automated pipelines. When all populations analyzed were combined for the two data sets, correlation values between automated and manual gating were all >0.9, demonstrating that automated gating pipelines developed with one set of data can readily be used to accurately analyze independent data if they are collected using the same standardized methodology. To test the robustness of the analysis pipelines to alternate instrument platforms, we analyzed parallel samples acquired on a Navios (Beckman Coulter, 3 lasers) or a Fortessa X20 (BD Biosciences, 4 lasers) and obtained correlation values >0.99. In general, we found that automated gating agreed slightly less with the reference manual analysis than the other manual gating did. As in the Conrad study, lower agreement occurred especially for low-abundance, poorly defined populations (e.g., plasmablasts). Indistinct boundaries, such as between CD14+ and CD14++ (particularly in the CD16+ population), as well as between CD16− and CD16+, led to variability in manual gating (Manual 1 and Manual 2 vs. reference manual, both rs = 0.83), as well as in automated versus reference manual analysis (rs = 0.83).
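Concordance statistics of this kind are straightforward to compute in base R. The following is a minimal sketch assuming a hypothetical data frame of matched population percentages from the reference manual and automated analyses.

```r
# Minimal sketch: concordance between matched manual and automated results.
# 'stats' is a hypothetical data frame with one row per sample/population
# pair and columns holding the population percentages from each analysis.
stats <- data.frame(
  manual    = c(12.1, 4.3, 33.0, 0.8, 21.5),
  automated = c(11.8, 4.6, 32.4, 1.1, 21.9)
)

# Spearman's rank correlation (rs), as reported in the studies above.
cor(stats$manual, stats$automated, method = "spearman")
```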
While rigor and robustness are important drivers for the adoption of automated analysis, computational approaches additionally provide significant efficiency gains through their rapid analyses (~1 min of computer time per set of 25 samples, from raw data to final spreadsheet, versus 10–20 expert hours for the equivalent manual analyses) 3. However, the performance of supervised methods comes at the cost of the time needed to choose the correct parameterization, which depends on the size and complexity of the panel and the number of cell populations being assayed. Parameterization takes on average about two weeks per panel, on par with how long it takes to set up and validate a manual gating hierarchy. Unlike manual gating, once setup is complete, automated gating is exactly reproducible on a given data set, with human effort limited to running the program a second time and processing taking about 60 s per file 3, 13.

However, supervised approaches need to be parameterized based on the heterogeneity that will be observed during the course of the clinical study. While relative shifting of populations is well tolerated (as shown by the high correlations obtained in the studies described above), it is important to have prior knowledge of the diversity of cell populations that occur across all samples of the study. If unexpected populations occur, the affected samples can be automatically flagged through quality checks on cell population statistics that fall outside the expected range (e.g., greater than three standard deviations from the mean percentage of that population across all samples analyzed to date). The approach can then be reparameterized for future samples. This is no different in principle from what would happen with manual analysis.
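The following minimal R sketch illustrates the three-standard-deviation flagging rule just described, using simulated (hypothetical) percentages for one population across samples.

```r
# Minimal sketch of the three-standard-deviation flagging rule described
# above, on simulated data: 'pct' holds one population's percentage for
# all samples analyzed to date, with one aberrant sample appended.
set.seed(1)
pct <- c(rnorm(29, mean = 14, sd = 0.5), 29.7)

flagged <- abs(pct - mean(pct)) > 3 * sd(pct)
which(flagged)  # indexes the aberrant sample(s) to review before
                # reparameterizing the pipeline for future samples
```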
Machine learning holds the promise of reducing the time required to parameterize supervised algorithms 20, 21. However, like unsupervised methods, these approaches have not yet shown the necessary performance.

Moving to automated analysis also has the advantage of directly improving the ability to reproduce experiments, an essential part of the scientific process. The Minimum Information About a Flow Cytometry Experiment (MIFlowCyt) standard is one adopted way to provide the key pieces of information needed to reproduce results, and it includes specification of the analysis performed on the resulting data. Documentation of manual gating protocols often includes descriptions such as setting gate boundaries "by eye" with no rules to follow, or even just images of how gates are set on a single sample with no description of the logic used. Automated cell population identification, by contrast, fully describes the process used to set gate boundaries.

Understanding of the algorithms available for analysis is aided by the choice of R/Bioconductor as the programming environment for the overwhelming majority (more than 50 to date) of available automated flow cytometry data analysis tools. R is a free, open-source software environment for statistical computing, machine learning, and graphics that runs on UNIX, Windows, and MacOS, and it has an extremely large, worldwide community of users, developers, and contributors. Bioconductor is a repository of R-based tools for the analysis and comprehension of high-throughput genomic data 22, 23, and it provides a structured platform enforcing compatibility and documentation. A main project goal stated by Bioconductor is to further scientific understanding by producing high-quality documentation and reproducible research. Each Bioconductor package contains one or more vignettes, documents that provide a textual, task-oriented description of the package's functionality. At a minimum, these vignettes demonstrate the algorithm's functionality through worked examples on real data that are provided alongside the code. Almost all of these algorithms have accompanying peer-reviewed publications.

In addition, literate programming can be used to make results not only reproducible but easily understandable. It further promotes rigor through improved reporting by changing how software is written, from telling a computer what to do to explaining to humans what we want the computer to do 24. Furthering reproducibility, R Markdown provides a mechanism to generate fully reproducible documents with embedded results generated computationally by the document itself, based only on the provided FCS files. Developers can weave together descriptive text and code by using R Markdown and knitr to produce formatted output, such as a Word or PDF file, that can include dynamically generated figures illustrating gating hierarchies and reportables.

Given a set of code and data, exact results should intuitively be reproducible by another scientist. While sharing R scripts along with an R Markdown document should ensure reproducibility, many analyses also rely on additional resources and specific third-party software. Code may produce unexpected results or errors when executed under a different version of R or on another platform, so reproducibility is only assured by providing complete setup instructions and resources. However, regularly maintaining code so that it always works with the latest software versions, in order to ensure long-term reproducibility, can be a challenge. As an alternative, it is possible to provide the full compute environment in its original state, using virtual machines or software containers. This approach ensures reproducibility by removing the unknown contributions of subtle pipeline changes. Thus, the final piece in ensuring reproducible results is the use of virtual containers, which isolate and bundle applications into portable, self-contained units. Containers bundle up all the dependencies of a local system so that code runs reproducibly across host operating systems indefinitely. They also enable reproducible scaling of computation on high-performance computing systems and distribution between sites. While Docker is a popular container choice, Singularity provides additional features, such as reproducible software stacks that can be verified by a checksum and support for a model of untrusted users running untrusted containers.

Implementation of automated methods is not without potential challenges. The available methods are generally the result of academic research. While there is commercial software that provides out-of-the-box 21 CFR Part 11 compliance features, these packages are designed around the manual analysis process and are not tailored specifically with automation in mind. However, the latest versions of gating software from major third-party vendors do support R integration. In addition, through the use of R/Shiny, it is possible to very quickly generate the simple user interface required for automated batch processing. This essentially requires just the specification of the location of the FCS files to be processed and the desired location of the output files (e.g., CSV files containing reportables, lists of flagged files, commercial software workspaces). The process can be simplified even further by having the system monitor a directory and process FCS files as they are saved there by the instrument.
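As an illustration of how simple such an interface can be, the following minimal R/Shiny sketch wires the two required inputs to a gating script; run_pipeline() is a hypothetical stand-in for a flowDensity-based pipeline, not an actual package function.

```r
# Minimal R/Shiny sketch of a batch-processing front end. The gating
# pipeline itself is abstracted as run_pipeline(), a hypothetical
# stand-in for a script that gates each file and writes reportables.
library(shiny)

run_pipeline <- function(in_dir, out_dir) {
  fcs_files <- list.files(in_dir, pattern = "\\.fcs$", full.names = TRUE)
  # ... gate each file and write CSV reportables/flags to out_dir ...
  length(fcs_files)
}

ui <- fluidPage(
  textInput("in_dir",  "Directory containing FCS files"),
  textInput("out_dir", "Directory for output files"),
  actionButton("go", "Run automated analysis"),
  textOutput("status")
)

server <- function(input, output) {
  observeEvent(input$go, {
    n <- run_pipeline(input$in_dir, input$out_dir)
    output$status <- renderText(paste(n, "FCS files processed"))
  })
}

shinyApp(ui, server)
```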
Data-driven analysis pipelines based on supervised gating methods can result in improved reproducibility and reliability, and reduced effort, in large clinical cohorts where the goal is to assess phenotypically well-known and biologically relevant populations 10. However, adoption of such methods is in its infancy. Factors that limit the uptake of computational approaches include a lack of awareness and trust, as well as limited access to bioinformaticians or computational training among immunologists. Algorithmic approaches tend to be published in computational journals not commonly read by immunologists. As these methods are adopted and applied to biological data sets to make discoveries that warrant publication in high-impact journals, awareness and trust will increase. Adding to the inevitability of increased adoption over time, each new generation of scientists will, on average, be more computationally inclined than the last. More extensive dissemination of basic experimental guidelines to ensure high quality of raw data is pivotal for the success of any analysis pipeline, whether manual or automated 10, 25. Finally, as methods continue to progress in their capabilities and performance, reasons to remain rooted in an era of manual analysis will become even less compelling than they already are. On that note, supervised gating methods have the specific advantage, already shown through peer-reviewed studies and approved clinical trials, that they can be robust, reproducible, and faster than manual analysis for the analysis of clinical data, and that they outperform unsupervised methods when the goal is to identify specific cell populations of interest.