CyTOF Data Analysis:
Discrimination versus Correlation?
Günter Valet
Multiparameter flow cytometry: Data analysis in multiparameter flow cytometry is essential for understanding experimental results. Particular challenges arise from the rising number of parameters, as in mass cytometry (CyTOF). These concern data display as well as information and knowledge extraction. Early analysis in the clinical context aimed at diagnosis as well as at individualized predictions of the therapy-related disease course or disease outcome in patients (1, 2, 3, 4).
Correlation analysis: After the sequencing of the human genome (5), it seemed promising to analyze the multitude of spots from patient cell or tissue lysates on RNA expression arrays by hierarchical correlation analysis (6) rather than to increase the number of single-cell parameters in flow cytometry. Similarly behaving parameters were rearranged during the correlative analysis to stratify diseased patients into different prognosis groups (cohort future) (7, 8). Patient stratification in medicine is a makeshift solution, resulting from the inability to predict the further disease course and outcome in individual patients. Stratification frequently helps to identify patient groups with higher therapeutic susceptibility. Patients in stratified groups are, however, mixed in varying proportions with patients from other groups. This means that the molecular signatures of stratified patient groups are "contaminated". It is therefore important to strive for individualized predictions by flow and image cytometry (predictive medicine by cytomics).
Cells, as the elementary functional units of organisms, carry molecular information about the present (diagnosis) and the future (prediction). The multimolecular analysis of individual cells provides more information than the molecular mixtures in cell lysates or homogenates. With this idea, it seemed worthwhile to further develop flow cytometers capable of measuring more parameters than the early 3- or 4-parameter instruments, for example 17 (9) to 20 parameters. The construction of mass cytometry flow and image cytometers (CyTOF) collecting around 40 parameters (10) provided an enormous gain of information, with new opportunities for individualized predictions in medicine. Data analysis remains, however, mostly correlative (12, 13), thus missing a substantial part of the discrimination potential of multiparameter cytometry.
Discrimination analysis: Discrimination analysis is primarily directed towards individualized conclusions in high-complexity data analysis. Classification may be probabilistic, using for example statistical Bayes classifiers (14), but is more easily algorithmic. Algorithms typically require no assumptions about parameter distributions, need no reconstitution of outliers or missing values, are readily suitable for interlaboratory standardization, and aim at accuracies greater than 95% or 99%.
The discriminating CLASSIF1 algorithm (2, 4) starts with the transformation of numeric parameter values into the triple matrix characters (-), (0) and (+), depending on whether the values lie below, between or above pairs of predetermined percentile thresholds such as 5/95, 10/90, 15/85, 20/80, 25/75 and 30/70%. The resulting multiparameter triple matrix is iteratively classified to optimally discriminate between different patient categories such as therapy responders/non-responders, survivors/non-survivors, or future statuses such as improvement, stationary or deterioration. Parameters that do not improve the classification are removed at the end of the classification process. Correlated information is not very discriminatory and is automatically eliminated during the iterative parameter selection, because the presence of such parameters does not improve the classification result. The iterations lead to disease classification masks that typically contain 5-15 parameters, to avoid overdetermination for patient sets of 100-300 patients. Such classification masks are obtained by stepwise lowering of the percentile thresholds for the learning set from initially 30/70% to typically 15/85% or 10/90%. Disease classification masks for presently up to 10 disease states from up to 64000 parameters can be obtained in this way.
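The first step, the transformation of numeric values into triple matrix characters, can be sketched in Python as follows. This is a minimal illustration and not the CLASSIF1 implementation: the function names, the linear-interpolation percentile, the computation of the thresholds from a reference set of values, and the treatment of values lying exactly on a threshold as (0) are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile (0-100), interpolating linearly between ranks."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def triple_char(value: float, low: float, high: float) -> str:
    """Map one value to '-', '0' or '+' relative to a threshold pair.

    Assumption: values equal to a threshold count as '0'.
    """
    if value < low:
        return "-"
    if value > high:
        return "+"
    return "0"


def triple_matrix(
    reference: Dict[str, Sequence[float]],
    samples: List[Dict[str, float]],
    low_pct: float = 10.0,
    high_pct: float = 90.0,
) -> List[Dict[str, str]]:
    """Convert each sample into triple matrix characters, parameter by parameter.

    Assumption: the thresholds of each parameter are the low_pct/high_pct
    percentiles of that parameter's values in a reference set.
    """
    thresholds = {
        name: (percentile(vals, low_pct), percentile(vals, high_pct))
        for name, vals in reference.items()
    }
    return [
        {name: triple_char(sample[name], *thresholds[name]) for name in thresholds}
        for sample in samples
    ]


# Hypothetical example data: one parameter, 10/90% thresholds.
reference = {"CD4": list(range(101))}
samples = [{"CD4": 5}, {"CD4": 50}, {"CD4": 95}]
print(triple_matrix(reference, samples))
```

Other threshold pairs from the text, such as 30/70% or 15/85%, can be tried by changing `low_pct` and `high_pct`. The iterative parameter selection itself is not covered by this sketch.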
Patients 1,5,10,15 ... of each data set are assigned to the unknown validation set prior to learning, to assess the capacity of the classifier to correctly identify unknown patients. Classifier robustness is subsequently checked by sequentially assigning patients 2,6,11,16 ..., 3,7,12,17 ..., and 4,8,13,18 ... to the unknown validation set prior to relearning (cross classification). CLASSIF1 classifiers are typically robust against cross classification.
Upon completion of classifier learning and validation, a subsequent correlation analysis using the parameters of the determined disease classification mask may detect unknown multistep processes. In this way, patients may benefit immediately from new classifiers, while scientists obtain highly discriminatory parameter patterns for further investigation of correlated disease-inducing or disease-sustaining processes.
Proposal data repository: It is proposed that publications with CyTOF or other multiparameter cytometry heatmaps contain an Internet link to the heatmap matrix (individual patients (x), discriminating parameter intensities (y)) together with supplementary patient information (disease outcome, age, sex, etc.) in a public Internet data repository for further scientific analysis, similar to RNA expression array data in the GEO database (15). This could be a section of the FlowRepository (16) for heatmap data obtained by standardized CyTOF FCS measurement and evaluation, available for analysis by other classification methods, as shown in the case of RNA expression arrays (17).
Proposal direct classification: Data in Excel files, or alternatively in ASCII files with parameter values separated by blanks, tabs or commas (CSV), can also be classified directly after notifying the author (see "Mail" below). The first line of such files should contain the sample IDs, the first column the parameter identifiers such as CD-antigen names or gene codes. Anonymized patient information (age, sex, disease state, outcome, etc.) should be provided in a separate file of the same format. Data of at least 40-50 individuals are required initially to define learning sets of 30-40 patients and unknown validation (test) sets containing data of 8-10 patients.
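A file in this layout could be checked before submission with a short Python sketch like the one below. It is only an illustration under stated assumptions: the function name `parse_matrix` is hypothetical, the sample IDs are taken from the first line and the parameter identifier from the start of each following line, an optional label in the top-left corner is tolerated, and missing values or identifiers containing blanks are not handled.

```python
import re
from typing import Dict, List, Tuple

# Fields may be separated by commas, tabs or blanks, as described in the text.
_SEPARATORS = re.compile(r"[,\t ]+")


def parse_matrix(text: str) -> Tuple[List[str], Dict[str, List[float]]]:
    """Parse a delimited parameter matrix into sample IDs and per-parameter values.

    Assumptions: line 1 holds the sample IDs (optionally preceded by a corner
    label), every following line starts with a parameter identifier, and no
    values are missing.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("need a header line and at least one parameter line")
    header = _SEPARATORS.split(lines[0])
    rows = [_SEPARATORS.split(ln) for ln in lines[1:]]
    width = len(rows[0])
    if len(header) == width:        # corner label present, drop it
        sample_ids = header[1:]
    elif len(header) == width - 1:  # header lists the sample IDs only
        sample_ids = header
    else:
        raise ValueError("header does not match the width of the data lines")
    matrix: Dict[str, List[float]] = {}
    for fields in rows:
        if len(fields) != len(sample_ids) + 1:
            raise ValueError(f"line for {fields[0]!r} has a wrong number of values")
        matrix[fields[0]] = [float(v) for v in fields[1:]]
    return sample_ids, matrix


# Hypothetical example content; a real file would hold at least 40-50 individuals.
example = "ID P1 P2 P3\nCD3 1.0 2.5 3\nCD4 0.5 0.1 9\n"
ids, values = parse_matrix(example)
print(ids, values)
```

A simple size check such as `len(ids) >= 40` could then be used to confirm that enough individuals are present for a learning set and a validation set.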
References:
1.
Valet G, Warnecke HH, Kahle H.
Automated diagnosis of malignant and other abnormal cells by flow-cytometry
using the newly developed DIAGNOS1 program system. In:
Proc Int Symp Histometry 1986, Eds: Burger G, Ploem B, Goerttler K,
Academic Press, London, p 58-67 (1987)
2.
Valet G, Valet M, Tschöpe D, Gabriel H, Rothe G, Kellermann W, Kahle H.
White cell and thrombocyte disorders: Standardized, self-learning flow cytometric
list mode data classification with the CLASSIF1 program system.
Ann NY Acad Sci 677:233-251 (1993)
3.
Valet G.
Human cytome project: A new potential for drug discovery. In:
Las Omicas genomica, proteomica, citomica y metabolomica: modernas
tecnologias para desarrollo de farmacos.
Ed: Real Academia Nacional de Farmacia, Madrid, p 207-228 (2005)
4.
Valet G.
CLASSIF1 data analysis in multiparameter flow cytometry.
(https://www.classimed.de/classif1.html)
5.
International Human Genome Sequencing Consortium Publishes Sequence and Analysis of the Human Genome
February 12, 2001
6.
Eisen MB, Spellman PT, Brown PO, Botstein D.
Cluster analysis and display of genome-wide expression patterns.
PNAS USA 95:14863–14868 (1998)
7.
Rosenwald AR et al.
The Use of Molecular Profiling to Predict Survival after Chemotherapy
for Diffuse Large B-cell Lymphoma.
NEJM 346:1937-1947 (2002)
8.
Grau M, Lenz G, Lenz P.
Dissection of gene expression datasets into clinically relevant interaction
signatures via high-dimensional correlation maximization.
Nat Comm 10:5417 https://doi.org/10.1038/s41467-019-12713-5 (2019)
9.
Perfetto S, Chattopadhyay P, Roederer M.
Seventeen-colour flow cytometry: unravelling the immune system.
Nat Rev Immunol 4:648–655 https://doi.org/10.1038/nri1416 (2004).
10.
Bendall SC, Nolan GP, Roederer M, Chattopadhyay PK.
A Deep Profiler’s Guide to Cytometry.
Trends Immunol 33:323-330 (2012)
12.
Krieg C et al.
High-dimensional single-cell analysis predicts response
to anti-PD-1 immunotherapy.
Nature Medicine 24:144-153 (2018)
13.
Fu W et al.
CyTOF Analysis Reveals a Distinct Immunosuppressive
Microenvironment in IDH Mutant
Anaplastic Gliomas.
Front Oncol 10:560211 (2021)
doi: 10.3389/fonc.2020.560211
14.
Lakoumentas J et al.
Optimizations of the naive-Bayes classifier for the prognosis of
B-Chronic Lymphatic Leukemia incorporating flow cytometry.
Comp Meth Prog Biomed 108:158-167 (2012)
15.
Gene Expression Omnibus (GEO) database
(https://www.ncbi.nlm.nih.gov/geo/).
16.
FlowRepository (http://flowrepository.org)
17.
DLBCL reanalysis (https://www.classimed.de/dlbcl1.html)
© 2024 G. Valet