
CyTOF Data Analysis:
Discrimination or Correlation?

Günter Valet

Multiparameter flow cytometry: Data analysis in multiparameter flow cytometry is of high importance for understanding experimental results. Particular challenges emerge from the rising number of parameters, as in mass cytometry (CyTOF). This concerns data display but equally information and knowledge extraction. Early analysis in the clinical context aimed at diagnosis as well as at individualized predictions concerning the therapy-related disease course or disease outcome in patients (1, 2, 3, 4).

Correlation analysis: After the sequencing of the human genome (5), it seemed promising to analyze the multitude of spots from patient cell or tissue lysates on RNA expression arrays by hierarchical correlation analysis (6) rather than to enlarge the number of single-cell parameters in flow cytometry. Similarly behaving parameters were rearranged during the correlative analysis to stratify diseased patients into different prognosis groups (cohort future) (7, 8). Patient stratification in medicine is a makeshift solution, owing to the inability to predict the further disease course and outcome in individual patients. Stratification frequently helps to identify patient groups with higher therapeutic susceptibility. Patients in stratified groups are, however, mixed in various proportions with patients from other groups, which means that the molecular signatures of stratified patient groups are "contaminated". It is therefore important to strive for individualized predictions by flow and image cytometry (predictive medicine by cytomics) to help patients, clinicians and scientists in the effort of obtaining disease-state-specific molecular signatures for further studies. This is possible by using discriminating instead of correlating data analysis.

CyTOF data analysis (PPT)

Correlation (similarity) analysis is useful for exploring closely linked entities in species development via protein (hemoglobin) or DNA sequences, with the aim of clarifying phylogenetic relations in animals, plants or man, for example the relation between Homo neanderthalensis and Homo sapiens. Correlation analysis is also of interest for detecting closely linked components in multistep processes such as metabolic chains, but this type of analysis typically yields no individualized predictions concerning disease progress or outcome in patients.

Cells as elementary function units of organisms carry molecular information about the present (diagnosis) and the future (prediction). The multimolecular analysis of individual cells provides more information than the molecular mixtures in cell lysates or homogenates. With this idea, it seemed worthwhile to further develop flow cytometers capable of measuring more parameters than the early 3- or 4-parameter instruments, for example between 17 (9) and 20 parameters. The construction of mass cytometry flow and image cytometers (CyTOF) collecting around 40 parameters (10) provided an enormous gain of information, with new opportunities for individualized predictions in medicine. Data analysis remains, however, mostly correlative (12, 13), thus missing a substantial part of the discrimination potential of multiparameter cytometry.

Discrimination analysis: Discrimination analysis is primarily directed towards individualized conclusions in high-complexity data analysis. Classification may be probabilistic, using for example statistical Bayes classifiers (14), but is more easily done algorithmically. Algorithms typically require no assumptions concerning parameter distributions and no reconstitution of outliers or missing values, they are readily suitable for interlaboratory standardization, and they aim at accuracies greater than 95% or 99%.

The discriminating CLASSIF1 algorithm (2, 4) starts with the transformation of numeric parameter values into the triple matrix characters (-), (0) and (+), depending on whether the values lie below, between or above pairs of predetermined percentile thresholds such as 5/95, 10/90, 15/85, 20/80, 25/75 and 30/70%. The resulting multiparameter triple matrix is iteratively classified to optimally discriminate between different patient categories such as therapy responders/non-responders, survivors/non-survivors, or future statuses such as improvement, stationary or deterioration. Parameters that do not improve the classification are removed at the end of the classification process. Correlated information is not very discriminatory and is automatically eliminated during the iterative parameter selection process, because the presence of such parameters does not improve the classification result. The iterations lead to disease classification masks that typically contain between 5 and 15 parameters, to avoid overdetermination for patient sets of 100 to 300 patients. Such classification masks are obtained by stepwise lowering of the percentile thresholds for the learning set from initially 30/70% to typically 15/85% or 10/90%. Disease classification masks for presently up to 10 disease states from up to 64000 parameters can be obtained in this way.
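The first step, the transformation into triple matrix characters, can be illustrated with a minimal Python sketch. It is not the CLASSIF1 implementation: the function names `percentile` and `triple_matrix` are hypothetical, the percentile thresholds are assumed to be computed per parameter from a reference group of samples, and linear interpolation between sorted values is an assumed convention.

```python
def percentile(sorted_vals, p):
    """Percentile p (0..100) of a pre-sorted list, by linear interpolation (assumed convention)."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def triple_matrix(reference, samples, lower_pct=10, upper_pct=90):
    """Encode each sample as a string of '-', '0', '+' characters, one per parameter.

    reference: list of rows (one per reference sample), used to derive thresholds.
               Taking thresholds from a reference group is an assumption of this sketch.
    samples:   list of rows to encode, same parameter order as reference.
    """
    n_params = len(reference[0])
    thresholds = []
    for j in range(n_params):
        column = sorted(row[j] for row in reference)
        thresholds.append((percentile(column, lower_pct),
                           percentile(column, upper_pct)))

    encoded = []
    for row in samples:
        chars = []
        for value, (low, high) in zip(row, thresholds):
            if value < low:
                chars.append("-")   # below the lower percentile threshold
            elif value > high:
                chars.append("+")   # above the upper percentile threshold
            else:
                chars.append("0")   # between the two thresholds
        encoded.append("".join(chars))
    return encoded
```

Calling `triple_matrix` repeatedly with the pairs 30/70, 25/75, 20/80 and so on would give the successive encodings on which an iterative classifier could be evaluated; the parameter selection itself is not shown here.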

Patients 1, 5, 10, 15 ... of each data set are assigned to the unknown validation set prior to learning, to assess the capacity of the classifier to correctly identify unknown patients. Classifier robustness is subsequently checked by sequentially assigning patients 2, 6, 11, 16 ..., 3, 7, 12, 17 ..., and 4, 8, 13, 18 ... to the unknown validation set prior to relearning (cross classification). CLASSIF1 classifiers are typically robust against cross classification.
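The assignment of patients to validation sets can be sketched as follows. The function names are hypothetical, and continuing each listed sequence in steps of 5 beyond the four numbers given above is an assumption of this sketch, not a documented rule of CLASSIF1.

```python
def validation_ids(n_patients, start):
    """Return the 1-based patient numbers held out for validation.

    Reproduces the listed pattern start, start+4, start+9, start+14, ...
    (for start=1: 1, 5, 10, 15). Steps of 5 after the second entry are
    assumed to continue to the end of the data set.
    """
    if start > n_patients:
        return []
    return [start] + list(range(start + 4, n_patients + 1, 5))


def cross_classification_splits(n_patients, starts=(1, 2, 3, 4)):
    """Yield (learning_ids, validation_ids) for each cross classification round."""
    for start in starts:
        held_out = validation_ids(n_patients, start)
        held_out_set = set(held_out)
        learning = [i for i in range(1, n_patients + 1) if i not in held_out_set]
        yield learning, held_out


if __name__ == "__main__":
    # Show which patients are held out in each round for a set of 50 patients.
    for learning, held_out in cross_classification_splits(50):
        print(len(learning), "learning patients; validation set:", held_out)
```

Each round relearns the classifier on the learning patients only and then classifies the held-out patients as unknowns.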

Upon completion of classifier learning and validation, a subsequent correlation analysis using the parameters of the determined disease classification mask may detect unknown multistep processes. In this way, patients may benefit immediately from new classifiers, while scientists obtain highly discriminatory parameter patterns for further investigation of correlated disease-inducing or disease-sustaining processes.

Proposal data repository: It is proposed that publications with CyTOF or other multiparameter cytometry heatmaps contain an Internet link to the heatmap matrix (individual patients (x), discriminating parameter intensities (y)) together with supplementary patient information (disease outcome, age, sex etc.) in a public Internet data repository for further scientific analysis, similar to RNA expression array data in the GEO database (15). This could be a section of the FlowRepository (16) for heatmap data obtained by standardized CyTOF FCS measurement and evaluation, available for analysis by other classification methods as shown in the case of RNA expression arrays (17).

Proposal direct classification: Data in Excel files, or alternatively in ASCII files with parameter values separated by blanks, tabs or commas (CSV), can also be classified directly, following notice to the author (see "Mail" below). The first line of such files should contain the sample IDs, the first column the parameter identifiers such as CD antigen names or gene codes. Anonymized patient information (age, sex, disease state, outcome etc.) should be provided in a separate file of the same format. Data of at least 40-50 individuals are required initially to define learning sets of 30-40 patients and unknown validation (test) sets containing data of 8-10 patients.
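A file of this kind could be read with a short Python sketch such as the one below. The layout is an assumption: sample IDs in the first line, then one line per parameter with the parameter identifier as the first field. The function name `parse_matrix` is hypothetical, an optional corner label in the header line is tolerated, and empty fields (missing values) are not handled.

```python
import re

# Fields may be separated by blanks, tabs or commas, in any mix.
_SEPARATOR = re.compile(r"[,\t ]+")


def _fields(line):
    """Split one line into its non-empty fields."""
    return [f for f in _SEPARATOR.split(line.strip()) if f]


def parse_matrix(text):
    """Return (sample_ids, table), where table maps parameter identifier -> list of floats."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("need a header line and at least one parameter line")

    header = _fields(lines[0])
    table = {}
    for line in lines[1:]:
        name, *values = _fields(line)
        table[name] = [float(v) for v in values]

    n_samples = len(next(iter(table.values())))
    if len(header) == n_samples + 1:
        header = header[1:]  # drop an assumed corner label above the identifier column
    if len(header) != n_samples or any(len(v) != n_samples for v in table.values()):
        raise ValueError("number of values does not match number of sample IDs")
    return header, table
```

For example, `parse_matrix("ID P1 P2\nCD4,1.0,2.5\nCD8\t4\t5\n")` yields the sample IDs `P1` and `P2` and one value list each for `CD4` and `CD8`.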

  1. Valet G, Warnecke HH, Kahle H. Automated diagnosis of malignant and other abnormal cells by flow-cytometry using the newly developed DIAGNOS1 program system. In:
Proc Int Symp Histometry 1986, Eds: Burger G, Ploem B, Goerttler K, Academic Press, London, p 58-67 (1987)
  2. Valet G, Valet M, Tschöpe D, Gabriel H, Rothe G, Kellermann W, Kahle H. White cell and thrombocyte disorders: Standardized, self-learning flow cytometric list mode data classification with the CLASSIF1 program system.
Ann NY Acad Sci 677:233-251 (1993)
  3. Valet G. Human cytome project: A new potential for drug discovery. In:
Las Omicas genomica, proteomica, citomica y metabolomica: modernas tecnologias para desarrollo de farmacos.
Ed: Real Academia Nacional de Farmacia, Madrid, p 207-228 (2005)
  4. Valet G. CLASSIF1 data analysis in multiparameter flow cytometry.
  5. International Human Genome Sequencing Consortium Publishes Sequence and Analysis of the Human Genome, February 12, 2001
  6. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns.
PNAS USA 95:14863-14868 (1998)
  7. Rosenwald AR et al. The Use of Molecular Profiling to Predict Survival after Chemotherapy for Diffuse Large B-cell Lymphoma.
NEJM 346:1937-1947 (2002)
  8. Grau M, Lenz G, Lenz P. Dissection of gene expression datasets into clinically relevant interaction signatures via high-dimensional correlation maximization.
Nat Comm 10:5417 (2019)
  9. Perfetto S, Chattopadhyay P, Roederer M. Seventeen-colour flow cytometry: unravelling the immune system.
Nat Rev Immunol 4:648-655 (2004)
  10. Bendall SC, Nolan GP, Roederer M, Chattopadhyay PK. A Deep Profiler's Guide to Cytometry.
Trends Immunol 33:323-330 (2012)
  12. Krieg C et al. High-dimensional single-cell analysis predicts response to anti-PD-1 immunotherapy.
Nature Medicine 24:144-153 (2018)
  13. Fu W et al. CyTOF Analysis Reveals a Distinct Immunosuppressive Microenvironment in IDH Mutant Anaplastic Gliomas.
Front Oncol 10:560211 (2021) doi: 10.3389/fonc.2020.560211
  14. Lakoumentas J et al. Optimizations of the naive-Bayes classifier for the prognosis of B-Chronic Lymphatic Leukemia incorporating flow cytometry.
Comp Meth Prog Biomed 108:158-167 (2012)
  15. Gene Expression Omnibus (GEO) database
  16. FlowRepository
  17. DLBCL reanalysis

© 2024 G.Valet
last update: Jun 04,2022
first display: Feb 01,2022