Medical Bioinformatics in Cytomics |
- Multiparameter flow cytometry data analysis typically concerns the determination of cell frequency within one or several multidimensional gates. An essential part of the potentially useful information, like fluorescence intensities, average fluorescence surface densities, intercolour fluorescence ratios, coefficients of variation of the fluorescence, light scatter intensity or scatter and fluorescence ratio distributions of the various cell populations, remains unconsidered in this way. The evaluation of the many data columns deriving from such exhaustive analysis of flow cytometric data requires an easily scalable data classification method.
![]() fig.1 Principle of data sieving |
The CLASSIF1 algorithm (1993) performs an unsupervised, self-learning parallel classification of an essentially unlimited number of data columns. The columns are combined, for example, in packages of 50 columns. The most discriminatory columns (fig.1) for up to 10 classification categories are iteratively determined (data sieving) for each package by percentile analysis. The five most informative columns are inserted into new packages, followed by reclassification until 50 or fewer data columns remain for data pattern classification, providing disease classification masks with typically the 5 to 15 most discriminatory data columns (parameters) (fig.12, rightmost table column). Plotting the cumulative value distributions of the selected mask parameters for healthy individuals and infarction risk patients in a ROC (receiver operating characteristic) like way shows their classification potential (fig.1a), although the individual curves do not reach the overall potential of data pattern classification (fig.6A/6B). |
![]() fig.1a ROC curves for the best five discriminators |
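The data sieving loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the CLASSIF1 implementation: the function name `sieve_columns`, its parameters and the `score` callback are hypothetical, and the toy score used in the example (value range of a column) merely stands in for the percentile analysis, which is not specified here in enough detail to reproduce.

```python
from typing import Callable, Dict, List, Sequence


def sieve_columns(
    columns: Dict[str, Sequence[float]],
    score: Callable[[Sequence[float]], float],
    package_size: int = 50,
    keep_per_package: int = 5,
) -> List[str]:
    """Keep the best-scoring columns of each package and repeat
    until at most `package_size` column names remain."""
    names = list(columns)
    while len(names) > package_size:
        survivors: List[str] = []
        for start in range(0, len(names), package_size):
            package = names[start:start + package_size]
            # Rank the columns of this package, most discriminatory first.
            ranked = sorted(package, key=lambda n: score(columns[n]), reverse=True)
            survivors.extend(ranked[:keep_per_package])
        names = survivors
    return names


# Toy example (invented data): 120 columns, scored by their value range.
columns = {f"c{i}": [0.0, float(i)] for i in range(120)}
selected = sieve_columns(columns, score=lambda v: max(v) - min(v))
print(len(selected), selected[:5])
```

With 120 columns the loop forms three packages and keeps five columns from each, so 15 columns remain and are passed on to data pattern classification.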
CLASSIF1 classification is: |
![]() fig.12 data pattern of most discriminatory classification parameters |
- The potential of data pattern classification in cytomics (system cytometry (1997), cell systems research) (fig.12) lies in the exhaustive knowledge extraction from flow cytometric (single cell molecular bioprofiles) or other multiparameter data by the determination of the most discriminatory data patterns for diagnosis or individualized prediction of disease progression (disease course prediction, outcome prediction). It permits the development of standardized, instrument and laboratory independent data pattern classifiers from flow cytometric list mode, flow bead array, high content image analysis, cDNA (Lymphochip, Affymetrix) or protein expression chip arrays, clinical chemistry, biomedical or clinical data for predictive medicine by cytomics (references: bioinformatics, medical and clinical cytomics, further references).
Numeric data values are first transformed into
triple matrix characters
(fig.2)
in the following way:
- Lower and upper percentiles like 10% and 90%
(fig.3)
are calculated for each data column of reference patients.
- Data column values are then transformed
(fig.3)
by assigning:
(- =diminished)
to values below the lower percentile,
(0 =unchanged)
to values between the lower and upper percentiles and
(+ =increased)
to values above the upper percentile.
- Disease classification masks ("disease signatures")
(fig.3a)
for each classification category are determined from the most
frequent triple matrix character in each data column of the learning set.
Individual patients are classified according to the highest positional
coincidence between the patient classification mask
("patient signature") and any one of the disease classification masks.
- A classification (confusion) matrix is established between the
known predictive or diagnostic clinical classification of reference
and abnormal patient samples and the CLASSIF1 triple matrix
classification into the same classification categories.
An ideal classification result is characterized by 100%
specificity and sensitivity values in the diagonal boxes
as well as for the negative and positive predictive values,
while the values in all other boxes are 0%
(fig.4).
- The data columns, as in the case of myocardial infarction risk
assessment, are obtained by flow cytometric determination of
activation antigens on
peripheral blood thrombocytes.
Four databases, each containing eleven parameters
(fig.5),
were calculated from the thrombocyte evaluation windows (gates) in
two parameter histograms.
They were obtained by projection of flow cytometric
four parameter list mode data either onto the forward (FSC)/sideward (SSC)
light scatter or onto the FSC/fluorescence_1 plane to calculate the expression
of thrombocyte surface antigens CD62, CD63 and thrombospondin as
well as of spontaneously attached IgG in normal individuals
or angiographically identified myocardial risk patients.
- Correct classification for all patient samples of
the learning set as well as for the unknown test set
(validation set) of patients is achieved
(fig.6A,6B),
indicating that the CLASSIF1 algorithm has identified a
suitable discriminative data pattern of thrombocyte parameters
(fig.7).
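The transformation into triple matrix characters, the construction of disease classification masks and the classification by positional coincidence can be illustrated with a short Python sketch. It is an illustration under assumptions, not the CLASSIF1 code: all function names are hypothetical, the percentile is computed by linear interpolation (the original method is not stated), and the toy data are invented.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], p: float) -> float:
    """Percentile by linear interpolation between sorted values (assumed method)."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100.0
    lower = int(k)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (k - lower)


def thresholds(reference: List[Sequence[float]], low: float = 10.0,
               high: float = 90.0) -> List[Tuple[float, float]]:
    """Lower and upper percentile of every data column of the reference records."""
    return [(percentile(col, low), percentile(col, high)) for col in zip(*reference)]


def to_triple(record: Sequence[float], limits: List[Tuple[float, float]]) -> str:
    """Translate one data record into its '-', '0', '+' signature."""
    return "".join("-" if v < lo else "+" if v > hi else "0"
                   for v, (lo, hi) in zip(record, limits))


def disease_mask(signatures: List[str]) -> str:
    """Most frequent triple matrix character in each column of a learning set."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*signatures))


def classify(signature: str, masks: Dict[str, str]) -> List[str]:
    """Categories with the highest positional coincidence (several on a tie)."""
    hits = {cat: sum(a == b for a, b in zip(signature, mask))
            for cat, mask in masks.items()}
    best = max(hits.values())
    return [cat for cat, n in hits.items() if n == best]


# Toy data (invented): two data columns, 11 reference records, 3 risk records.
normals = [(i, 100 + 10 * i) for i in range(11)]
risk = [(12, 150), (11, 95), (15, 160)]
limits = thresholds(normals)
masks = {
    "N": disease_mask([to_triple(r, limits) for r in normals]),
    "R": disease_mask([to_triple(r, limits) for r in risk]),
}
print(limits, masks)
print(classify(to_triple((13, 140), limits), masks))  # unique classification
print(classify("-0", masks))                          # tie: multiple classification
```

A signature that coincides equally well with both masks is returned with both categories, which corresponds to a multiple classification such as (N,R).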
![]() fig.6A/6B myocardial risk assessment |
The selected parameters are statistically significantly different between normal individuals and myocardial risk patients (fig.8).
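The percentages shown in a two category classification matrix such as fig.4 or fig.6A/6B can be computed as in the following sketch. The function name `two_class_summary` and the category labels "N" (normal) and "R" (risk) are assumptions for illustration, the example records are invented, and multiple classifications such as (N,R) are not handled here.

```python
from typing import Dict, Sequence


def two_class_summary(known: Sequence[str], predicted: Sequence[str],
                      negative: str = "N", positive: str = "R") -> Dict[str, float]:
    """Specificity, sensitivity and predictive values in percent."""
    pairs = list(zip(known, predicted))
    tn = sum(k == negative and p == negative for k, p in pairs)
    fp = sum(k == negative and p == positive for k, p in pairs)
    fn = sum(k == positive and p == negative for k, p in pairs)
    tp = sum(k == positive and p == positive for k, p in pairs)

    def percent(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else float("nan")

    return {
        "specificity": percent(tn, tn + fp),  # normals recognized as normal
        "sensitivity": percent(tp, tp + fn),  # risk patients recognized as risk
        "negative_predictive_value": percent(tn, tn + fn),
        "positive_predictive_value": percent(tp, tp + fp),
    }


# Invented example: 4 normal and 5 risk records, one error in each group.
known = ["N"] * 4 + ["R"] * 5
predicted = ["N", "N", "N", "R", "R", "R", "R", "R", "N"]
summary = two_class_summary(known, predicted)
ideal = two_class_summary(known, known)
print(summary)
print(ideal)
```

In the ideal case all four values are 100% and the off-diagonal boxes of the matrix are empty.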
- The unknown test set of patients was defined prior to learning as data records 1, 5, 10, 15, etc. of the normal individuals as well as of the myocardial risk patients and remained hidden from the learning process. As an embedded test set, it was measured under the same conditions as the samples of the learning set.
- The principle of data pattern classification
(fig.9)
assures classification accuracy as well
as classification multiplicity to take care of the many
combinatorial possibilities between genotype and
internal or external exposure influences on observed
molecular cell phenotypes.
- Data patterns represent heat maps
consisting of (-)=decreased, (0)=unchanged and
(+)=increased characters instead of a green, yellow or red
color code.
- The optimization of the initial classification,
containing all data columns
(fig.10A),
is achieved by maximizing the sum of values in the
diagonal boxes of the classification matrix.
The classification process reduces the number of classification
parameters in this example from initially 44
to finally 5 data columns by the exclusion of non-informative
parameters
(fig.10B).
- Initially, the most frequent triple matrix character of
each database column is inserted into the disease
classification mask of each classification category
(fig.11A).
Data columns without discrimination between classification
categories are removed
(fig.11B).
- Data records (patient classification masks) are classified according
to their highest positional coincidence
with any one of the disease classification masks.
- Multiple classifications occur at equal positional
coincidence with more than one of the disease classification
masks
(for example record #17 (N,R)
fig.11A).
They represent transitional states in disease
or classification errors, for example in small
learning sets or at comparatively small differences
between the different classification categories.
They also occur, as just shown, in the first triple matrix,
that is, prior to the removal of non-informative
data columns.
- The iterative optimization is achieved
by the temporary removal of a single database column or of variable
combinations of two database columns from the classification
process, followed by reclassification. It is recorded whether the
temporary absence of the column(s) has improved or deteriorated the
classification result.
- The columns are then reinserted and the next data columns are temporarily
removed until the positive or negative contribution of all data
columns to the classification process is known.
- Data columns having improved the classification result by their
absence are removed from further consideration. The remaining data
columns represent the optimized disease classification masks.
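The optimization by temporary column removal can be sketched as follows. This is a simplified illustration under assumptions: only single columns are removed (not pairs), a record counts as correct only when it is uniquely assigned to its known category, and the names `diagonal_sum` and `optimize_columns` as well as the toy records are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str]  # (known category, triple matrix signature)


def diagonal_sum(records: Sequence[Record], masks: Dict[str, str],
                 columns: Sequence[int]) -> float:
    """Sum of the diagonal percentages of the classification matrix,
    using only the given column positions."""
    correct = {cat: 0 for cat in masks}
    total = {cat: 0 for cat in masks}
    for known, signature in records:
        hits = {cat: sum(signature[i] == mask[i] for i in columns)
                for cat, mask in masks.items()}
        best = max(hits.values())
        winners = [cat for cat, n in hits.items() if n == best]
        total[known] += 1
        correct[known] += winners == [known]  # ties count as not correct
    return sum(100.0 * correct[cat] / total[cat] for cat in masks if total[cat])


def optimize_columns(records: Sequence[Record], masks: Dict[str, str],
                     n_columns: int) -> List[int]:
    """Drop every column whose temporary absence improves the diagonal sum."""
    columns = list(range(n_columns))
    baseline = diagonal_sum(records, masks, columns)
    improved_by_absence = [
        i for i in columns
        if diagonal_sum(records, masks, [j for j in columns if j != i]) > baseline
    ]
    return [i for i in columns if i not in improved_by_absence]


# Invented learning set: columns 1 and 3 carry deviating characters.
masks = {"N": "0000", "R": "+++-"}
records = [("N", "0000"), ("N", "0000"), ("N", "0+0-"),
           ("R", "+++-"), ("R", "+++-"), ("R", "+0+0")]
kept = optimize_columns(records, masks, 4)
print(kept, diagonal_sum(records, masks, kept))
```

In this toy example two records are tied between both masks while all four columns are used; after removal of the two columns whose absence improves the result, every record is uniquely and correctly classified.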
- The data records of the learning set
are reclassified against the disease classification masks to
assess the achieved discrimination for the various classification categories
(fig.12).
- The data records of the unknown test (validation) set
are subsequently classified to verify the robustness of
classification for unknown data records
(fig.13).
- The classification operations described in chapters 3 and 4
are performed unsupervised, that is automatically, by the CLASSIF1
algorithm. No human interference is required once the classification
process has been started.
- The classification of data records, unknown to a learned
classifier, is important to avoid erroneous classifications
of random statistical aberrations.
- To assess the susceptibility of the CLASSIF1 algorithm to
the detection of random statistical aberrations, a 133 column
wide data set of 40 data records was generated using a
random number generator and kindly made available by
Dr.W.Meyer and PD Dr.G.Haroske
(Pathologisches Institut, Universität Dresden). The data columns had
different means and coefficients of variation between 0.97-25.7%
(CV=100*standard deviation/mean). For the classification, each second
data record was assigned in sequence to either the arbitrary category#1
or category#2, resulting in 20 category#1 and 20
category#2 records.
- Records 1,5,10,15,20 of each category were assigned to the
unknown test (validation) set prior to the learning phase. They remained
hidden during the learning phase. This left a learning set of
15 category#1 and 15 category#2 data records.
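The layout of this random number experiment can be reproduced with a few lines of Python. The sketch below only covers the bookkeeping (CV calculation, alternating category assignment, hidden test records). The Gaussian generator, the use of the sample standard deviation and all function names are assumptions, since the original random number generator is not described.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = 100 * standard deviation / mean (sample SD assumed)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def random_column(mean: float, cv_percent: float,
                  n_records: int = 40, seed: int = 0) -> List[float]:
    """One random data column; a Gaussian distribution is an assumption."""
    rng = random.Random(seed)
    return [rng.gauss(mean, mean * cv_percent / 100.0) for _ in range(n_records)]


def split_records(n_records: int = 40,
                  hidden: Sequence[int] = (1, 5, 10, 15, 20)
                  ) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Alternate records between category 1 and 2, then hide the given
    positions within each category as unknown test set."""
    by_category: Dict[int, List[int]] = {1: [], 2: []}
    for record in range(1, n_records + 1):
        by_category[1 if record % 2 else 2].append(record)
    test = {cat: [recs[p - 1] for p in hidden] for cat, recs in by_category.items()}
    learning = {cat: [r for r in recs if r not in test[cat]]
                for cat, recs in by_category.items()}
    return learning, test


learning, test = split_records()
print(len(learning[1]), len(learning[2]), test)
print(coefficient_of_variation([9.0, 10.0, 11.0]))
```

The split yields 15 learning records and 5 hidden test records per category, as described above.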
- The classification of the learning set by the CLASSIF1 algorithm
provided a specificity of 100% for the correct
recognition of category#1 records at a sensitivity of 40%
for the recognition of category#2 data records
(fig.14A).
Parameters #29,#77,#133
(fig.15)
were selected with means ± standard deviation (SD) of 73.2±6.6/72.8±11.2,
53.3±8.0/59.0±11.3, 25.0±3.7/25.0±5.9, with no statistically significant
difference between the category#1 and category#2 means of the selected
data columns.
- The classification of the unknown test set of records
resulted in a low specificity of 33.7% for the correct
recognition of category#1 records and a low sensitivity of
20.0% for category#2 records
(fig.14B).
- The triple matrix display of the
learning set
(fig.16)
and of the unknown test set
(fig.17)
shows the low quality of the
classification of the random number data set.
- The result emphasizes the well known fact that the
discrimination of random statistical aberrations in a learning
set collapses typically during the classification of
unknown test sets.
- This contrasts with the robustness of classification in the case of
existing molecular differences, as in the discrimination of
risk patients for myocardial infarction by increased thrombocyte activation
antigens CD62, CD63 and thrombospondin
(fig.5),
where learning and test set patients are correctly classified in >95% of the cases
and statistically significant differences
of the selected parameters exist between both patient groups
(fig.7).
- Triple matrix classifiers are inherently standardized onto the
data of
patient reference samples during the classification process
(standardized multiparameter data classification (SMDC)).
Classifiers are therefore comparable in an instrument and
laboratory independent way, in case no differences between the
reference groups of different institutions are detected by the
CLASSIF1 algorithm.
This is advantageous for consensus formation e.g. on leukemia,
HIV and thrombocyte classifications by immunophenotyping.
- Reference sample data may be obtained in the case of blood
cells from healthy individuals of similar age and sex distribution
such as from blood donor samples (biological standardization).
This overcomes comparability issues between standardized
measurements on different flow cytometers in different institutions
using labelled indicator molecules of the same
specificity but from different sources (instrument and reagent
standardization).
- The systematic analysis of patient
classification masks provides information on individual
genotypic and exposure influences on
expressed data patterns. Such analysis may prove useful for
the development of a periodic system of cells,
that is, a relational cell classification
system in analogy to the periodic system of elements.
Using the molecular properties of tissue or hematopoietic
stem cells as reference, different
cell types and their activity states could be compared in a
standardized way for example during disease development,
under therapy but also during cell division, cell differentiation
or cell migration.
- The performance of triple matrix classifiers depends
on intralaboratory precision rather than on interlaboratory
accuracy since measurement accuracy cancels out
within certain limits through the normalization of the experimental
values onto the respective mean values of the reference samples in each
database column. Reference groups are typically constituted
from age and sex matched individuals.
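Why a laboratory specific calibration factor cancels out can be shown with a small numerical example. The sketch assumes a purely multiplicative difference between two instruments (the hypothetical factor `gain`); the function name `normalise` and all values are invented for illustration, and additive offsets would not cancel in this way.

```python
from statistics import mean
from typing import List, Sequence


def normalise(values: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Express values relative to the mean of the laboratory's own reference samples."""
    reference_mean = mean(reference)
    return [v / reference_mean for v in values]


# Laboratory A (invented values for one database column).
reference_a = [90.0, 100.0, 110.0]
patients_a = [80.0, 150.0]

# Laboratory B measures the same samples with twice the signal (assumed gain).
gain = 2.0
reference_b = [gain * v for v in reference_a]
patients_b = [gain * v for v in patients_a]

relative_a = normalise(patients_a, reference_a)
relative_b = normalise(patients_b, reference_b)
print(relative_a, relative_b)
```

Both laboratories obtain the same relative values, so only the precision within each laboratory affects the classification.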
The earlier developed multifactorial classification software (DIAGNOS1, 1987) initially determines the 5 most discriminatory database columns by percentile analysis, followed by the calculation of the 22 multifactors of these columns and finally of the 5 most discriminatory multifactors. Aneuploidy (only in the case of tumors), together with the best discriminator of the 5 initial columns in conjunction with the 2 best discriminating multifactors of the altogether 27-column database, is evaluated by the software. |
The multifactorial classification (fig.18) does not reach the results of data pattern classification (fig.6A/6B), although similar parameters are selected. This was also observed for other data sets. The 5 most discriminatory data columns of each of the IgG, CD62, CD63 and thrombospondin databases (4x5=20) are available for data pattern analysis, while multifactorial classification is restricted to the 5 most discriminatory of them plus the 22 derived multifactors, that is altogether 27 data columns. Despite the higher column number, less information is available than in data pattern analysis, where all 20 most discriminatory columns were included.
- FCS1.0 and 2.0 list mode files or dBase3 database exports from database (e.g. Access) or table calculation programs (e.g. Excel) are classified by the CLASSIF1 algorithm in various clinical situations:
- the CLASSIF1 algorithm provides access to
predictive medicine
with >95% correct prediction of disease progression
in individual patients as well as to
standardized diagnostic classifications.
- the CLASSIF1 approach facilitates the elaboration of
interlaboratory consensus classifiers in complex
multiparameter data sieving or data mining analysis.
- as a practical consequence, diseases
can be classified at institutions where sufficient learning sets cannot
be generated in reasonable time or where costly investigations are
necessary to establish appropriate learning sets.
- furthermore the molecular and biochemical properties of
many body cell systems during disease can be compared
by standardized classification e.g. blood leukocytes versus
tissue or effusion leukocytes.
![]() ISBN: none |
![]() ISSN: 1091-2037 |
![]() ISBN: 1-890473-02-2 |
![]() ISBN: 1-890473-03-0 |
![]() ISBN: 1-890475-05-7 |
CD6 2002
ISBN: 0-97117498-3-3 |
CD7 2003
ISBN: 0-97117498-8-4 |
CD8 2004
ISBN: 1-890473-C6-5 |
© 2025 G.Valet |