Integrating supervised and unsupervised learning in genomics applications.
Most histologic classifications of major cancers include large heterogeneous classes. Identification of clinically relevant subgroups within these classes is one of the most important challenges in cancer genomics. In this context, our approach is to seek undiscovered subclasses within broad classes, exploiting a potential biological connection between the unclassified group and known classifications established for tumors in other organs.
Statistically, this problem can be framed as a unified semi-supervised analysis, in which a known classification is exported to guide the clustering procedure.
Our implementation is a Bayesian parametric model based on normal mixtures and amenable to MCMC computation.
Combinatorial mixtures of singular distributions encode the a priori assumptions. Different independence scenarios are analysed.
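To make the semi-supervised idea concrete, the following is a minimal sketch of a Gibbs sampler for a normal mixture in which some observations carry known class labels and the rest are clustered. It is not the model described here: it assumes univariate data, a known common standard deviation, a vague normal prior on the component means, a flat Dirichlet prior on the weights, and at least one labeled observation per component. It does not include the combinatorial mixture priors. The function name `gibbs_semisupervised_mixture` and all of its parameters are hypothetical.

```python
import numpy as np


def gibbs_semisupervised_mixture(x, labels, n_components=2, n_iter=500,
                                 sigma=1.0, prior_var=100.0, seed=0):
    """Gibbs sampler for a 1-D normal mixture with partially known labels.

    x      : 1-D array of observations.
    labels : integer array, class index for labeled points, -1 if unlabeled.
    Assumes (for illustration only) a known common sigma, a N(0, prior_var)
    prior on each mean, and at least one labeled point per component.
    Returns the sampled means and the per-point allocation frequencies.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels, dtype=int)
    known = labels >= 0

    # Start each component mean at the mean of its labeled observations.
    mu = np.array([x[labels == k].mean() for k in range(n_components)])
    w = np.full(n_components, 1.0 / n_components)

    mu_draws = np.empty((n_iter, n_components))
    alloc_counts = np.zeros((x.size, n_components))

    for t in range(n_iter):
        # 1. Sample allocations; labeled points keep their known class.
        logp = np.log(w) - 0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random(x.size)
        drawn = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        drawn = np.minimum(drawn, n_components - 1)
        z = np.where(known, labels, drawn)

        # 2. Sample each mean from its conjugate normal posterior.
        for k in range(n_components):
            members = x[z == k]
            post_var = 1.0 / (1.0 / prior_var + members.size / sigma ** 2)
            post_mean = post_var * members.sum() / sigma ** 2
            mu[k] = rng.normal(post_mean, np.sqrt(post_var))

        # 3. Sample mixture weights from their Dirichlet posterior.
        counts = np.bincount(z, minlength=n_components)
        w = rng.dirichlet(1.0 + counts)

        mu_draws[t] = mu
        alloc_counts[np.arange(x.size), z] += 1

    return mu_draws, alloc_counts / n_iter
```

In this sketch the labeled observations anchor each component, which is one simple way a known classification can inform the clustering of unlabeled samples. The returned allocation frequencies estimate each observation's posterior class membership probabilities.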
The solution is illustrated using data on molecular classification of breast and lung cancers.