NOISE-ROBUST SOFT CLUSTERING OF GENE EXPRESSION TIME-COURSE DATA

MATTHIAS E. FUTSCHIK

Institute of Theoretical Biology, Humboldt-University, Invalidenstr. 43, 10115 Berlin, Germany

Department of Information Science, University of Otago, PO Box 56, Dunedin, New Zealand

and

BRONWYN CARLISLE

Department of Biochemistry, University of Otago, PO Box 56, Dunedin, New Zealand

https://doi.org/10.1142/S0219720005001375

Abstract

Clustering is an important tool in microarray data analysis. This unsupervised learning technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms applied so far produce hard partitions of the data, i.e. each gene is assigned to exactly one cluster. Hard clustering is favourable if clusters are well separated. However, this is generally not the case for microarray time-course data, where gene clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise.

To overcome the limitations of hard clustering, we applied soft clustering, which offers several advantages for researchers. First, it generates accessible internal cluster structures, i.e. it indicates how well corresponding clusters represent genes. This can be used for a more targeted search for regulatory elements. Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more robust to noise and a priori pre-filtering of genes can be avoided. This prevents the exclusion of biologically relevant genes from the data analysis. Soft clustering was implemented here using the fuzzy c-means algorithm. Procedures to find optimal clustering parameters were developed. A software package for soft clustering has been developed based on the open-source statistical language R. The package, called Mfuzz, is freely available.
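
As an illustration of the soft clustering idea described above, the following is a minimal sketch of the standard fuzzy c-means iteration in plain Python. It is not the Mfuzz implementation or its API: the function name `fuzzy_c_means`, its parameters, the Euclidean distance, the random initialisation and the default fuzzifier `m = 2.0` are all assumptions made for this example. Each gene receives a membership value between 0 and 1 for every cluster, rather than a single hard assignment.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Soft-cluster `points` (a list of equal-length float lists) into `c` clusters.

    Returns (centers, memberships). memberships[i][k] is the degree to which
    point i belongs to cluster k; each row sums to 1. Hypothetical helper,
    not part of any published package.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Start from random memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centre update: mean of all points, weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / wsum
                            for d in range(dim)])

        # Membership update from the relative distances to all centres.
        shift = 0.0
        for i in range(n):
            dist = [math.dist(points[i], centers[k]) for k in range(c)]
            if min(dist) == 0.0:
                # The point coincides with a centre: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                new_row = [h / sum(hits) for h in hits]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
                total = sum(inv)
                new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u


# Toy expression profiles over three time points (invented for illustration).
profiles = [
    [0.0, 1.0, 2.0], [0.1, 1.1, 2.1],  # rising
    [2.0, 1.0, 0.0], [2.1, 0.9, 0.1],  # falling
    [1.0, 1.0, 1.0],                   # flat, lies between the two groups
]
centers, memberships = fuzzy_c_means(profiles, c=2)
for profile, row in zip(profiles, memberships):
    print(profile, [round(v, 2) for v in row])
```

In this toy data the flat profile sits between the two groups, so its membership is split between both clusters, while the rising and falling profiles are dominated by one cluster each. A hard partition would force the flat profile into one cluster and discard that information.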
