PROTEIN CLASSIFICATION ARTIFICIAL NEURAL SYSTEM: A FILTER PROGRAM FOR DATABASE SEARCH
A neural network classification method has been developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), is implemented on a Cray Y-MP8/864 supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function for sequence encoding and modular back-propagation networks for classification. The system was developed with the first 2,724 entries in 690 superfamilies of the annotated PIR (Protein Identification Resource) protein sequence database. Three prediction sets were used to evaluate system performance. The first consists of 651 annotated entries randomly chosen from the 690 superfamilies. The second consists of 482 unclassified entries from the preliminary PIR database, whose superfamilies were identified by the fasta, blastp and sp database search methods. The third is the subset of the second set restricted to superfamilies of more than 20 entries. At a low cut-off score of 0.01, the sensitivity is 92%, 82% and 100%, respectively, for the three prediction sets. At a high cut-off score of 0.9, on the other hand, specificity approaches 100% at the cost of reduced sensitivity. The classification accuracy is determined by three factors: the degree of similarity, the sequence length, and the size of the superfamily. Classification by the neural networks is fast (less than 0.5 Cray CPU second per sequence on a full-scale system). The speed should not be constrained by database size, because the search time grows with the number of superfamilies, which is likely to remain low. Therefore, ProCANS can be used as a filter program to provide a reduced search space and speed up database searches.
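As an illustration of the two ideas above (n-gram encoding of a sequence, and filtering by a cut-off score), the following is a minimal sketch and not the ProCANS implementation. The 20-letter amino acid alphabet, the choice of n = 2, the normalization of counts to frequencies, and the function names `ngram_encode` and `filter_superfamilies` are all assumptions made for the example; the superfamily scores in the usage note are invented placeholders standing in for network outputs.

```python
from itertools import product

# Assumed alphabet: the 20 standard amino acids (not specified in the text).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_encode(sequence, n=2, alphabet=AMINO_ACIDS):
    """Map a sequence to a fixed-length vector of n-gram frequencies.

    Each possible n-gram over the alphabet is assigned one vector position,
    so the vector length is len(alphabet) ** n regardless of sequence length.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}
    counts = [0] * len(index)
    for start in range(len(sequence) - n + 1):
        pos = index.get(sequence[start:start + n])
        if pos is not None:  # skip n-grams containing unknown residues
            counts[pos] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def filter_superfamilies(scores, cutoff=0.01):
    """Keep superfamilies whose score reaches the cut-off, best first.

    `scores` is a hypothetical mapping from superfamily name to a
    classifier output in [0, 1].
    """
    kept = [name for name, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda name: scores[name], reverse=True)
```

For example, `ngram_encode("ACAC")` yields a 400-element vector, and `filter_superfamilies({"globin": 0.95, "kinase": 0.005, "cytochrome": 0.2}, cutoff=0.01)` keeps only `globin` and `cytochrome`. A low cut-off retains more candidates for a subsequent database search, while a high cut-off retains only the most confident ones.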
The rapid superfamily identification provided by ProCANS would be particularly valuable for the organization of protein sequence databases and for gene recognition in large sequencing projects. A current extension to ProCANS is the incorporation of motif information to further improve its sensitivity. The design concept has also been applied to the classification of nucleic acid sequences. A preliminary result showed 96% accuracy for 16S ribosomal RNA classification. The software tool is generally applicable to any second-generation database that is organized according to family relationships.