World Scientific

CLASSIFICATION OF PROTEIN SEQUENCES BASED ON WORD SEGMENTATION METHODS

    https://doi.org/10.1142/9781848161092_0020
    Cited by: 6 (Source: Crossref)
    Abstract:

    Protein sequences hold rich information about protein function, structural families, and evolution. Classifying protein sequences into functional groups or families based on their sequence patterns has attracted considerable research effort over the last decade. A key issue in these classification systems is how to interpret and represent protein sequences, since the representation largely determines classifier performance. Inspired by text classification and Chinese word segmentation techniques, we propose a segmentation-based feature extraction method. The extracted features include selected words, i.e., substrings of the sequences, as well as motifs specified in public databases. These words and motifs are segmented out of each sequence, and their occurrence frequencies are recorded as the feature vector values. We conducted experiments on two protein data sets: a set of SCOP families and the GPCR family. Experiments on the classification of SCOP protein families show that the proposed method not only yields an extremely condensed feature set but also achieves higher accuracy than methods based on the whole k-spectrum feature space. It also performs comparably to the most powerful classifiers for GPCR level I and level II subfamily recognition, with 92.6% and 88.8% accuracy, respectively.
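
The idea of segmenting words out of a sequence and recording their occurrence frequencies can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which segmentation algorithm is used, so the sketch substitutes forward maximum matching, a common dictionary-based technique from Chinese word segmentation. The vocabulary `VOCAB`, the example sequence, and the function names `segment` and `feature_vector` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def segment(sequence, dictionary, max_len):
    """Forward maximum matching (an assumed stand-in for the paper's method).

    At each position, take the longest dictionary word that starts there;
    if none matches, skip a single residue.
    """
    words = []
    i = 0
    while i < len(sequence):
        longest = min(max_len, len(sequence) - i)
        # Try the longest candidate first, then shrink one residue at a time.
        for length in range(longest, 0, -1):
            candidate = sequence[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts at this position.
            i += 1
    return words


def feature_vector(sequence, vocabulary):
    """Return occurrence counts of segmented words, in vocabulary order."""
    dictionary = set(vocabulary)
    max_len = max(len(word) for word in vocabulary)
    counts = Counter(segment(sequence, dictionary, max_len))
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary of selected words and motifs (illustration only).
VOCAB = ["GKT", "LV", "AGS", "DE"]

if __name__ == "__main__":
    print(segment("MLVGKTAGSLVDE", set(VOCAB), 3))
    print(feature_vector("MLVGKTAGSLVDE", VOCAB))
```

The resulting vector has one entry per vocabulary word, whereas a whole k-spectrum representation over the 20 standard amino acids has 20^k dimensions, which is the sense in which a segmentation-based feature set can be far more condensed.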