Feature Word Vector Based on Short Text Clustering
This paper proposes a short text clustering algorithm based on feature word vectors to address the poor clustering performance caused by the sparse features and rapid updating of short texts. First, a feature word extraction formula based on part-of-speech (POS) weighting is defined and used to extract the feature words that represent each short text. Second, word vectors that capture the semantics of the feature words are obtained by training the Continuous Skip-gram Model on a large-scale corpus. Finally, Word Mover’s Distance (WMD) is used to calculate the similarity between short texts, and this similarity drives a hierarchical clustering algorithm. Evaluation on four test datasets shows that the proposed algorithm significantly outperforms traditional clustering algorithms, with a mean F value that is on average 55.43% higher than that of the second-best method.
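The three stages described above can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the POS weights in `POS_WEIGHT` and the scoring rule (term frequency times POS weight) are hypothetical stand-ins for the extraction formula, which is not given here; the two-dimensional `vectors` are toy values standing in for trained Skip-gram embeddings; and the distance is the relaxed WMD lower bound (each word moves all its mass to its nearest counterpart) rather than exact WMD, which would require an optimal transport solver. Average linkage is likewise an assumed choice for the hierarchical step.

```python
import math
from collections import Counter

# Hypothetical POS weights; the actual formula and values are not stated here.
POS_WEIGHT = {"NOUN": 1.0, "VERB": 0.6, "ADJ": 0.4}

# Toy 2-D embeddings standing in for trained Skip-gram word vectors.
vectors = {
    "goal": (1.0, 0.0), "match": (0.9, 0.1), "striker": (0.8, 0.0),
    "stock": (0.0, 1.0), "market": (0.1, 0.9), "shares": (0.0, 0.8),
}

def extract_feature_words(tagged_tokens, vectors, top_k=2):
    """Score words by term frequency times POS weight; keep the top_k with vectors."""
    scores = Counter()
    for word, pos in tagged_tokens:
        if word in vectors:
            scores[word] += POS_WEIGHT.get(pos, 0.0)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, s in ranked[:top_k] if s > 0]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(words_a, words_b, vectors):
    """Relaxed WMD (a lower bound on WMD) with uniform word weights."""
    def one_way(src, dst):
        # Each source word sends all its mass to the nearest destination word.
        return sum(min(euclidean(vectors[s], vectors[d]) for d in dst)
                   for s in src) / len(src)
    return max(one_way(words_a, words_b), one_way(words_b, words_a))

def agglomerative(docs, vectors, n_clusters):
    """Average-linkage agglomerative clustering over pairwise relaxed WMD."""
    n = len(docs)
    dist = [[relaxed_wmd(docs[i], docs[j], vectors) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters.
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Four toy short texts, already tokenized and POS-tagged.
tagged_docs = [
    [("the", "DET"), ("striker", "NOUN"), ("scored", "VERB"), ("goal", "NOUN")],
    [("a", "DET"), ("match", "NOUN"), ("goal", "NOUN")],
    [("stock", "NOUN"), ("market", "NOUN"), ("fell", "VERB")],
    [("shares", "NOUN"), ("market", "NOUN"), ("rose", "VERB")],
]
features = [extract_feature_words(doc, vectors) for doc in tagged_docs]
clusters = agglomerative(features, vectors, n_clusters=2)
print(clusters)
```

On this toy input the two sports texts and the two finance texts end up in separate clusters. In practice the embeddings would come from a Skip-gram model trained on a large corpus, and an exact WMD solver could replace `relaxed_wmd` without changing the clustering loop.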