A novel clustering algorithm for large-scale text collection and its incremental version

Lei Chen

doi:10.5755/j01.itc.45.2.8666

Authors

Lei Chen Harbin Institute of Technology

DOI:

https://doi.org/10.5755/j01.itc.45.2.8666

Keywords:

vector compression, incremental clustering, self-organizing-mapping, neuron model

Abstract

The rapid advance of internet technology has brought two challenges: the volume of information is exploding, and new information appears rapidly. Clustering can help users analyze information automatically, but traditional clustering algorithms are suitable only for small-scale, stable text collections. To address this problem, this paper proposes a novel clustering algorithm based on vector compression, designed for large-scale text collections (LDVC), and its incremental version (I-LDVC). LDVC selects related features to compress the feature set, and adopts the iterative training idea of self-organizing mapping (SOM) to optimize the selection approach. When new texts appear, I-LDVC selects small samples from the original texts to adjust the neuron model and perform incremental clustering. To prevent overfitting to the newly added texts, I-LDVC adjusts the weights of the samples as training proceeds. Experimental results demonstrate that LDVC achieves better performance and lower time complexity on large-scale text collections, and that I-LDVC clusters unstable text collections very well.
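The abstract does not state the feature-selection rule or the update equations, so the following Python sketch only illustrates the general idea under loudly stated assumptions. It is not the LDVC or I-LDVC algorithm from the paper. It assumes that compression keeps the k features with the largest total weight, that training uses winner-only SOM-style updates with a linearly decaying learning rate, and that the incremental step retrains on a small sample of original texts plus the new texts with per-sample weights. The names `compress`, `project`, `nearest`, and `train` are hypothetical.

```python
import math


def compress(vectors, k):
    """Keep the k features with the largest total weight (assumed selection rule)."""
    dim = len(vectors[0])
    totals = [sum(v[j] for v in vectors) for j in range(dim)]
    keep = sorted(sorted(range(dim), key=lambda j: -totals[j])[:k])
    return [project(v, keep) for v in vectors], keep


def project(vector, keep):
    """Reduce a full feature vector to the retained feature indices."""
    return [vector[j] for j in keep]


def nearest(neurons, x):
    """Index of the neuron closest to x in Euclidean distance."""
    return min(range(len(neurons)), key=lambda i: math.dist(neurons[i], x))


def train(neurons, samples, weights, epochs=20, lr0=0.5):
    """SOM-style iterative training: move only the winning neuron toward each
    sample, scaled by that sample's weight and a decaying learning rate."""
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        for x, w in zip(samples, weights):
            win = nearest(neurons, x)
            neurons[win] = [n + lr * w * (xi - n) for n, xi in zip(neurons[win], x)]
    return neurons


# Toy term-weight vectors: two topics plus two low-weight noise features.
docs = [
    [5, 4, 0, 0, 1, 0],
    [4, 5, 0, 0, 0, 1],
    [0, 0, 5, 4, 1, 0],
    [0, 0, 4, 5, 0, 1],
]
compressed, keep = compress(docs, k=4)

# Batch clustering: one neuron per cluster, seeded from two documents.
neurons = [list(compressed[0]), list(compressed[2])]
train(neurons, compressed, weights=[1.0] * len(compressed))
labels = [nearest(neurons, x) for x in compressed]

# Incremental step (assumed scheme): retrain on a small sample of the original
# texts plus the new text, giving the new text a smaller weight.
new_doc = project([5, 5, 0, 0, 0, 0], keep)
train(neurons, [compressed[0], compressed[2], new_doc], weights=[1.0, 1.0, 0.5])
new_label = nearest(neurons, new_doc)
```

Down-weighting the new sample is one simple way to keep the neurons from drifting entirely toward newly added texts; the paper's actual weight schedule may differ.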
