Hybrid Neural Document Clustering Using Guided Self-organisation and WordNet

Chihli Hung , Stefan Wermter , Peter Smith

IEEE Intelligent Systems, Volume 19, Number 2, pages 68--77, Mar 2004

Document clustering is text processing that groups documents with similar concepts. It's usually considered an unsupervised learning approach because there's no teacher to guide the training process, and topical information is often assumed to be unavailable. In contrast, document classification is usually considered a supervised learning approach because preclassified information guides the training process. If, however, the corpus offers topical information, both classification and clustering techniques can take advantage of words' relationships to different topical concepts with different weights. In this case, a guided neural network based on topical information lets users exploit the domain knowledge and reduces the gap between human topical concepts and data-driven clustering decisions. The self-organizing map (SOM) is a network for guided or unguided clustering. The SOM combines nonlinear projection, vector quantization (VQ), and data-clustering functions [1]. As Teuvo Kohonen and colleagues point out, "one should provide the different words with such weights that reflect their significance or power of discrimination between the topics" [2]. They suggest using the vector space model (VSM) to transform documents to vectors if no topical information is provided. However, they also state: "If, however, the documents have some topical classification which contains relevant information, the words can also be weighted according to their Shannon entropy over the set of document classes" [2]. In fact, their WebSOM project uses a modified VSM that includes topical information. Our guided self-organization approach is motivated in a similar manner, but we further integrate topical and semantic information from WordNet. Because a document-training set with preclassified information implies relationships between a word and its preference class, we propose a novel document vector representation approach to extract these relationships for document clustering.
Furthermore, by merging statistical methods, competitive neural models, and semantic relationships from symbolic WordNet, our hybrid learning approach is robust and scales up to a real-world task of clustering 100,000 news documents.
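
The abstract quotes the idea of weighting words by their Shannon entropy over the set of document classes. As a minimal illustrative sketch only (not the authors' representation or the WebSOM implementation), the following Python function computes such a weight per word. The function name `entropy_weights`, the choice of base-2 logarithms, and the convention weight = H_max - H(word) are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy_weights(docs, labels):
    """Hypothetical helper: weight each word by H_max - H(word).

    docs   -- list of token lists, one per document
    labels -- list of class labels, aligned with docs
    H(word) is the Shannon entropy (base 2) of the word's occurrence
    distribution over classes; H_max = log2(number of classes).
    A word concentrated in one class gets the largest weight; a word
    spread evenly over all classes gets a weight of zero.
    """
    classes = set(labels)
    if len(classes) < 2:
        raise ValueError("need at least two classes")

    # Count how often each word occurs in each class.
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1

    h_max = math.log2(len(classes))
    weights = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in per_class.values()
        )
        weights[word] = h_max - entropy
    return weights


# Toy usage with made-up data (assumption, not from the paper).
toy_docs = [["stock", "market", "the"], ["goal", "match", "the"]]
toy_labels = ["finance", "sport"]
toy_weights = entropy_weights(toy_docs, toy_labels)
```

In this toy setting, a word that occurs only in one class receives the maximum weight, while a word occurring equally in both classes receives zero, which is the discriminative behaviour the quoted passage describes.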

@Article{HWS04a, 
 	 author =  {Hung, Chihli and Wermter, Stefan and Smith, Peter},  
 	 title = {Hybrid Neural Document Clustering Using Guided Self-organisation and WordNet}, 
 	 journal = {IEEE Intelligent Systems},
 	 number = {2},
 	 volume = {19},
 	 pages = {68--77},
 	 year = {2004},
 	 month = {Mar},
 	 publisher = {IEEE},
 	 doi = {}, 
 }