Hybrid Neural Document Clustering Using Guided Self-organisation and WordNet
Document clustering is a text-processing task that groups documents with similar concepts.
It is usually considered an unsupervised learning approach because there is no
teacher to guide the training process, and topical information is often assumed to be unavailable. In contrast, document classification is usually considered a supervised learning approach because preclassified information guides
the training process. If, however, the corpus offers
topical information, both classification and clustering techniques can take advantage of words' relationships to different topical concepts with different
weights. In this case, a guided neural network based
on topical information lets users exploit domain
knowledge and reduces the gap between human topical concepts and data-driven clustering decisions.
The self-organizing map (SOM) is a network for guided or
unguided clustering. SOM combines nonlinear projection, vector quantization (VQ), and data-clustering
functions.1 As Teuvo Kohonen and colleagues point
out, the different words should be given weights
that reflect their significance or power
of discrimination between the topics.2 They suggest
using the vector space model (VSM) to transform
documents to vectors if no topical information is
provided. However, they also state: "If, however, the
documents have some topical classification which
contains relevant information, the words can also be
weighted according to their Shannon entropy over
the set of document classes."2 In fact, their WebSOM
project uses a modified VSM that includes topical
information.
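To illustrate that entropy-based weighting idea, the following is only a minimal sketch (not the WebSOM implementation): each word's Shannon entropy over the document classes is estimated from a labelled corpus, and words concentrated in few classes receive a high weight. The particular weighting formula used here, H_max minus the word's entropy, is one common variant and an assumption on our part.

```python
import math
from collections import Counter, defaultdict

def entropy_weights(docs, labels):
    """Weight each word by how unevenly it is spread over document classes.

    docs   : list of token lists, one per document
    labels : list of class labels, aligned with docs
    Returns {word: weight}; a low Shannon entropy over the classes
    (a topically discriminative word) yields a high weight.
    """
    class_counts = defaultdict(Counter)          # word -> class frequency counts
    for tokens, label in zip(docs, labels):
        for word in tokens:
            class_counts[word][label] += 1

    n_classes = len(set(labels))
    h_max = math.log2(n_classes) if n_classes > 1 else 1.0   # arbitrary fallback
    weights = {}
    for word, counts in class_counts.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        weights[word] = h_max - h                # class-specific words score high
    return weights
```

Multiplying these weights into an ordinary term-frequency vector gives a topically guided variant of the VSM representation.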
Our guided self-organization approach is motivated in a similar manner, but we further integrate
topical and semantic information from WordNet.
Because a document-training set with preclassified
information implies relationships between a word
and its preference class, we propose a novel document vector representation approach that extracts these
relationships for document clustering. Furthermore,
by merging statistical methods, competitive neural models, and semantic relationships from the symbolic WordNet, our hybrid learning approach is robust and
scales up to a real-world task of clustering 100,000
news documents.
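The abstract does not spell out the vector construction or the training set-up, so the following is only a rough, self-contained sketch of the general pipeline: documents are turned into weighted bag-of-words vectors (for example, using the entropy weights above) and mapped onto a small SOM grid via competitive learning. The paper's actual hybrid approach additionally incorporates WordNet relations, which this sketch omits; the grid size and learning-rate and neighbourhood schedules are arbitrary illustrative choices.

```python
import numpy as np

def train_som(vectors, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a tiny 2-D self-organizing map on row vectors (one per document)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, vectors.shape[1]))
    # Grid coordinates, used by the Gaussian neighbourhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    n_steps, step = epochs * len(vectors), 0
    for _ in range(epochs):
        for x in vectors[rng.permutation(len(vectors))]:
            lr = lr0 * (1 - step / n_steps)                  # decaying learning rate
            sigma = max(sigma0 * (1 - step / n_steps), 0.5)  # shrinking neighbourhood
            # Best-matching unit: the node whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), (h, w))
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            influence = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
            weights += lr * influence * (x - weights)        # pull nodes toward x
            step += 1
    return weights

def assign_clusters(vectors, weights):
    """Vector-quantization step: map each document to its best-matching node."""
    h, w, _ = weights.shape
    flat = weights.reshape(-1, weights.shape[-1])
    return [np.unravel_index(np.argmin(np.linalg.norm(flat - v, axis=1)), (h, w))
            for v in vectors]
```

Each document thus ends up on a map node; nearby nodes hold topically similar documents, which is the combined projection, VQ, and clustering behaviour the SOM is used for here.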
@Article{HWS04a,
  author    = {Hung, Chihli and Wermter, Stefan and Smith, Peter},
  title     = {Hybrid Neural Document Clustering Using Guided Self-organisation and WordNet},
  journal   = {IEEE Intelligent Systems},
  volume    = {19},
  number    = {2},
  pages     = {68--77},
  year      = {2004},
  month     = {Mar},
  publisher = {IEEE},
}