Comparison of semantic and single term similarity measures for clustering Turkish documents


Yücesoy B., Öğüdücü S. G.

6th International Conference on Machine Learning and Applications, Ohio, United States Of America, 13 - 15 December 2007, pp.393-398

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/icmla.2007.52
  • City: Ohio
  • Country: United States Of America
  • Page Numbers: pp.393-398

Abstract

With the rapid growth of the World Wide Web (WWW), organizing the vast amount of online documents on the web according to their topics has become a critical issue. Even for search engines, it is very important to group similar documents in order to improve performance when a query is submitted to the system. Clustering is useful for taxonomy design and similarity search of documents in such a domain. Similarity is fundamental to many clustering applications on hypertext. In this paper, we study how similarity measures are used to cluster a collection of documents on a web site. Most document clustering techniques rely on single-term analysis of text, such as the vector space model. To better group related documents, we propose a new semantic similarity measure. We compare our measure with Wu-Palmer similarity and cosine similarity. Experimental results show that cosine similarity performs better than the semantic similarity measures. We demonstrate our results on Turkish documents. This is the first study that considers the semantic similarities between Turkish documents.
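
The two baseline measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and it does not attempt the new semantic measure proposed in the paper, which the abstract does not describe. The function names, the whitespace tokenization, and the toy English taxonomy (TOY_PARENT) are assumptions made purely for illustration; a real setup for Turkish text would need a proper tokenizer and a lexical hierarchy. Depth is counted with the root at depth 1.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bag-of-words term-frequency vectors."""
    # Assumption: naive lowercase + whitespace tokenization.
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy taxonomy (child -> parent); "entity" is the root.
TOY_PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in TOY_PARENT:
        path.append(TOY_PARENT[path[-1]])
    return path


def depth(concept):
    """Number of nodes on the path from the concept to the root (root = 1)."""
    return len(path_to_root(concept))


def wu_palmer(c1, c2):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(c1) + depth(c2)).

    Assumes both concepts belong to the taxonomy and share its root.
    """
    ancestors_2 = set(path_to_root(c2))
    # The first shared node on c1's upward path is the deepest common ancestor.
    lcs = next(a for a in path_to_root(c1) if a in ancestors_2)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))
```

In this toy hierarchy, wu_palmer("dog", "cat") is higher than wu_palmer("dog", "car") because the first pair shares a deeper common ancestor, whereas cosine_similarity only rewards documents that share literal terms.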