Comparison of similarity measures for clustering Turkish documents

Madylova, Ainura; Oguducu, Şule

doi:10.3233/ida-2009-0394



Madylova A., Oguducu S. G.

INTELLIGENT DATA ANALYSIS, vol.13, no.5, pp.815-832, 2009 (SCI-Expanded)

Publication Type: Article / Full Article
Volume Number: 13 Issue: 5
Publication Date: 2009
DOI Number: 10.3233/ida-2009-0394
Journal Name: INTELLIGENT DATA ANALYSIS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Page Numbers: pp.815-832
Istanbul Technical University Affiliated: Yes

Abstract

Text clustering has become an important part of web data organization with the rapid growth of the World Wide Web (WWW). Clustering simplifies the work of web search engines by grouping the large numbers of documents retrieved for a given query. The similarity measure used in clustering directly affects the resulting groups. Most document clustering techniques rely on single-term analysis of text, such as the vector space model. To improve the grouping of Turkish documents, we investigate several similarity measures based on the semantic similarity of terms. We also study some techniques for calculating document similarity. The aim of this paper is to study the effects of semantic and single-term similarity measures on the clustering results of Turkish documents. All experiments are carried out on Turkish web sites, taking into account the relationships between terms based on the ontology for the Turkish language.