An improvement of centroid-based classification algorithm for text classification


Çataltepe Z., Aygun E.

IEEE 23rd International Conference on Data Engineering Workshop, İstanbul, Turkey, 17 - 20 April 2007, pp.952-956

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/icdew.2007.4401090
  • City: İstanbul
  • Country: Turkey
  • Page Numbers: pp.952-956

Abstract

k-nearest neighbor and centroid-based classification algorithms are frequently used in text classification due to their simplicity and performance. While the k-nearest neighbor algorithm usually performs well in terms of accuracy, it is slow in the recognition phase, because the distances/similarities between the new data point to be recognized and all the training data need to be computed. On the other hand, centroid-based classification algorithms are very fast, because only as many distance/similarity computations as the number of centroids (i.e. classes) need to be done. In this paper, we evaluate the performance of the centroid-based classification algorithm and compare it to nearest mean and nearest neighbor algorithms on 9 data sets. We propose and evaluate an improvement on the centroid-based classification algorithm. The proposed algorithm starts from the centroids of each class and increases the weight of misclassified training data points in the centroid computation until the validation error starts increasing. The weight increase is based on the training confusion matrix entries for misclassified points. The proposed algorithm results in smaller test error than the centroid-based classification algorithm in 7 out of 9 data sets. It is also better than the 10-nearest neighbor algorithm in 8 out of 9 data sets.
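
The procedure outlined in the abstract could be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not give the exact update rule, so the cosine similarity, the row-normalized confusion matrix, the `step` size, the `max_iter` cap, and all function names here are assumptions made for the example.

```python
import numpy as np


def normalize_rows(X):
    """Scale each row to unit L2 norm (guards against all-zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def weighted_centroids(X, y, w, n_classes):
    """Weighted mean of each class's points, then unit-normalized.

    Assumes every class 0..n_classes-1 occurs at least once in y.
    """
    C = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        mask = y == c
        C[c] = (w[mask, None] * X[mask]).sum(axis=0) / w[mask].sum()
    return normalize_rows(C)


def predict(X, centroids):
    """Assign each row to the centroid with the highest cosine similarity."""
    return np.argmax(normalize_rows(X) @ centroids.T, axis=1)


def fit_reweighted_centroids(X_tr, y_tr, X_val, y_val, n_classes,
                             step=1.0, max_iter=50):
    """Start from plain centroids, upweight misclassified training points,
    and stop as soon as the validation error increases."""
    w = np.ones(len(y_tr))
    centroids = weighted_centroids(X_tr, y_tr, w, n_classes)
    best = centroids
    best_err = np.mean(predict(X_val, centroids) != y_val)

    for _ in range(max_iter):
        pred = predict(X_tr, centroids)
        wrong = pred != y_tr
        if not wrong.any():
            break  # nothing left to reweight

        # Training confusion matrix: rows are true classes, columns predictions.
        conf = np.zeros((n_classes, n_classes))
        np.add.at(conf, (y_tr, pred), 1)
        conf /= np.maximum(conf.sum(axis=1, keepdims=True), 1)

        # Assumed rule: a misclassified point gains weight in proportion to
        # how often its true class is confused with the predicted class.
        w[wrong] += step * conf[y_tr[wrong], pred[wrong]]

        centroids = weighted_centroids(X_tr, y_tr, w, n_classes)
        err = np.mean(predict(X_val, centroids) != y_val)
        if err > best_err:
            break  # validation error started increasing
        best, best_err = centroids, err

    return best, best_err
```

At recognition time only `predict` is needed, which costs one similarity computation per class rather than one per training point as in k-nearest neighbor. Labels are assumed to be integers in `0..n_classes-1`, and features are assumed to be non-negative term weights.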