Labeling Turkish News Stories with CRF

7th International Conference on Application of Information and Communication Technologies (AICT), Baku, Azerbaijan, 23-25 October 2013, pp. 384-388

Publication Type: Conference Paper / Full-Text Paper
City of Publication: Baku
Country of Publication: Azerbaijan
Pages: pp. 384-388
Affiliated with Istanbul Technical University: Yes

Abstract

The drastic increase in the number of documents on the Web calls for Semantic Web applications if the Web is to reach its full potential. Extracting the important phrases in a document makes it easier to find the expected information. This paper introduces a new approach that labels the main subject, main predicate, main location, and main date of an electronic document. The main subject label tells whom or what the document is about. The main predicate label tells what the subject is or does. The main location label tells where the activities took place, and the main date label tells when the events of the document took place. This methodology provides not only a high-level description of the content but also the attribute of a phrase in a document. Turkish news stories are selected as the experimental set. Human annotators label the stories manually to produce the training and test sets. Then, different models are implemented for each label to extract the labels automatically, and their output is compared with the manually labeled results to evaluate the approach.