Improved pathogen recognition using non-Euclidean distance metrics and weighted kNN

Tharmakulasingam M., Topal C., Fernando A., La Ragione R.

6th International Conference on Biomedical and Bioinformatics Engineering, ICBBE 2019, Shanghai, China, 13 - 15 November 2019, pp.118-124

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1145/3375923.3375956
  • City: Shanghai
  • Country: China
  • Page Numbers: pp.118-124
  • Keywords: Backward feature elimination (BFE), Bioinformatics, Distance metrics, Feature selection, Machine learning, Pathogen detection
  • Istanbul Technical University Affiliated: Yes


The timely identification of pathogens is vital for controlling diseases effectively and avoiding antimicrobial resistance. Non-invasive point-of-care diagnostic tools are increasingly used to identify pathogens and are especially helpful in rural areas. Machine learning approaches have been widely applied to biological markers for predicting diseases and pathogens. However, few studies in the literature have used volatile organic compounds (VOCs) as non-invasive biological markers to identify bacterial pathogens. Furthermore, there is no comprehensive study investigating the effect of different distance and similarity metrics on pathogen classification based on VOC data. In this study, we compared various non-Euclidean distance and similarity metrics with the Euclidean metric to identify the VOCs that contribute significantly to pathogen prediction. In addition, we used the backward feature elimination (BFE) method to accurately select the best set of features. The dataset used in the experiments was compiled from publications published between 1977 and 2016 and consisted of associations between 703 VOCs and 11 pathogens. We performed an extensive set of experiments with five different distance metrics, in both uniform and weighted settings. The experiments showed that pathogens can be correctly predicted with 78.6% accuracy using 68 of the 703 VOCs, a k-nearest neighbour classifier, and the Sorensen distance metric.
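To make the uniform versus weighted comparison concrete, the following is a minimal sketch of a k-nearest neighbour vote using the Sorensen distance in its common form, sum(|a - b|) / sum(a + b). It is not the authors' implementation: the inverse-distance weighting, the function names, and the toy presence/absence data are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def sorensen_distance(a, b):
    """Sorensen distance between two non-negative vectors:
    sum(|a_i - b_i|) / sum(a_i + b_i). Returns 0.0 if both are all zeros."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def knn_predict(train_X, train_y, query, k=3, weighted=True):
    """Predict a label from the k nearest training samples.

    weighted=False: each neighbour casts one vote (uniform).
    weighted=True: each neighbour votes with weight 1 / distance
    (an assumed weighting scheme, not taken from the paper).
    """
    neighbours = sorted(
        (sorensen_distance(x, query), label)
        for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        # Small epsilon avoids division by zero for identical samples.
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Made-up toy data: each row is a sample, each column marks the
# presence (1) or absence (0) of a hypothetical VOC.
train_X = [
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
]
train_y = ["pathogen_A", "pathogen_B", "pathogen_B"]
query = [1, 1, 1, 0, 0]

uniform = knn_predict(train_X, train_y, query, k=3, weighted=False)
weighted = knn_predict(train_X, train_y, query, k=3, weighted=True)
print("uniform vote: ", uniform)
print("weighted vote:", weighted)
```

In this toy case the two settings disagree: the uniform vote follows the two more distant neighbours, while the weighted vote follows the single much closer one. This is the kind of difference a uniform versus weighted comparison is meant to expose.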