GP-Fileprints: File Types Detection Using Genetic Programming


Kattan A., Galvan-Lopez E., Poli R., O'Neill M.

13th European Conference on Genetic Programming, İstanbul, Turkey, 7-9 April 2010, vol.6021, pp.134-135

  • Publication Type: Conference Paper / Full Text
  • Volume: 6021
  • City: İstanbul
  • Country: Turkey
  • Page Numbers: pp.134-135

Abstract

We propose a novel application of Genetic Programming (GP): identifying file types by analysing raw binary streams, i.e., without using metadata. GP evolves programs with multiple components. One component analyses statistical features extracted from the raw byte series to divide the data into blocks. Another component then analyses these blocks to obtain a signature for each file in a training set. Two further evolved program components project the signatures onto a two-dimensional Euclidean space, where K-means clustering groups similar signatures. Each cluster is then labelled according to the dominant label of its members. Once a program that achieves good classification has been evolved, it can be used on unseen data without any further evolution. Experimental results show that GP compares very well with established file classification algorithms (namely Neural Networks, Bayes Networks and J48 Decision Trees).
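The clustering and labelling stage described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the evolved GP components are replaced by a hand-written, hypothetical `signature` function (scaled mean byte value and byte entropy), the block-splitting step is omitted, and `kmeans`, `train`, `classify` and `TOY_FILES` are names invented here for illustration. Only the overall flow follows the abstract: map each file to a 2-D point, run K-means, label each cluster by its dominant training label, then classify unseen data by nearest centroid.

```python
import math
from collections import Counter


def signature(data):
    """Map raw bytes to a 2-D point: (scaled mean byte value, scaled entropy).

    Assumption: these two hand-picked statistics stand in for the evolved
    program components; they are not the features used in the paper.
    """
    if not data:
        return (0.0, 0.0)
    n = len(data)
    mean = sum(data) / (255 * n)
    entropy = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
    return (mean, entropy / 8)  # both coordinates lie in [0, 1]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, iters=50):
    """Plain K-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids


def train(files, k):
    """files: list of (raw_bytes, label). Returns (centroids, cluster_labels)."""
    points = [signature(data) for data, _ in files]
    centroids = kmeans(points, k)
    votes = [Counter() for _ in range(k)]
    for point, (_, label) in zip(points, files):
        votes[nearest(point, centroids)][label] += 1
    # Each cluster takes the dominant label of its training members.
    labels = [v.most_common(1)[0][0] if v else None for v in votes]
    return centroids, labels


def classify(model, data):
    """Label unseen bytes by the nearest cluster; no retraining is needed."""
    centroids, labels = model
    return labels[nearest(signature(data), centroids)]


# Made-up toy training set: repetitive ASCII text vs. uniformly spread bytes.
TOY_FILES = [
    (b"hello world " * 50, "text"),
    (b"the quick brown fox " * 40, "text"),
    (bytes(range(256)) * 4, "binary"),
    (bytes((i * 7 + 3) % 256 for i in range(1024)), "binary"),
]

model = train(TOY_FILES, k=2)
print(classify(model, b"lorem ipsum dolor sit amet " * 30))
```

In the approach the abstract describes, the projection onto the plane is itself evolved, so a fixed `signature` like this one only shows where the evolved components would plug in.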