When does synthetic data generation work? Yapay örnek üretimi ne zaman işe yarar?


Topal A., Amasyalı M. F.

29th IEEE Conference on Signal Processing and Communications Applications, SIU 2021, Virtual, Istanbul, Turkey, 9 - 11 June 2021

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/siu53274.2021.9477956
  • City: Virtual, Istanbul
  • Country: Turkey
  • Keywords: Border Examples, Random Generation, Sampling, Synthetic Data Generation

Abstract

© 2021 IEEE. Synthetic data generation is one of the methods used in machine learning to improve the performance of algorithms on datasets. However, these methods do not guarantee success on every dataset. This study investigates which types of synthetic data generation algorithms are useful on which datasets by examining the effects of the SMOTE, Borderline-SMOTE and Random data generation algorithms on 33 datasets. To this end, each dataset was fully balanced through synthetic data generation. To evaluate the results, the datasets were divided into three groups according to their imbalance ratio: balanced, partially balanced-unbalanced and unbalanced. ANN models were trained on the original datasets and on the datasets produced by each generation algorithm, and their performance was evaluated on the test set. The experimental results show that adding synthetic data with these algorithms generally improves performance on balanced and partially balanced-unbalanced datasets, but generally does not help on unbalanced datasets. Borderline-SMOTE, which generates border samples, was more successful on balanced datasets, and SMOTE was more successful on partially balanced-unbalanced datasets.
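
The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it shows only the basic SMOTE idea (interpolating between a sample and one of its nearest same-class neighbours) and oversamples each smaller class until it matches the largest one. The function names `imbalance_ratio` and `smote_like_balance`, the neighbour count `k=3`, and the definition of the imbalance ratio as majority count divided by minority count are assumptions made for illustration; the abstract does not specify them, and Borderline-SMOTE and Random generation are not covered here.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def smote_like_balance(X, y, k=3, seed=0):
    """Oversample every smaller class up to the size of the largest class.

    Each synthetic point lies on the segment between a randomly chosen
    sample and one of its k nearest same-class neighbours.
    Classes with fewer than two samples are left unchanged, because
    there is no neighbour to interpolate towards.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        points = [x for x, lab in zip(X, y) if lab == label]
        if count == target or len(points) < 2:
            continue
        for _ in range(target - count):
            base = rng.choice(points)
            # Nearest same-class neighbours of the base point, excluding itself.
            others = [p for p in points if p is not base]
            neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # interpolation position in [0, 1)
            X_out.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
            y_out.append(label)
    return X_out, y_out


# Toy two-class dataset: four samples of class 0, two of class 1.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (1.0, 1.0), (1.1, 0.9)]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_like_balance(X, y)
```

On this toy data the imbalance ratio before balancing is 2.0, and after balancing both classes have four samples. In practice a maintained library implementation would be preferable to this sketch.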