THE EFFECTS OF FEATURE SELECTION METHODS ON THE CLASSIFICATIONS OF IMBALANCED DATASETS

Femi Dwi Astuti(1), Indra Yatini Buryadi(2)


(1) Informatics, Universitas Teknologi Digital Indonesia, Daerah Istimewa Yogyakarta
(2) Informatics, Universitas Teknologi Digital Indonesia, Daerah Istimewa Yogyakarta
Corresponding Author

Abstract


Imbalanced data often leads to suboptimal classification. Datasets with a large number of attributes also tend to degrade classification results. One way to improve classification accuracy is to apply feature selection as a pre-processing step before classification. This research uses the information gain and gain ratio feature selection algorithms in the pre-processing stage and the Naïve Bayes algorithm for classification. The tests measure accuracy, precision, and recall for classification without feature selection, and accuracy for classification with information gain, gain ratio, and CBFS feature selection. The results are then compared to determine which feature selection algorithm performs best on data with imbalanced classes. On the default of credit card client dataset, the Naïve Bayes algorithm without feature selection achieved an accuracy of 64.27%. Information gain feature selection raised the accuracy by 5.27 percentage points (from 64.27% to 69.54%), while gain ratio feature selection raised it by 14.19 percentage points (from 64.27% to 78.46%). In this case, gain ratio is the more suitable method for data with highly varied attribute values.
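The difference between the two selection criteria named in the abstract can be illustrated with a minimal sketch. It assumes discrete-valued features (continuous attributes would first need discretization) and uses a small hypothetical toy dataset, not the credit card data from the study. The function names are illustrative only and do not come from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    if split_info == 0:  # feature has a single value, so it cannot split
        return 0.0
    return information_gain(feature, labels) / split_info


# Hypothetical imbalanced toy data: 6 majority-class and 2 minority-class rows.
labels = ["no"] * 6 + ["yes"] * 2
two_valued = ["x"] * 6 + ["y"] * 2   # two values, separates the classes
row_id = list(range(8))              # unique per row, also "separates" them

for name, feature in [("two_valued", two_valued), ("row_id", row_id)]:
    print(name, information_gain(feature, labels), gain_ratio(feature, labels))
```

In this toy case both features receive the same information gain, but the gain ratio of the many-valued `row_id` feature is much lower because its split information is large. This is the general property that makes gain ratio less biased toward attributes with many distinct values.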

Keywords


Imbalance Class, Gain Ratio, Information Gain, Naïve Bayes

DOI: 10.56327/ijiscs.v6i3.1279
