Random and Synthetic Over-Sampling Approach to Resolve Data Imbalance in Classification

Mardhiya Hayaty(1*), Siti Muthmainah(2), Syed Muhammad Ghufran(3)


(1) Amikom Yogyakarta University
(2) Amikom Yogyakarta University
(3) Department of Mathematics Abdul Wali Khan University, Mardan Garden Campus
(*) Corresponding Author

Abstract


High accuracy is one of the measures of success for a classifier in predicting classes: the higher the value, the more class predictions are correct. One way to improve accuracy is to train on a dataset with a balanced class composition. However, it is difficult to ensure that a dataset has a balanced class composition, especially for rare cases. This study used a blood donor dataset in which the classification task is to predict whether a donor is feasible or not feasible; in this case, the reward ratio is quite high. This work aims to increase the amount of minority-class data both randomly and synthetically so that the amount of data in the two classes is balanced. Applying synthetic over-sampling (SOS) and random over-sampling (ROS) increased the accuracy of inappropriate class recognition from 12% to 100% with the KNN algorithm. In contrast, the accuracy of the naïve Bayes algorithm did not change, remaining at 89% before and after the balancing process.
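The two balancing strategies named in the abstract can be illustrated with a minimal sketch. It assumes that the synthetic approach is a SMOTE-style interpolation between a minority sample and one of its nearest minority neighbors; the abstract does not confirm this. The function names `random_oversample` and `synthetic_oversample`, the parameter `k`, and the toy two-feature data are all hypothetical, and the data is not the blood donor dataset used in the study.

```python
import random


def random_oversample(minority, target_size, rng):
    """Random over-sampling: duplicate randomly chosen minority rows
    (with replacement) until the class reaches target_size rows."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def synthetic_oversample(minority, target_size, rng, k=3):
    """SMOTE-style over-sampling (an assumed variant): create new rows on the
    line segment between a minority row and one of its k nearest minority
    neighbors. Requires at least two minority rows."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen row, excluding the row itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return minority + synthetic


# Toy imbalanced data (illustrative only): 50 majority rows, 6 minority rows.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(6)]

ros = random_oversample(minority, len(majority), rng)
sos = synthetic_oversample(minority, len(majority), rng)
print(len(majority), len(ros), len(sos))  # all three classes now have 50 rows
```

Random over-sampling only repeats existing rows, whereas the synthetic variant adds new rows that lie between existing minority rows. In practice, balancing is usually applied to the training split only, so that duplicated or interpolated rows do not leak into the evaluation data.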

 






DOI: https://doi.org/10.29099/ijair.v4i2.152

________________________________________________________

International Journal Of Artificial Intelligence Research

Organized by: Departemen Teknik Informatika STMIK Dharma Wacana
Published by: STMIK Dharma Wacana
Jl. Kenanga No.03 Mulyojati 16C Metro Barat Kota Metro Lampung

IJAIR is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.