Random and Synthetic Over-Sampling Approach to Resolve Data Imbalance in Classification

High accuracy is one of the measures of a classifier's success in predicting classes: the higher the value, the more class predictions are correct. Many classification algorithms have been used, such as Naïve Bayes, C4.5, and KNN. Researchers continue to develop these algorithms to obtain the best accuracy. For example, [1] upgraded the Evolutionary Algorithm (EA) of previous studies by automatically selecting the best classifier, compared to random forest. Classification has also been applied to multiclass datasets and multi-level classification for medical purposes, namely the prediction of multiple skin lesions [2].

Research on imbalanced data has been widely carried out with a variety of approaches. [7]-[9] address the imbalanced data problem at the algorithm level using ensembles. Research [8] used RUSBoost, LogitBoost, and AdaBoostM1 and found that the best model, RUSBoost, can classify imbalanced data at high imbalance ratios. [10] proposes Modified Boosted SVM (MBSVM), which improves Wang's Boosted SVM algorithm by updating the imbalance based on distance weights; MBSVM was evaluated on 43 datasets.
In the data-level approach, [11] used random sampling methods (Linear, Shuffled, and Stratified) to overcome imbalanced data and succeeded in increasing accuracy, with values of accuracy, precision, recall and AUC > 0.8%.
Data imbalance arises when cases of one class are rare. In the health case of blood donors, most donors are people who routinely donate blood, so they fall into the category of people who are eligible to donate. Predictions for the eligible class have high accuracy compared to the ineligible class, because the number of unfit donors is so small that their predictive accuracy is very low. The majority class has far more data than the minority class, so the classes become unbalanced, causing prediction errors. This study proposes an approach at the data level: increasing the number of minority class data randomly and synthetically so that the amount of data in both classes is balanced.

A. Dataset
The experiment used a primary blood donor dataset from the blood transfusion unit of a hospital. The dataset contains 246 donor records with 38 features. The work was implemented in Python, using libraries for pre-processing, data balancing, and the classification algorithms.

B. Research Stages
The pre-processed data are divided into training data and testing data. The imbalance ratio between classes is calculated to see whether data imbalance occurs. SMOTE oversampling and random oversampling are applied to the minority class to produce balanced data in all categories. The Naïve Bayes and K-NN classification algorithms are then implemented to find the model with the best accuracy. Finally, the classification is evaluated to compare performance without a balancing technique against performance with a balancing method. The work steps are shown in the chart below.
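As a rough illustration of the imbalance-ratio step, the sketch below counts class labels and divides the majority count by the minority count. The function name `imbalance_ratio` and the label list are hypothetical. The counts of 142 qualified and 9 unqualified training samples are an assumption inferred from the figures reported in the conclusion, not a reproduction of the study's code.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority_count / minority_count for a list of class labels."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority


# Assumed training labels (illustrative): 142 qualified, 9 unqualified.
train_labels = ["qualified"] * 142 + ["unqualified"] * 9
ir = imbalance_ratio(train_labels)
print(f"imbalance ratio is about 1:{ir:.1f}")
```

Under these assumed counts the ratio rounds to 1:16, which matches the value reported for the training data.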

C. Pre-Processing
Pre-processing aims to build an understanding of the data, improve data quality, and make the data mining results more useful. This process makes up a large portion of the work, taking almost 70% of the data mining process. Pre-processing work includes [12] data cleaning, data integration, normalization, and data discretization.

D. Naïve Bayes
Naïve Bayes is a supervised, statistical classification method that assigns a class by looking for probabilities [13]. It classifies data by estimating the probability P(H | X), where X is the evidence and H is the hypothesis. P(H | X) is the posterior probability of H given X, and P(X | H) is the likelihood, or conditional probability, of X given H [12]:
$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$
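The posterior calculation can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the helper names `naive_likelihood` and `posterior` are hypothetical, and the priors and per-feature probabilities are made-up numbers. The "naïve" assumption appears where P(X | H) is taken as the product of per-feature probabilities.

```python
import math


def naive_likelihood(feature_probs):
    """P(X | H) under the naive assumption: product of per-feature probabilities."""
    return math.prod(feature_probs)


def posterior(prior, likelihood):
    """Apply Bayes' rule: P(H | X) = P(X | H) * P(H) / P(X) for each hypothesis H."""
    # P(X) is obtained by summing P(X | H) * P(H) over all hypotheses.
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    if evidence == 0:
        raise ValueError("evidence P(X) is zero; posterior is undefined")
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Hypothetical values for illustration only.
prior = {"qualified": 0.9, "unqualified": 0.1}
likelihood = {
    "qualified": naive_likelihood([0.5, 0.4]),     # two made-up feature probabilities
    "unqualified": naive_likelihood([0.75, 0.8]),
}
post = posterior(prior, likelihood)
predicted = max(post, key=post.get)  # class with the highest posterior
print(post, predicted)
```

With these illustrative numbers the posterior favors the qualified class at roughly 0.75, which shows how a large prior for the majority class can outweigh a stronger likelihood for the minority class.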

E. K-Nearest Neighbour (KNN)
The K-Nearest Neighbor method searches for the K data objects or training patterns closest to an input pattern [14], then selects the class that occurs most often among them. The value of K is the number of nearest neighbors involved in determining the predicted class label of a test sample; the label is chosen by class voting among the K neighbors. Distances to neighbors are calculated using the Euclidean distance formula [15]:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where x and y are two samples with n features.
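The neighbor search and voting described above might look like the following sketch. It uses only the standard library, and the function names, the two-feature toy points, and the choice of K = 3 are assumptions made for illustration rather than details from the study.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    # Sort training indices by distance to the query and keep the k closest.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per donor.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["qualified", "qualified", "qualified", "unqualified", "unqualified"]

near_majority = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
near_minority = knn_predict(train_X, train_y, (5.1, 5.0), k=3)
print(near_majority, near_minority)
```

An odd K is used here so that a two-class vote cannot tie. With very few minority samples, a larger K makes it harder for the minority class to win the vote, which is one way class imbalance affects KNN.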

F. SMOTE & ROS
Imbalanced data is handled using the Synthetic Minority Over-sampling Technique (SMOTE) [16]. SMOTE replicates sample data in the minority class; the replicas are known as synthetic data. Synthetic data are obtained at random from the nearest sample data [17], in the amount set by the desired duplication percentage. Class imbalance is a condition where the majority class has more samples than the minority class, for example data with a ratio of 1:100, where 1 is the minority and 100 is the majority [18]. Random Over Sampling (ROS), in contrast, adds data to the minority class randomly without adding variation to the class data [19].
Class imbalance causes machine learning models to classify classes incorrectly. SMOTE addresses this by making replicas of the minority class, known as synthetic data. For each minority sample, synthetic data are generated according to the desired duplication percentage.
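The difference between the two techniques can be sketched as follows. This is a simplified, standard-library illustration and not the library routines used in the study. The function names and toy points are hypothetical, and the sketch oversamples up to a target count rather than by a duplication percentage. ROS copies existing minority samples, while the SMOTE-style routine interpolates between a minority sample and one of its k nearest minority neighbors.

```python
import math
import random


def random_oversample(minority, n_target, rng=None):
    """ROS: duplicate randomly chosen minority samples until n_target samples exist."""
    rng = rng or random.Random()
    extra = [rng.choice(minority) for _ in range(max(0, n_target - len(minority)))]
    return list(minority) + extra


def smote_oversample(minority, n_target, k=2, rng=None):
    """SMOTE-style sketch: add synthetic points between a sample and a near neighbor."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    samples = list(minority)
    while len(samples) < n_target:
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbors of the chosen base sample.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        samples.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return samples


# Toy minority class (hypothetical values), oversampled to 10 samples.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
ros = random_oversample(minority, 10, rng=random.Random(0))
smote = smote_oversample(minority, 10, k=2, rng=random.Random(0))
print(len(ros), len(smote))
```

Every ROS sample is an exact copy of an original point, whereas the synthetic SMOTE points fall between existing minority samples, which is the added variation the text refers to.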

G. Confusion Matrices
No system performs 100% correctly, so a classification system must measure its performance using a confusion matrix. A confusion matrix is a table that records the results of classification performance [12].
• True positives (TP): These are the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
• True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
• False positives (FP): These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys computer = no for which the classifier predicted buys computer=yes). Let FP be the number of false positives.
• False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer=no). Let FN be the number of false negatives.
The accuracy value is computed with the accuracy formula to test correctness [20]. Precision measures how much of the data predicted as positive is actually relevant (the number of true positives), while recall measures how much of the relevant data (true positives + false negatives) is correctly retrieved.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Performance is measured from the confusion matrix using precision, recall, and accuracy. Precision is the degree to which the returned answers match the information requested, while recall is the degree of success in retrieving the relevant information.
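The three measures can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the counts passed to it are hypothetical and are not results from the study. Accuracy is taken in its usual form, (TP + TN) divided by all predictions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when no positives are predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```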

III. Results and Discussion
The blood donor dataset contains 247 records, of which 238 are qualified donors and 9 are unqualified. The dataset consists of training data and testing data. According to the table above, the training data for the qualified and unqualified classes are imbalanced, with an imbalance ratio of 1:16. The imbalance ratio formula [21] is
$$IR = \frac{N^{-}}{N^{+}}$$
where $N^{-}$ is the number of majority class samples and $N^{+}$ is the number of minority class samples.
The SMOTE Oversampling (SOS) and Random Oversampling (ROS) techniques balanced the minority class (unqualified) and the majority class (qualified) using Python library tools, such as the following code. The ROS method increased the amount of minority class data (unqualified) by sampling randomly from the original class data. SOS not only increased the amount of data but also added variation through synthetic data derived from the original class. After balancing, the amount of unqualified data equals the amount of qualified data.
The training data were modeled using the KNN and Naïve Bayes algorithms. Experiments applied values of K = 1 to K = 10 for the KNN algorithm. The classification evaluation uses the confusion matrix and compares accuracy without data balancing and with data balancing (SOS and ROS) in the following table. Table 3 above shows reasonably high accuracy with unbalanced data for both classification algorithms, but the predictions are biased toward the majority class. The accuracy of the Naïve Bayes algorithm shows no difference before and after balancing, as in the graph below.
The recall value in Table 4 (KNN algorithm) without balancing is very low for the minority class: the system recognizes the unqualified class correctly only 12% of the time, while the qualified class is recognized 100% of the time. Balancing the minority class raised the prediction of the unqualified class significantly, to 100%, so that the predictions for the qualified and unqualified classes no longer differ significantly.
The SOS and ROS balancing methods raise recall to an almost balanced level for the K-NN classifier, as shown in Figure 6 below.
Figure 6. Recall - KNN
The Naïve Bayes algorithm produces better predictions than the KNN implementation even without the balancing process. Its predictions for both classes do not change before and after balancing: the value stays at 100% for the qualified class and 89% for the unqualified class. Likewise, in research [22], applying SMOTE Oversampling (SOS) only slightly increased accuracy.
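The observation that high accuracy on unbalanced data can hide a bias toward the majority class can be checked with a small worked example. The sketch below uses the class counts reported for the dataset (238 qualified, 9 unqualified) but a purely hypothetical classifier that always predicts the majority class. It does not reproduce the study's KNN or Naïve Bayes results, and the function name `per_class_recall` is an assumption.

```python
def per_class_recall(y_true, y_pred):
    """Return recall for each class: correct predictions / actual members of the class."""
    recalls = {}
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls[label] = hits / len(idx)
    return recalls


# Class counts from the dataset; predictions from a hypothetical majority-only classifier.
y_true = ["qualified"] * 238 + ["unqualified"] * 9
y_pred = ["qualified"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(recalls)
```

Accuracy here is 238/247, about 96%, even though no unqualified donor is recognized. This is why per-class recall, rather than accuracy alone, is the more informative measure on this dataset.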

IV. Conclusion
Data with class imbalance tend to produce classification results that lean toward the majority class (qualified) rather than the minority class (unqualified). The blood donor dataset has a reasonably high imbalance ratio between the qualified and unqualified classes. This study achieved a balanced amount of data in both classes: the minority class, with only 9 samples, was increased to 142 samples. Applying SOS and ROS increased recognition of the unqualified class from 12% to 100% with the KNN algorithm. In contrast, the Naïve Bayes algorithm showed no change before and after the balancing process, remaining at 89%.