A Comparison of Support Vector Machine, Logistic Regression, and Naïve Bayes for Sentiment Analysis Classification of Mobile App Users

ABSTRACT


Introduction
Mobile apps have a very large user base [1], which has led many companies to build mobile-based applications. As a result, users leave many evaluations of these applications in the form of comments and satisfaction ratings. Based on this, it is necessary to test whether the three algorithms used in this research can work optimally. Naïve Bayes, support vector machine, and logistic regression are all supervised learning algorithms [2], [3]. The application studied in this research is that of the health service provider BPJS, for which the data and the number of users are adequate. At the time of this research, it had been downloaded more than ten million times.
The text used as a reference to assess user sentiment is also suited to the three selected algorithms [4]. Supervised learning algorithms in data mining depend on the presence of "training data" consisting of inputs paired with their corresponding outputs. The algorithm learns the relationship between these inputs and outputs to build a predictive model that can predict unknown outputs for a given input [5].
Data mining has the advantage of being able to process information that is still unclear or vague [6]. The data is mined through a scraping process, the process then moves on to the classification stage, and after that the accuracy of the three algorithms can be compared. Data is the most important element; it can be used to obtain an evaluation from the users of a mobile-based system or application. The assessment or acceptance results of a mobile application during its trial stage are important, but ratings and comments from actual users are also important input for mobile application developers. Data mining is the answer to the need to retrieve data from any medium.
In this research, data mining is carried out on the Google Play Store, a mobile application distribution service that provides data in the form of comments and ratings. After scraping, with the parameters set to retrieve the 2,000 most recent comments, the data is pre-processed by removing emoticon characters and eliminating unneeded variables so that it can be passed to the next stage, namely classification based on ratings and comment sentiment. The algorithms compared in this research are support vector machine, logistic regression, and Naïve Bayes, which are known to be reliable in data mining. The accuracy results are 88% for SVM, 90.5% for logistic regression, and 91% for Naïve Bayes. Optimizing these three fundamental algorithms could be a recommendation for further research.

Method
This research proceeds through only a few stages, which are shown in Figure 1, the research flow. The stages conclude with algorithm performance testing.

Collecting Data
The data was collected by scraping the Google Play Store website after the parameters and the amount of data required had been determined. The 2,000 most recent comments were collected because they reflect user responses to the application as it is currently being developed. Figure 2 shows the flow of the data mining process on the Google Play Store website [7], [8].

Fig 2. Flow chart of data collection
This entire phase is carried out in Google Colab, and its implementation runs only from the web.

Pre-Processing
The next process is to clean the data, which initially contains many punctuation marks, emoticons, and other noise. This is done with case folding, tokenizing, stemming, stopword removal, and TF-IDF [9], [10]. Pre-processing prevents irregular, incomplete, and inconsistent data. This stage, which includes steps such as the elimination of non-standard language, largely determines how well the algorithm will work. The Python code must be given parameters specifying which languages or words are to be removed and which are to be retained. The result of this stage is what optimizes the algorithm in the subsequent work.
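The cleaning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this research: the function names preprocess and tf_idf, the short stopword list, and the two sample reviews are hypothetical; stemming is left out because it needs a language-specific stemmer; and the TF-IDF weighting is the plain unsmoothed form, so a library implementation would give different weights.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration only; a real run would use a
# complete stopword list for the language of the reviews.
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}


def preprocess(text, stopwords=STOPWORDS):
    """Case-fold, strip non-letter characters, tokenize, and drop stopwords."""
    text = text.lower()                    # case folding
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation, digits, emoticons
    tokens = text.split()                  # whitespace tokenizing
    return [t for t in tokens if t not in stopwords]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / len(doc) and idf = log(N / df), without smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample reviews, not taken from the scraped dataset.
reviews = [
    "Aplikasi ini sangat membantu!!! 👍",
    "Aplikasi error terus, tidak bisa login :(",
]
tokens = [preprocess(r) for r in reviews]
vectors = tf_idf(tokens)
```

A term that occurs in every review, such as "aplikasi" here, receives a weight of zero, which is the intended effect of the inverse document frequency factor: words shared by all comments carry no information for separating positive from negative sentiment.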

Classification
Classification is one of the main tasks of supervised learning in machine learning and data mining. The name reflects the main purpose of the technique, which is to predict the class of the input data [11], [12]. In this research, classification is a computational process that uses data mining algorithms to process the set of reviews of the studied application from the Google Play Store. Several algorithms are commonly used for data mining classification, including Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression.

Logistic Regression
Logistic regression is a type of regression analysis used to explain the relationship between a dependent variable and one or more independent variables. The classes can be 0 and 1, true or false, major or minor. The dependent variable is categorical, which distinguishes logistic regression from multiple or other linear regression [13]-[15]. The logistic regression equation is expressed by Eq. (1).
B0 is a constant, B1 is the coefficient of each variable, and the value of p is found in equation (2).
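For reference, the definitions above are consistent with the standard form of the logistic model, which can be written as follows. The symbol X for the independent variable is an assumption, the equation numbers follow the references to equations (1) and (2) in the text, and with several independent variables the right-hand side gains one coefficient term per variable.

```latex
\begin{align}
\ln\!\left(\frac{p}{1-p}\right) &= B_0 + B_1 X \tag{1} \\
p &= \frac{1}{1 + e^{-(B_0 + B_1 X)}} \tag{2}
\end{align}
```

Here p is the probability that an observation belongs to the class coded as 1; a comment would typically be assigned to that class when p exceeds a threshold such as 0.5.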

Naïve Bayes
Naïve Bayes is one of the algorithms used to separate classes in the classification process. It is frequently used in data mining, which makes it one of the most popular algorithms. The approach this algorithm uses in data mining research is given in formula (3) [16]-[18].
X is the data whose class is unknown, H is the hypothesis that X belongs to a specific class, P(X|H) is the conditional probability of X given the hypothesis H, and P(H) is the prior probability of the hypothesis H.
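The symbols defined above match Bayes' theorem in its standard form, which can be written as follows. The terms P(H|X), the posterior probability of the hypothesis H given the data X, and P(X), the probability of the data X, are not defined in the text and are stated here as assumptions.

```latex
\begin{equation}
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{3}
\end{equation}
```

The classifier assigns X to the class whose hypothesis H gives the largest P(H|X); the "naïve" part is the assumption that the features of X are conditionally independent given the class, so that P(X|H) factors into a product of per-feature probabilities.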

Support Vector Machine
The support vector machine algorithm is a popular machine-learning technique for text classification and performs well in many fields. The SVM finds the hyperplane that separates two different classes while maximizing the margin, that is, the distance between the hyperplane and the data points closest to it. In this research, a kernel formula is used, as seen in formula (4).

Evaluation
This step is taken to ensure the validity of the test; the aim of the evaluation is to find the best result among the test results [20]. The accuracy of the model is measured using a confusion matrix. A confusion matrix is a tool for analyzing how well a classification model identifies data from different classes [21].

Result of Classification
This research obtained results for its target object, the mobile application of BPJS, the organizer of national health insurance in Indonesia. The resulting sentiment can be seen in Figure 4. The results were obtained from data mining using scraping techniques; after classification, there are more than 1,400 positive comments in the data, while only 500 comments have negative sentiment and ratings. These results are more than adequate from the standpoint of mobile application development, as assessed directly by users.

Comparison of algorithms
The comparison in this research covers accuracy, precision, and recall, using the same dataset and test data for all algorithms. The results can be seen in Table 1. The details can be seen in the output of the code run during the classification process, shown in Figures 5, 6, and 7.

Evaluation
In this research, the evaluation is based on the confusion matrix, which shows the positive and negative predictions and whether each outcome is correct or not.

Conclusion
The results of this research show that Naïve Bayes achieved the highest sentiment analysis accuracy, followed by logistic regression and then the support vector machine. Thus, data mining performed on the Google Play Store can use the Naïve Bayes algorithm and compare it with other supervised learning algorithms. The outcome also depends on which mobile application is used as the object because, in some cases, the process fails to obtain data for applications created under the .gov extension.

Fig 4. Classification of Positive and Negative Emotions

Fig 8. Confusion Matrix SVM
Based on Figure 8, the number of TP is 108, FP is 28, FN is 39, and TN is 425. Calculated manually, TP + FP + FN + TN = 600, and (TP + TN)/600 = 533/600 ≈ 0.888. The accuracy obtained from the Support Vector Machine confusion matrix is 88.8%.

Figure 9. Confusion Matrix Naïve Bayes
Based on Figure 9, the number of TP is 134, FP is 37, FN is 13, and TN is 416. Calculated manually, TP + FP + FN + TN = 600, and (TP + TN)/600 = 550/600 ≈ 0.916. The accuracy obtained from the Naïve Bayes confusion matrix is 91.6%.

Figure 10. Confusion Matrix Logistic Regression
Based on Figure 10, the number of TP is 116, FP is 26, FN is 31, and TN is 427. Calculated manually, TP + FP + FN + TN = 600, and (TP + TN)/600 = 543/600 = 0.905. The accuracy obtained from the Logistic Regression confusion matrix is 90.5%.
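The manual calculations for the three confusion matrices can be checked with a short Python sketch. The counts are the ones reported for Figures 8 to 10; the function name metrics and the dictionary layout are illustrative assumptions. Precision and recall are computed here for the class labelled positive in the confusion matrices, which is an assumption about how they are defined and may not match the averaging used for Table 1.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
        "recall": tp / (tp + fn),        # share of actual positives that are found
    }


# Counts as reported for the three confusion matrices.
counts = {
    "Support Vector Machine": dict(tp=108, fp=28, fn=39, tn=425),
    "Naïve Bayes": dict(tp=134, fp=37, fn=13, tn=416),
    "Logistic Regression": dict(tp=116, fp=26, fn=31, tn=427),
}

results = {name: metrics(**c) for name, c in counts.items()}

for name, m in results.items():
    print(
        f"{name}: n={m['total']}, accuracy={m['accuracy']:.4f}, "
        f"precision={m['precision']:.4f}, recall={m['recall']:.4f}"
    )
```

Each matrix sums to 600 test comments, and the accuracy ordering matches the reported one: Naïve Bayes first, then logistic regression, then the support vector machine.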