Detection of SQL Injection Attack Using Machine Learning Based on Natural Language Processing

ABSTRACT The number of cyberattacks has increased significantly, not only in Indonesia but in many countries, so the issue deserves attention and study. The Open Web Application Security Project (OWASP) publishes the Top-10 list of website security vulnerabilities, and SQL Injection is still one of the vulnerabilities that attackers often exploit. This research implemented and tested five algorithms: Naïve Bayes, Logistic Regression, Gradient Boosting, K-Nearest Neighbor, and Support Vector Machine. The study also uses natural language processing, as part of text processing, to increase detection accuracy. To that end, the primary dataset was converted to a corpus during the feature engineering stage so that it is easier to analyze. Two SQL Injection datasets were used: the first to train the classifier and the second to test the classifier's performance. In the tests, the Support Vector Machine achieved the highest detection accuracy, 0.9977, with a time of process of 0.00100 microseconds per query. In performance testing, the Support Vector Machine classifier detected 99.37% of the second dataset. The detection accuracy of the other tested algorithms was: K-Nearest Neighbor (0.9970), Logistic Regression (0.9960), Gradient Boosting (0.99477), and Naïve Bayes (0.9754).

The attack methods used to exploit these common vulnerabilities may vary. As an illustration, the attack method used against the injection security gap is SQL Injection (SQLi). In the OWASP Top-10 vulnerabilities, SQL Injection is part of the injection vulnerabilities.
The most challenging part of cyberattack detection is detecting insider attacks, which are usually noticed only after the attack has been carried out successfully [10]. According to some data, 70% of illegal hacking is committed from inside rather than outside, yet 90% of security controls and oversight are focused on external threats [11]. Machine learning-based systems can be used to perform anomaly detection: the detection system triggers an alarm when an object or component behaves differently from a predetermined regular pattern. The use of machine learning is therefore highly recommended. Dua defines machine learning as a computational process that infers and generalizes a learning model from a given dataset or sample [12]. Cherry states that SQL Injection is an attack carried out via modified SQL queries sent through browser requests. This type of attack can occur on website-based applications that use Active Server Pages (ASP) or Hypertext Preprocessor (PHP) and use SQL-based data [13], [14]. Kavitha defines SQL Injection as an action in which an attacker sends malicious SQL code to a website application [15], [16]. The attacker can access the database server when the malicious SQL code is successfully executed. Ahmad and Karim explained SQL Injection as a method to steal or access databases illegally [16]. They also stated that SQLi is an act of sending malicious code to web applications so that the database executes specific commands. Ogundijo said that an SQLi attack uses a website form or input to send SQL commands. After a successful attack, the attacker can access the database server [17]-[19].
Regarding the risk, Hirani et al. stated that an SQLi attack could be hazardous [20]. In some classifications, SQLi attacks are rated as high-severity web vulnerabilities. Roy et al. used Naïve Bayes to detect SQLi attacks with 98.33% accuracy [21].
In previous studies, several researchers have used machine learning to detect SQLi attacks, and the accuracy level obtained differs by researcher and algorithm. Hashem [22] simultaneously implemented the detection of SQLi attacks. Jemal [23] tested nine algorithms to detect SQLi attacks: Naïve Bayes, Back Propagation Neural Network, Neural Network Based Model, Neural Network, Decision Tree, Multi-Layer Neural Network, Support Vector Machine, K-Nearest Neighbor, and TBD-NNBr. The accuracy level obtained for each algorithm tested by Jemal can be seen in table 1. In addition, Hasan et al. [24] used a Heuristic Algorithm with an accuracy level of 93.8. Taking quite a different approach, Kranthikumar [25] used a REGEX classifier with a 97% detection accuracy rate.

A. Procedures and Optimization
There are five stages in producing the detection method. First, at the preparation stage, the dataset is prepared as needed, starting with crawling, merging, and so on. The second stage is data preprocessing and data modeling. At this stage, several general actions are taken: (a) eliminating empty or incomplete dataset rows, (b) eliminating duplicate data, and (c) performing data conversion as needed. The third stage is feature engineering, data training, and data testing. At the feature engineering stage, the actions taken are (a) corpus construction and (b) feature construction (see figure 2). The detection method uses the Python NLTK library so that each row of the dataset is converted into a corpus before training and testing. Fourth, after the classifier is generated, performance optimization is performed: each algorithm is tested to obtain its detection accuracy level. Five detection algorithms are tested, namely (a) Naïve Bayes; (b) Logistic Regression; (c) Gradient Boosting; (d) Support Vector Machine; and (e) K-Nearest Neighbor. The researchers choose the detection algorithm with the highest level of accuracy. Finally, the researchers conduct performance testing: the algorithm with the highest level of accuracy is tested on dataset rows that are new or different from those used in the training and testing process. As stated above, feature engineering uses corpus rules. Three corpus parameters are used: (a) lower case conversion, (b) alphanumeric filter, and (c) punctuation removal.
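The preprocessing actions and the three corpus rules described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library rather than NLTK; the function names `build_corpus_entry` and `preprocess` and the whitespace tokenization are assumptions made for the example, not the implementation used in the study.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def build_corpus_entry(query: str) -> list[str]:
    """Apply the three corpus rules to one dataset row (hypothetical helper)."""
    lowered = query.lower()                        # (a) lower case conversion
    no_punct = lowered.translate(_PUNCT_TO_SPACE)  # (c) punctuation removal
    tokens = no_punct.split()                      # assumed whitespace tokenization
    return [t for t in tokens if t.isalnum()]      # (b) alphanumeric filter


def preprocess(rows):
    """Drop empty or incomplete rows and duplicates, then build the corpus."""
    seen = set()
    cleaned = []
    for text, label in rows:
        if not text or not text.strip() or label is None:
            continue  # empty or incomplete row
        key = (text, label)
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        cleaned.append((build_corpus_entry(text), label))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("SELECT * FROM users", "non-payload"),
        ("", "payload"),
        ("SELECT * FROM users", "non-payload"),
        ("' OR 1=1 --", "payload"),
    ]
    for tokens, label in preprocess(sample):
        print(label, tokens)
```

Note that removing punctuation discards characters such as quotes and comment markers, so under these rules the classifier sees only the remaining keywords and literals.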

B. Machine Learning Algorithms
This study uses NLP-based machine learning. Recognition of text or attack patterns is carried out using Python NLTK. Five classification algorithms were tested, and the detection method selects the algorithm that achieves the highest accuracy level. The following are the formulas used in the algorithms. The following formula is Naïve Bayes. This algorithm makes the naive assumption that all of the defined features are independent of one another. In that context, each feature is associated with the label that appears. P(features) is handled implicitly: the algorithm calculates the numerator for every label (payload or non-payload). In addition, the following formula used in this study is Logistic Regression.
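The Naïve Bayes scoring described above can be illustrated with a small sketch: for each label, the numerator P(label) multiplied by the product of P(feature | label) is computed in log space, and the label with the largest score is chosen, so P(features) never has to be computed explicitly. The add-one smoothing and the function names `train_naive_bayes` and `predict` are assumptions made for this example; the study itself relies on Python NLTK.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: iterable of (tokens, label). Returns the counts needed for scoring."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, token_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log numerator."""
    label_counts, token_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(label)
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((token_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train_naive_bayes([
        (["or", "1", "1"], "payload"),
        (["union", "select"], "payload"),
        (["hello", "world"], "non-payload"),
        (["john", "smith"], "non-payload"),
    ])
    print(predict(["union", "select", "1"], model))
```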
Logistic regression classification models a discrete target variable y as a function of multiple feature variables. For each observation, the probability that y = 1 is modeled as a logistic function of a linear combination of the feature values. Each label yi is associated with a set of features xi, and logistic regression interprets the probability that the label belongs to one class as a logistic function of the combination of features. Gradient Boosting, in turn, can use several types of base learners. This experiment uses decision tree-based Gradient Boosting. The algorithm processes the dataset sequentially, adding predictors to the ensemble one after another so that previous prediction errors are corrected. Here, the ensemble is defined as a list of predictive decisions generated by machine learning, and the class predicted for each row of data is the dominant class. The following formula is Gradient Boosting. Support Vector Machine is an algorithm that determines a decision boundary, and this boundary determines the classification. The Support Vector Machine uses a linear model as its decision boundary. The general form of this process is as follows.
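The formula referred to here is not reproduced in this text. Assuming the standard linear-model form that matches the symbols defined in the next paragraph (a weight parameter w, a function of the input x, and a bias b), it can be written as:

```latex
y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b
```

Under this assumption, a sample is assigned to one class when y(x) is positive and to the other when it is negative.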
In formula 8, w is a weight parameter, φ(x) is a basis function, and b is a bias. The simplest linear model for the decision boundary is y(x) = w^T x + w0, where x is the input vector, w is a weight vector, and w0 is a bias. The decision boundary is thus defined by y(x) = 0, a (D-1)-dimensional hyperplane in the D-dimensional input space. The last algorithm tested is K-Nearest Neighbor. This algorithm uses the Euclidean Distance to find the labeled samples closest to the sample being predicted. In general, the Euclidean Distance formula can be described as follows.
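The Euclidean distance between two feature vectors is the square root of the sum of squared coordinate differences. The sketch below shows this calculation and a simple majority vote over the k nearest neighbors; the function names, the default k = 3, and the toy two-dimensional vectors are assumptions for illustration and do not reflect the configuration used in the study.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, training, k=3):
    """training: list of (vector, label). Majority vote among the k nearest vectors."""
    neighbors = sorted(training, key=lambda item: euclidean_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ((0, 0), "non-payload"),
        ((0, 1), "non-payload"),
        ((5, 5), "payload"),
        ((6, 5), "payload"),
        ((5, 6), "payload"),
    ]
    print(knn_predict((5, 4), training))
```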

C. Evaluation of Detection Method
In addition to using the confusion matrix, the researchers take two actions to further improve the quality of algorithm testing in detecting attack patterns or vectors: performance optimization and performance testing. In the performance optimization stage, the researchers tested the five algorithms with different margins, corpus rules, and configuration parameters to obtain a high detection accuracy. In the performance testing stage, the detection algorithm with the highest level of accuracy is tested for its detection quality on a different dataset. Unlike most machine learning workflows, which generally rely only on the accuracy obtained during data training and testing, this study applies performance testing to check whether the detection method can detect attack vectors that are "foreign" to the datasets used in the training and testing process. The dataset used at this stage is called the challenge dataset.
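As a brief illustration of the evaluation step, the sketch below builds a binary confusion matrix from actual and predicted labels and derives accuracy from it. Treating "payload" as the positive class and the function names are assumptions of this example.

```python
def confusion_matrix(actual, predicted, positive="payload"):
    """Count true/false positives and negatives for a binary labeling."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1
        elif a == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(cm):
    """Share of correctly classified rows: (TP + TN) / total."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


if __name__ == "__main__":
    actual = ["payload", "payload", "non-payload", "non-payload", "payload"]
    predicted = ["payload", "non-payload", "non-payload", "payload", "payload"]
    cm = confusion_matrix(actual, predicted)
    print(cm, accuracy(cm))
```

The same two functions can be applied unchanged to the challenge dataset, which is what distinguishes performance testing from the accuracy measured on the training and testing data.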

A. Dataset Characteristics
The SQLi dataset contains 30,904 rows in total. Data preprocessing eliminated 295 rows, so 30,609 eligible rows were used in the data training and testing process. As with the dataset for the XSS attack technique, two labels are used in the SQLi dataset, namely payload and non-payload. Of the 30,609 eligible rows, 11,341 carry the payload label and 19,268 carry the non-payload label.
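The row counts can be cross-checked with simple arithmetic: the raw total minus the eliminated rows should equal the sum of the two label counts. The variable names below are chosen for this example only.

```python
# Figures reported for the SQLi dataset.
raw_rows = 30_904
removed_rows = 295
payload_rows = 11_341
non_payload_rows = 19_268

# Rows remaining after preprocessing.
eligible_rows = raw_rows - removed_rows

# The two label counts should add up to the eligible total.
labels_match = (payload_rows + non_payload_rows) == eligible_rows

# Share of eligible rows labeled as payload, useful for judging class balance.
payload_share = payload_rows / eligible_rows

if __name__ == "__main__":
    print(eligible_rows, labels_match, round(payload_share, 4))
```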

B. Configuration of Detection Parameters
Optimization and evaluation of the detection methods for the SQLi attack technique are carried out in 20 stages, each of which tests the five algorithms. In other words, 100 algorithm tests were carried out with different parameter configurations. The configurations are produced by varying the margin parameter and the corpus rules.

C. Feature Set Injection
In an SQLi attack, the maximum number of vectors with a margin of 6.5 is 104. In the feature set injection stage, as many as 104 character vectors on the payload and non-payload labels are collected into a feature set. The feature set injection used to detect SQLi attacks can be seen in the following figure.

D. Feature Set Injection
Based on the experiments, SVM is the algorithm that detects with the highest level of accuracy, 0.99771. In addition, the resulting time of process (ToP) is also fast, remaining below one microsecond at 0.00100 microseconds per query. With this accuracy and ToP, SVM has proven reliable in detecting SQLi attacks.
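The text does not describe how the time of process was measured. One plausible way to obtain a per-query figure, shown here only as a sketch, is to time the classification of a batch of queries and divide by the batch size. The function name `time_of_process` and the keyword-based stand-in classifier are hypothetical and are not part of the study.

```python
import time


def time_of_process(classify, queries):
    """Return the predictions and the average wall-clock seconds per query."""
    if not queries:
        raise ValueError("queries must not be empty")
    start = time.perf_counter()
    predictions = [classify(q) for q in queries]
    elapsed = time.perf_counter() - start
    return predictions, elapsed / len(queries)


def stub_classifier(query: str) -> str:
    """Hypothetical stand-in for a trained classifier, for demonstration only."""
    return "payload" if "union" in query.lower() else "non-payload"


if __name__ == "__main__":
    preds, seconds_per_query = time_of_process(stub_classifier, ["1 UNION SELECT", "hello"])
    print(preds, f"{seconds_per_query * 1e6:.3f} microseconds per query")
```

Averaging over a large batch matters here, because a single call is usually too short to time reliably.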

E. Confusion Matrix, Accuracy Visualization, and ToP
SVM achieves the highest accuracy level compared to the other algorithms. With an accuracy rate of 0.997713, SVM detects SQLi attacks more accurately. In addition, the ToP of the algorithm is also reasonably fast because it requires only 0.00100 microseconds. The following is an SQLi attack data confusion matrix using SVM.

F. Performance Optimization
SVM is the algorithm of choice in the SQLi attack detection method, and the basis for this choice is clear. Based on the research stages that have been carried out, SVM proved reliable and achieved a very high level of accuracy. The lowest accuracy rate of SVM during the optimization stage was 0.89. After several optimization steps were carried out, together with the other algorithms, SVM achieved the highest level of accuracy, 0.9977, with a margin of 6.5. With a margin of 6.5, the required ToP is 0.001. The stages of SVM accuracy and ToP optimization can be seen in the following figure.

G. Performance Testing
Given the reality of today's cyberattacks, the patterns and forms of SQLi attacks can vary widely. SQLi attack detection methods based on machine learning must continue to be tested so that the resulting detection method is stable and consistent in its level of accuracy. Therefore, in addition to strengthening the learning process through comprehensive data training and testing and parameter switcher configuration, this study also applies a performance testing stage, referred to as the challenge. At this stage, the detection method that has been produced is tested on a different dataset. The challenge dataset tested on the selected SQLi attack detection method amounted to 33,727 rows. Data preprocessing is also carried out on the challenge dataset. In addition to testing the chosen detection method, this stage also tests the other algorithms to obtain performance comparisons and ToP data for each algorithm. The selected SVM algorithm accuracy is