Journal Classification Using Cosine Similarity Method on Title and Abstract with Frequency-Based Stopword Removal

Many text mining methods can be used to classify documents, including K-NN, Jaccard, and Cosine Similarity. Descriptions of the K-NN, Jaccard, and Cosine Similarity methods can be found in the literature [1], [2], and [3], respectively. Of the three, Cosine Similarity has been reported to produce the highest performance [4]. This can occur because the cosine similarity between two vectors depends on the word frequencies of the test and training documents [5]. The method normalizes by vector length when comparing word frequencies between two documents, which allows it to produce high accuracy values [6]. Before being classified, a document goes through a pre-processing stage.

This research focuses on a stopword removal pre-processing technique that combines an economic dictionary with a frequency-based dictionary. Stopword removal based on the economic dictionary eliminates important words related to economics from the documents, so that these words are no longer registered in the frequency-based stopword removal dictionary. The frequency-based stopword removal dictionary, in turn, is built from terms that appear in the documents with a certain frequency; such terms are assumed to be unimportant words that would give less relevant results in the calculation.

II. Method
This study classified economics journal articles (Artikel Jurnal Ekonomi) with the addition of frequency-based stopword removal. The stages consisted of collecting the dataset, pre-processing to make the data more structured, calculating VSM-based Cosine Similarity values, and testing the algorithm using k-fold cross-validation.

A. Research Dataset
The dataset used in this study is Artikel Jurnal Ekonomi Universitas Negeri Malang, written in Indonesian. It was collected in March 2019 and consists of 126 records, each containing a title and an abstract. The records were grouped into four labels.

B. Pre-processing
The function of pre-processing can be seen in the literature [15]. There were three (3) stages of pre-processing in this study: (1) case folding, (2) stopword removal based on the economic dictionary, and (3) frequency-based stopword removal.
The second stage was stopword removal based on the economic dictionary, which served to eliminate words related to economics. Thus, important words related to economics were not included when the frequency-based stopword removal dictionary was built. The economic dictionary used consisted of 331 words. The steps taken were:
• Tokenizing the documents and the dictionary.
• Matching all words in the documents with the words in the economic dictionary.
o If a word in the document is the same as a word in the economic dictionary, the word in the document is deleted.
o If a word in the document is not in the economic dictionary, it is assumed that the word will not affect the classification process.
• Recombining the tokenized words into a complete sentence.
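The dictionary-matching steps above can be sketched in a few lines of Python. This is only an illustration: the function name `remove_dictionary_words`, the whitespace tokenizer, and the three-word sample dictionary are assumptions, not the study's implementation or its 331-word dictionary.

```python
def remove_dictionary_words(text, dictionary_words):
    """Delete every token of `text` that also appears in `dictionary_words`."""
    # Case folding (stage 1), then tokenizing on whitespace (an assumed tokenizer).
    tokens = text.lower().split()
    dictionary = {word.lower() for word in dictionary_words}
    # Keep only the tokens that do not match a dictionary entry.
    kept = [token for token in tokens if token not in dictionary]
    # Recombine the remaining tokens into a single sentence.
    return " ".join(kept)


# Hypothetical sample dictionary, for illustration only.
economic_dictionary = ["inflasi", "pasar", "modal"]
cleaned = remove_dictionary_words("Pengaruh inflasi terhadap pasar modal", economic_dictionary)
print(cleaned)
```

Storing the dictionary as a set keeps each lookup constant-time, which matters more as the dictionary and the documents grow.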
The third stage was frequency-based stopword removal, which deleted words from the test documents based on term frequency. The important words related to economics had already been handled in the previous stage, so they were not included in the frequency-based stopword removal dictionary. The steps taken were:
• Counting the number of occurrences of each term in the training documents.
• Building the stopword removal dictionary based on the term frequencies in the training documents.
• Tokenizing the sentences in the test documents and the dictionary.
• Matching the frequency-based stopword removal dictionary with the terms contained in the test documents.
o If a term is in the frequency-based stopword removal dictionary, the term in the test document is deleted.
o Otherwise, the term is assumed to be an important word that will influence the classification process.
• Recombining the tokenized words into a complete sentence.
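A minimal sketch of the frequency-based stage follows, assuming that the documents have already passed through case folding and the economic-dictionary stage, and that a term enters the stopword dictionary when its total count in the training documents is below a chosen limit. The function names, the sample sentences, and the limit of 2 are hypothetical.

```python
from collections import Counter


def build_frequency_stopwords(training_docs, frequency_limit):
    """Collect the terms whose total count in the training documents is below the limit."""
    counts = Counter(token for doc in training_docs for token in doc.lower().split())
    return {term for term, count in counts.items() if count < frequency_limit}


def remove_stopwords(text, stopwords):
    """Delete the tokens of a test document that appear in the stopword dictionary."""
    kept = [token for token in text.lower().split() if token not in stopwords]
    return " ".join(kept)


# Tiny illustrative corpus; the limit of 2 is chosen only to fit the example.
training_docs = ["harga naik harga turun", "harga stabil"]
stopwords = build_frequency_stopwords(training_docs, frequency_limit=2)
print(remove_stopwords("harga naik tajam", stopwords))
```

Note that a term never seen in training ("tajam" here) is not in the dictionary and is therefore kept, which matches the rule that unmatched terms are treated as important.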

C. VSM Approach
At this stage, each document was represented by a vector using the VSM approach. The definition of the VSM approach can be seen in the article [16]. The function of VSM is to convert documents into numbers so that their weights can be calculated [17]. Each distinct term is represented by its weight in the document d [18]. With the VSM approach, the weight of each term in the training and test documents was calculated using the TF-IDF weighting method. The main ideas of TF-IDF can be seen in the literature [19]. Two elements were used to determine the TF-IDF value: TF and IDF. There were three (3) stages in determining the TF-IDF weight, namely:
• Calculation of TF (Term Frequency): this counted the frequency of term i in document j [20]. The formula for calculating TF can be seen in the literature [15].
• Calculation of IDF (Inverse Document Frequency): IDF reflects how a term is distributed across the documents; details can be seen in the literature [19].
• Calculation of TF-IDF: the TF-IDF value was obtained by combining the TF and IDF values. The TF-IDF weighting scheme can be seen in the literature [21] [22].
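Because the exact formulas are deferred to the cited literature, the sketch below assumes one common variant: TF as the raw count of a term in a document and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf_vectors` and the sample tokens are illustrative only, and the study may use a different variant.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return one {term: weight} dictionary per document (assumed raw TF x log(N/df))."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs at least once.
    document_frequency = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in document_frequency.items()}
    vectors = []
    for doc in tokenized_docs:
        term_frequency = Counter(doc)
        vectors.append({term: tf * idf[term] for term, tf in term_frequency.items()})
    return vectors


docs = [["harga", "naik"], ["harga", "turun", "turun"]]
for vector in tf_idf_vectors(docs):
    print(vector)
```

Under this variant a term that occurs in every document ("harga" here) receives a weight of zero, since log(N / N) = 0.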
Documents were then classified using the Cosine Similarity method, which is based on a vector space similarity measure. The similarity between two documents is computed from two vectors built from the keywords of each document [2]. The equation for calculating cosine similarity can be seen in the literature [23].
The output of the Cosine Similarity method is a similarity value with a range of zero to one. If the similarity value is closer to one, it means that the level of document similarity is high. Conversely, if the similarity value is close to zero, it means that the level of similarity between the two documents is low [24].
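As an illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to sparse {term: weight} dictionaries and assigns a test document the label of its most similar training document. That assignment rule, the function names, and the labels "X" and "Y" are assumptions, since the text does not spell out how the similarity scores are turned into a class.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty or all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(test_vector, training_vectors, training_labels):
    """Assumed rule: return the label and score of the most similar training document."""
    scores = [cosine_similarity(test_vector, vector) for vector in training_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return training_labels[best], scores[best]


# Hypothetical vectors and labels, for illustration only.
label, score = classify({"a": 2.0}, [{"a": 1.0}, {"b": 1.0}], ["X", "Y"])
print(label, score)
```

With non-negative weights such as TF-IDF, the score stays within the zero-to-one range described above.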

D. Testing Method
The Cosine Similarity method was tested in two (2) stages, namely:
• K-Fold Cross Validation
The definition of the K-Fold Cross Validation method can be seen in the article [25]. Cross validation works by dividing the data into k parts of almost equal size. In each repetition, one of the k parts was used as test data and the remainder was used as training data. The process was repeated k times until every part had served as test data [26] [27]. The output of this step was k estimates of the test error, which were then averaged to obtain the expected testing error [28].

• Confusion Matrix
The definition of the confusion matrix can be seen in the literature [29]. At this stage, the accuracy of the algorithm used to classify the data was tested using the confusion matrix. Testing used the following equations:
o Accuracy: the accuracy of the method was obtained by dividing the number of correctly classified documents by the total number of classified documents [11] [30].
o True Positive Rate (Recall): recall was calculated as the ratio of true positives to the sum of true positives and false negatives. The recall formula can be seen in the literature [15].

o Precision
Precision was calculated as the ratio of true positives to the sum of true positives and false positives. The precision formula can be seen in the literature [15].
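The two testing stages can be sketched as follows. The fold construction (shuffle, then deal the indices into k parts), the value k = 10, and the per-class reporting of precision and recall are assumptions made for illustration. The text does not state the value of k or how the per-class values for the four labels were combined.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, test) index lists; each of the k parts is the test set once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k parts of almost equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def evaluate(true_labels, predicted_labels):
    """Return accuracy and {label: (precision, recall)} from the confusion matrix."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    matrix = Counter(zip(true_labels, predicted_labels))  # (true, predicted) -> count
    accuracy = sum(matrix[(label, label)] for label in labels) / len(true_labels)
    per_class = {}
    for label in labels:
        tp = matrix[(label, label)]
        fp = sum(matrix[(other, label)] for other in labels if other != label)
        fn = sum(matrix[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[label] = (precision, recall)
    return accuracy, per_class


# 126 documents as in the dataset; k = 10 is an assumed value.
for train, test in k_fold_indices(126, 10):
    pass  # train on `train`, classify `test`, and collect the predictions here

# Hypothetical labels "A" and "B", for illustration only.
print(evaluate(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

Averaging the accuracy obtained on each of the k test parts gives the overall estimate described for the cross-validation stage.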

III. Result
Tests were run with several different frequency limits, and the number of words was compared before stopword removal, after stopword removal with the Tala dictionary, and after frequency-based stopword removal. Table 3 shows this comparison for a frequency limit of less than 60. Based on Table 3, the number of words before stopword removal was 56. After stopword removal with the Tala dictionary, 32 words remained, presumably because only basic words were deleted. After frequency-based stopword removal with the specified frequency limit, only 25 words remained, presumably because the deleted words were not just basic words. Besides the limit of less than 60, tests were also run with limits of less than 30, 40, 50, and 70. Table 4 compares the words remaining before and after stopword removal for each specified frequency. Based on Table 4, the number of remaining words differs for each frequency limit. This is presumably because the size of the dictionary varies with the frequency limit, which in turn affects the number of words remaining after matching against the frequency-based dictionary.
In addition to the number of remaining words, this study aims to determine accuracy, precision, and recall. Table 5 shows an example confusion matrix obtained by removing terms with a frequency of less than 60. Besides terms with a frequency < 60, terms with frequencies less than 30, 40, 50, and 70 were also deleted in separate tests. Table 6 compares the resulting accuracy, precision, and recall for each deleted frequency. Based on Table 6, the highest accuracy results from deleting terms with a frequency of less than 60, so this value is used as the threshold. Accuracy decreases when the limit is 30, 40, 50, or 70. We can assume that with a limit of < 30 too few terms are deleted, so accuracy decreases, whereas with a limit of < 70 too many terms are deleted, which also lowers accuracy.
These results can be compared with the accuracy, precision, and recall of the basic Cosine Similarity method. Table 7 compares the Cosine Similarity method with Tala dictionary stopword removal against the Cosine Similarity method combined with frequency-based stopword removal. Based on Table 7, accuracy, precision, and recall increase by 2.91%, 4.58%, and 0.74%, respectively. The increase in accuracy is still not significant. This is because too many words remain after the frequency-based stopword removal stage, and these can affect the document classification process.
In addition to accuracy, precision, and recall, the execution time of the classification process was compared. Table 8 compares the execution time of the Cosine Similarity method combined with Tala dictionary stopword removal (the basic method) against the Cosine Similarity method combined with frequency-based stopword removal. Based on Table 8, pre-processing is faster in the basic method, at 0.650 s, because it has fewer pre-processing steps. The Cosine Similarity method with frequency-based stopword removal requires 61.6266 s for pre-processing, because more stages are carried out. However, classification is faster in the combined method because fewer words need to be matched: it takes only 0.05033 s. Classification in the basic Cosine Similarity method takes longer, 0.791 s, because many more words need to be matched.

IV. Conclusion
This study concludes that adding frequency-based stopword removal can improve the performance of the Cosine Similarity algorithm. The study achieved an accuracy of 64.28%. Compared with previous research, which produced an accuracy of 62.70%, this is an increase of approximately 2% (1.58 percentage points). The classification process was also faster, with an execution time of 0.05033 s. However, the results are considered less than optimal because the term frequencies are not evenly distributed, which limits the gain in accuracy. Therefore, the researchers suggest adding stemming in future studies.