Analysis of the Similarity Level of Source Code in the Kotlin Programming Language using Winnowing Algorithm

ABSTRACT
Plagiarism is the act of imitating the work of others, directly or indirectly. In an academic environment, plagiarism applies not only to textual documents but also to source code documents. Source code plagiarism in academia usually occurs when students copy another student's code and submit it as their own work. An automatic plagiarism check is therefore needed, and the Winnowing algorithm is used here to detect similarities in source code as a way of detecting plagiarism. The Winnowing algorithm is usually used to detect plagiarism in text documents; in this research it is applied to source code. The results show that the degree of similarity between two source codes differs depending on whether or not the dataset has gone through the text preprocessing stage. If the dataset has been preprocessed, the similarity value is fairly low because the number of characters used is significantly reduced. The Winnowing and Jaccard similarity algorithms quickly detect plagiarism in source code and can be used to minimize plagiarism.

Introduction
Plagiarism is the act of imitating the work of others, directly or indirectly. In the academic environment, plagiarism usually occurs in textual documents such as essays, reports, and even research. However, plagiarism applies not only to textual records but also to source code documents. Source code plagiarism in academia usually occurs when students copy another student's code and submit it as if it were their own work [1].
According to the Kamus Besar Bahasa Indonesia (KBBI), plagiarism is taking someone else's writing (opinions and so on) and making it appear as if it were one's own. The Regulation of the Minister of National Education of the Republic of Indonesia No. 17 of 2010, Article 1, defines plagiarism as an act, intentional or unintentional, of obtaining or trying to obtain credit or value for scientific work by quoting part or all of the work or scientific works of other parties that are recognized as scientific works, without citing sources accurately and adequately. In computer science or informatics, source code is essential for programmers because it is central to building a system or application. In making applications, [...]
[...] the test set, and when compared to SVM, the model using XGBoost is better on the dataset used.
The research in [5] applied the Winnowing algorithm to detect plagiarism in programming source code, using ten student assignments and obtaining various similarity values by comparing two randomly chosen assignments at a time. It concluded that the parameter values k and w significantly affect the resulting similarity value: the greater the k and w values, the smaller the resulting similarity value, and the smaller the k and w values, the greater the resulting similarity value. A value of k = 2 was therefore used to increase the chances of finding similarities between the two samples being compared, considering that the Winnowing algorithm works by checking each piece of code tested.
The plagiarism detection research conducted by [9] used the Manber and Winnowing algorithms on abstract text documents. The dataset consists of student documents uploaded to the internet. In its implementation, the study embeds two approaches, Biword and Triword, in the Winnowing algorithm, while only Biword is used for the Manber algorithm. Testing with ten documents showed an average similarity of 90.56% for the Manber algorithm and 94% for the Winnowing algorithm with the Biword approach, while the Winnowing algorithm with the Triword approach produced 91.22%. The Winnowing algorithm with the Biword approach therefore performed better than both the Manber algorithm and the Winnowing algorithm with the Triword approach.
Another study, conducted by [10], was motivated by the fact that manually checking source code for plagiarism is a repetitive, complex, and time-consuming task, so high-quality automated detection of source code plagiarism is needed. The dataset consists of Java source code collected from programming classes at Petra Christian University. The study uses three main algorithms, Levenshtein distance, greedy string tiling, and bigram, producing 12 features and statistical features. In the final step, the features are used for training and inference with the XGBoost model. The test results show that the proposed features and preprocessing achieve better performance metrics than previous research, namely an F1-score of 99%. Applying the preprocessing also improves the performance metrics obtained with the features proposed in previous studies.
Winnowing is an algorithm based on a hashing approach that applies a hash function and window formation to obtain fingerprints when matching patterns. The work in [11] noted that a word-based (word-level) application of this algorithm still needed to be done, so that study aimed to measure the level of text similarity using the Winnowing algorithm and word-level trigrams. The results showed that the Winnowing algorithm applied with word-level trigrams could detect similarities in the text of 76.84%, 52.29%, 37.40%, and 19.29%. From these results, the pattern-matching method with the Winnowing algorithm and word-level trigrams can be used to measure the level of text similarity.
The research conducted by [12] determined the best parameters for the Winnowing algorithm based on the smallest difference between parameters (differences in the similarity results depending on the parameters) and on the highest similarity value, measured with both the Dice coefficient and the Jaccard coefficient. The best parameters found were hash = 5, k-gram = 2, and window = 7. Testing these parameters indicates that a higher k-gram value affects the resulting similarity value. If the hash or window value is larger, the change in the similarity result is not large, but it is significant enough to affect the differences between parameters. The similarity level measured with Jaccard similarity in the Winnowing algorithm is lower than with the Dice coefficient, with a difference between the two of 2.554683%.
Based on the research above, this study conducts experiments in detecting plagiarism. The Winnowing algorithm is applied with and without Kotlin grammar preprocessing and the results are compared, with the aim of producing a model that is more robust against various plagiarism attacks; testing uses Jaccard similarity. In this research, the Winnowing algorithm is used to find similarities in source code as a way of detecting plagiarism. Jaccard similarity requires a document fingerprint, and the algorithm used to produce it is the Winnowing algorithm. This is in line with [7], which states that Jaccard similarity is usually used to compare documents and calculate the similarity value of two objects or documents.

Method
This research was conducted to detect plagiarism in source code using the Winnowing Algorithm.
The research stages are shown in Figure 1.

Dataset
The dataset was obtained from GitHub and consists of Kotlin source code written on the same theme. Source code was then selected manually according to whether or not it is close to plagiarism. The data is then processed with text preprocessing, using stemming and tokenizing. Two types of dataset are used: a dataset that has been processed with text preprocessing and a dataset without preprocessing. Both are then processed with the Winnowing algorithm.

Pre-processing
Stemming removes parts that tend to be static and are the same for all code, such as package declarations, class declarations, and main function declarations. This is done so that these common parts do not affect the assessment of whether a source code is plagiarized. Tokenizing groups each token from the source code into several categories: whitespace, comments, strings, operators, keywords, functions, variables, and numbers. The tokenizer used in this study comes from the Pygments library, which is used to highlight the syntax of particular programming languages. Pygments includes a lexer called KotlinLexer, which generates tokens from Kotlin code using the Kotlin grammar. The grouping of token categories is taken from the Kotlin Programming Language specification. The k-grams are then formed and their values are calculated using the rolling hash.
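The two preprocessing steps can be illustrated with a short Python sketch. It is not the study's implementation: to stay within the standard library it uses a small hand-written regular-expression tokenizer as a simplified stand-in for Pygments' KotlinLexer, and the names `stem`, `tokenize`, `KEYWORDS`, the keyword subset, and the rule for telling functions from variables are all assumptions made for this example.

```python
import re

# Hypothetical subset of Kotlin keywords, for illustration only.
KEYWORDS = {"fun", "val", "var", "if", "else", "for", "while", "return", "class"}

# Ordered (category, pattern) pairs; earlier patterns take precedence.
TOKEN_SPEC = [
    ("comment",    r"//[^\n]*|/\*.*?\*/"),
    ("string",     r'"(?:\\.|[^"\\])*"'),
    ("number",     r"\d+(?:\.\d+)?"),
    ("name",       r"[A-Za-z_]\w*"),
    ("whitespace", r"\s+"),
    ("operator",   r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

# Line prefixes treated as common boilerplate (a crude assumption).
BOILERPLATE_PREFIXES = ("package ", "class ", "fun main")


def stem(source):
    """Drop whole lines that declare the package, a class, or the main function."""
    kept = [ln for ln in source.splitlines()
            if not ln.strip().startswith(BOILERPLATE_PREFIXES)]
    return "\n".join(kept)


def tokenize(source):
    """Return (category, text) pairs, skipping whitespace and comments."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        category, text = m.lastgroup, m.group()
        if category in ("whitespace", "comment"):
            continue
        if category == "name":
            if text in KEYWORDS:
                category = "keyword"
            else:
                # Assumption: a name directly followed by "(" is a function.
                rest = source[m.end():].lstrip()
                category = "function" if rest.startswith("(") else "variable"
        tokens.append((category, text))
    return tokens
```

Calling `tokenize(stem(code))` on a Kotlin file yields a list of categorized tokens that could then be turned into the string from which k-grams are formed. A real lexer such as KotlinLexer distinguishes far more cases than this sketch does.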

Proposed Work
The process begins with collecting data from GitHub. The data collected is Kotlin source code on the theme "Creating a Github User App application", taking one of the files with the .kt extension from each project.
The data is then processed: source code is selected manually according to whether or not it is close to plagiarism, and the data then goes through text preprocessing using stemming and tokenizing. Stemming removes parts that tend to be static and are the same for all code, such as package declarations, class declarations, and main function declarations. This is done so that these common parts do not affect the assessment of whether a source code is plagiarized. Tokenizing groups each token from the source code into several categories.
After processing the data, the data is used to detect plagiarism in the source code using the Winnowing algorithm.
Here is how the Winnowing algorithm works [13]:
1. Build a k-gram series from the text.
2. Apply a hash function to each gram. Equation (1) is the calculation of the hash function of the Winnowing algorithm.
3. Create sets called windows, each consisting of w hash values. If w = 6, then one window contains 6 hash values.
4. Select the fingerprint from the hashing results by dividing the hash values into windows of size w and then selecting the smallest hash value from each window.
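The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: since Equation (1) is not reproduced here, the polynomial form of the hash and the base value of 10 are assumptions, the hash is recomputed for each gram rather than updated in a rolling fashion, and the function names are hypothetical.

```python
def kgrams(text, k):
    """Step 1: every contiguous substring of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def hash_kgram(gram, base=10):
    """Step 2: polynomial hash c1*b^(k-1) + ... + ck*b^0 (base is an assumed value)."""
    h = 0
    for ch in gram:
        h = h * base + ord(ch)
    return h


def windows(hashes, w):
    """Step 3: every run of w consecutive hash values."""
    return [hashes[i:i + w] for i in range(len(hashes) - w + 1)]


def fingerprint(text, k, w):
    """Step 4: the set of minimum hash values, one taken from each window."""
    hashes = [hash_kgram(g) for g in kgrams(text, k)]
    return {min(win) for win in windows(hashes, w)}
```

For example, `fingerprint("abcd", k=2, w=2)` hashes the grams `ab`, `bc`, and `cd`, forms two windows, and keeps the smaller value of each. In this sketch, a text with fewer than k characters, or with fewer than w k-grams, produces an empty fingerprint.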
The fingerprints obtained from the two source codes are then compared using Jaccard similarity.

Evaluation
Testing is done by calculating similarity using Jaccard similarity. Two experiments were conducted: fingerprints obtained with preprocessing and fingerprints obtained without preprocessing. The fingerprints are compared using the Jaccard similarity equation. The Jaccard similarity value is obtained from the size of the intersection divided by the size of the union of the two sets. Jaccard distance is a measure of dissimilarity between data sets. It is the complement of the Jaccard coefficient, obtained by subtracting the Jaccard similarity from 1. The advantage of Jaccard similarity is that it counts the number of terms that are the same in each sentence and compares it to the total number of terms in both sentences. Jaccard similarity can be formulated as follows [13]:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$
Details: A and B are the two sets being compared, here the fingerprint sets of the two source codes.
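As a minimal sketch of the Jaccard calculation described above, the following Python functions compute the similarity and distance of two fingerprint sets. The function names and the handling of two empty sets are assumptions made for this example.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Dissimilarity: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

With fingerprints `{1, 2, 3}` and `{2, 3, 4}`, two values are shared out of four distinct values, so the similarity is 0.5 and the distance is 0.5.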

Results and Discussion
The trial results of the plagiarism detector, using Jaccard similarity (the Jaccard coefficient) on the output of the Winnowing algorithm, are shown in Table 1 for different datasets. Three kinds of dataset are used: identical source code, which has exactly the same program code; nearly identical source code, which differs in its function calls and variable initialization; and source code with large differences, which has different calling activities.
The stages of implementing the Winnowing algorithm are described in the following test.

Source code 1
Source code 2

Preprocessing
In this process, stemming and tokenizing are carried out to eliminate common parts so that they do not affect the assessment of whether a source code is plagiarized. Figure 3.1 shows the source code for text preprocessing.

Fingerprint selection of each window
In this step, the smallest hash value is taken from each window in the window series; these values form the fingerprint. The following is the result of fingerprint selection for source codes 1 and 2. The source code for selecting the fingerprint of each window is shown in Figure 3.5.
Table 1 shows the effect of the k-gram and w-gram values on the similarity results. The tests on two samples or documents in Table 1 use k-gram = 4 and w-gram = 3. The value k = 4 was chosen because the greater the value of k, the greater the similarity of values obtained. The use of preprocessing greatly affects the similarity value. In Table 1, identical source codes have a similarity value of 100%, which indicates that the proposed method works: because there is no difference, the k-grams are the same and therefore produce the same hash values. In Table 1, the similarity value with preprocessing is lower than without preprocessing. This can happen because preprocessed data contains fewer characters. Tests number 1, 2, and 3 in Table 1 produce high similarity, with the later tests lower than test number 1. The k-gram and w-gram values also influence this, because the smaller the k-gram and w-gram values, the more often the pieces are compared and the more often matches are found. Conversely, the larger the values, the less often matches are found.
The applied preprocessing (removing some standard components and tokenizing with the grammar) has the benefit of improving the performance metrics for fingerprint calculation, both in this study and in [8]. Tokenizing with the grammar speeds up the fingerprint calculation because fewer characters are processed. Without preprocessing, frequently repeated parts or templates in the code have a more dominant effect than the changes made. The use of stemming affects the accuracy of the resulting similarity value. Using stemming produces less favorable scores than not using stemming [3].
The k-gram and window sizes of the Winnowing algorithm are parameters that can be changed. The higher the k-gram value, the lower the resulting similarity value, and the lower the k-gram value, the higher the similarity value.
However, a low k-gram value does not necessarily give an accurate result. The smaller the k-gram value, the fewer characters there are in each piece to match, and the more often those characters are found in the text. The following is an example of the results of source code detection; the author adds a threshold variable that sets the minimum similarity value at which a source code is flagged as plagiarism.
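The role of the threshold variable can be sketched as below. The threshold value used in the study is not stated in this section, so the default of 0.7, the function name, and the choice of an inclusive comparison are purely hypothetical.

```python
def is_suspected_plagiarism(similarity, threshold=0.7):
    """Flag a pair of source codes when similarity reaches the threshold.

    `similarity` is a Jaccard similarity in the range 0.0 to 1.0.
    The default threshold is an illustrative assumption, not a value from the study.
    """
    return similarity >= threshold
```

A pair with similarity 0.85 would be flagged under this default, while a pair with similarity 0.40 would not. Raising the threshold reduces the number of flagged pairs at the cost of missing more heavily modified copies.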

Conclusion
The research results above show that text preprocessing greatly influences the similarity results. The degree of similarity between two source codes differs depending on whether or not the dataset has gone through the text preprocessing stage. If the dataset has been preprocessed, the similarity value is lower because the number of characters used is significantly reduced. When the two are compared, the similarity value for the dataset without preprocessing is higher than for the dataset with preprocessing.

Figure 3.1. Source code for the text preprocessing process
The results after preprocessing are as below: Source code 1

Figure 3.2. Source code for establishing the k-grams
Yustikamasy A et al. (Analysis of the Similarity Level of Source Code in the Kotlin Programming Language using Winnowing Algorithm)

Table 1. Plagiarism Detection Trial Using Different Datasets Against Similarity Results with the Jaccard Similarity Method