Comparison Analysis of K-Nearest Neighbor and Naïve Bayes in Determining Talent of Adolescence



I. Introduction
Adolescence is the transition period from childhood to adulthood. Childlike traits are still present and mature judgment has not yet fully formed, so adolescents are still searching for an identity that will shape their personality and character. During this period the child reaches physical maturity, which is expected to be accompanied by emotional maturity and social development. The interplay of commitment, in-depth exploration, and reconsideration of commitments across different identity statuses plays a role in their search for identity. The extent to which adolescents find a stable identity is closely related to their psychosocial functioning and well-being. This period lasts from around 12 to 20 years of age [1]. Teenagers ask, "Who am I, and what is my role?" Finding an identity means knowing one's personal needs and the goals to be achieved in life, so developing adolescents' interests and talents becomes an important issue. In developing their competencies, adolescents still need guidance from parents and from the home and school environment. Parental guidance can help adolescents explore their potential so that they can make the most of their intelligence. In explaining intelligence, Gardner uses the word talent. Gardner proposed that each person has 8 different types of intelligence: linguistic, visual-spatial, kinesthetic, musical, intrapersonal, interpersonal, logical-mathematical, and naturalistic.
The interests and talents of children can be determined with the help of experts, namely child psychologists. However, parents are still reluctant to discuss their child with a psychologist because they think they can handle the matter themselves. In addition, the high cost of consulting a psychologist is an economic barrier, and the number of child psychologists has not kept pace with rapid population growth. With the development of technology, interest and talent tests can be supported by technological tools that follow the rules of psychology.

A. Data Collection
The training data used in this study were taken from 350 existing records, and the testing data were taken from the results of questionnaires given to 148 children aged 10-18. The training data and testing data have the same number of attributes, obtained from a psychological questionnaire. All datasets are filtered to obtain the 17 relevant attributes. Of these 17 attributes, 8 are used as input to the classification. The target of the classification is the interest and talent attribute. The data attributes used in this study can be seen in Table 1.

B. Classification
The classification algorithms used in this paper are Naïve Bayes, a simple probabilistic data mining method, and K-Nearest Neighbor, which classifies by analogy: each testing record is compared with the training records that are closest and most similar to it.
a) Application of K-Nearest Neighbor
K-Nearest Neighbor Algorithm:
i) Determine the parameter K (number of closest neighbors). In this study, K = 5.
ii) Calculate the square of the Euclidean distance between the query instance and each object in the sample data provided. Euclidean formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x$ is the query instance, $y$ is a training object, and $n$ is the number of attributes. Squaring the distance leaves the ranking of neighbors unchanged.
iii) Sort the objects by distance and select the K objects with the smallest Euclidean distance.
iv) Collect the category Y of each of these nearest neighbors (Nearest Neighbor Classification).
v) Use the majority category among the nearest neighbors as the predicted value for the query instance.
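The K-Nearest Neighbor steps above can be sketched in a few lines of Python. This is a minimal illustration and not the RapidMiner operator used in the study: the function names, the two-attribute toy records, and the talent labels are all hypothetical, and only K = 5 and the Euclidean distance come from the text.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest records."""
    # Step ii: distance from the query instance to every training record.
    distances = [(euclidean_distance(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Step iii: sort by ascending distance and keep the k nearest records.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps iv-v: collect the neighbor categories and return the majority one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric attributes and two made-up labels.
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
train_y = ["musical", "musical", "musical", "musical",
           "linguistic", "linguistic", "linguistic", "linguistic"]

prediction = knn_predict(train_X, train_y, query=(2, 3), k=5)
print(prediction)
```

With an odd K such as 5 and two categories, the vote cannot tie; with more categories a tie is possible, and this sketch then simply returns whichever tied category `Counter` lists first.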

b) Application of Naive Bayes
The flow of the Naive Bayes method is as follows:
i) Read the training data.
ii) Calculate the counts and probabilities. For numerical data, the mean and standard deviation of each attribute are used; otherwise, the probability value is found by dividing the number of matching records in a category by the total number of records in that category.
iii) Obtain the values in the mean, standard deviation, and probability tables.
[Table: Interest and Talent Criteria and Probabilities]
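The Naive Bayes flow above can be illustrated with a small Gaussian variant, which is the usual way to turn a per-category mean and standard deviation into a probability. This is a sketch under assumptions: the text does not say which density RapidMiner applied, and the function names, toy records, and labels below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def fit_gaussian_nb(train_X, train_y):
    """Return {label: (prior, [(mean, std), ...])} with one pair per attribute."""
    by_class = defaultdict(list)
    for row, label in zip(train_X, train_y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        # Prior: records in this category divided by all records.
        prior = len(rows) / len(train_X)
        # Mean and standard deviation per attribute column; the small floor
        # avoids division by zero when an attribute is constant in a category.
        stats = [(mean(col), max(pstdev(col), 1e-6)) for col in zip(*rows)]
        model[label] = (prior, stats)
    return model


def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def predict_gaussian_nb(model, query):
    """Return the label with the highest log prior plus log likelihood."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            gaussian_log_pdf(x, mu, sigma) for x, (mu, sigma) in zip(query, stats))
    return max(scores, key=scores.get)


# Hypothetical toy data: two numeric attributes and two made-up labels.
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
train_y = ["musical", "musical", "musical", "musical",
           "linguistic", "linguistic", "linguistic", "linguistic"]

model = fit_gaussian_nb(train_X, train_y)
nb_prediction = predict_gaussian_nb(model, (2, 3))
print(nb_prediction)
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow when there are many attributes.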
In the testing phase, several experimental scenarios are conducted to determine which classification model, Naïve Bayes or K-Nearest Neighbor, is more accurate in determining children's interests and talents. The evaluation measure is accuracy, and the experimental scenarios are validated using RapidMiner software; they are discussed further in the results and discussion.
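For reference, accuracy is the share of records whose predicted category matches the actual one, that is, the diagonal of the confusion matrix divided by the total number of records. The sketch below shows that calculation in plain Python; it is not RapidMiner output, and the labels and function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    matrix = confusion_matrix(actual, predicted)
    # Diagonal cells of the confusion matrix are the correct predictions.
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / len(actual)


# Hypothetical labels for five records.
actual = ["musical", "musical", "linguistic", "linguistic", "kinesthetic"]
predicted = ["musical", "linguistic", "linguistic", "linguistic", "musical"]

acc = accuracy(actual, predicted)
print(f"accuracy = {acc:.2%}")
```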

III. Result
The data analyzed consist of training data and testing data. The training data were obtained from a psychologist: they began as questionnaires that the psychologist had already processed, so the interests and talents of each child were known. The testing data are a new data set obtained from questionnaires filled in by children aged 10-18 years.
The data are analyzed with the K-Nearest Neighbor and Naive Bayes algorithms in a comparative analysis of how well each determines children's interests and talents. Following the stages of each algorithm, the interests and talents of the children in the questionnaire testing data are predicted using the data previously obtained from psychologists. The comparison is based on the accuracy value of each model, which shows which of the two methods is more accurate in determining children's interests and talents. The following are the results of testing with RapidMiner on the training data, the testing data, and the combination of training and testing data.
A. Data Accuracy Testing with K-Nearest Neighbor
a) Testing training data
Accuracy testing was performed on the training data to obtain an accuracy value. The result can be seen in the table as a confusion matrix produced by RapidMiner software. The training data were obtained previously from expert psychologists, so the interests and talents of the children are known. The accuracy obtained by RapidMiner with the K-Nearest Neighbor model on the training data is 42.22%. Based on the ROC curve, this accuracy falls in the "failure" diagnostic level.
b) Testing testing data
Accuracy testing was performed on the testing data to obtain an accuracy value. The result can be seen in the table as a confusion matrix produced by RapidMiner software. The testing data were obtained from the results of a questionnaire filled in by children aged 10-18 years. The accuracy obtained by RapidMiner with the K-Nearest Neighbor model on the testing data is 43.33%. Based on the ROC curve, this accuracy falls in the "poor classification" diagnostic level. However, for computer-based systems, an accuracy value of <60% is acceptable.

c) Testing training data and testing data
Accuracy testing was performed on the combined training and testing data to obtain an accuracy value. The combined data set amounted to 400 records, consisting of 300 training records and 100 testing records. It combines the data already held by psychologists with the results of questionnaires filled in by children aged 10-18 years. The table shows the confusion matrix produced by RapidMiner software. The accuracy obtained by RapidMiner with the K-Nearest Neighbor model on the combined training and testing data is 55.00%. Based on the ROC curve, this accuracy falls in the "failure" diagnostic level.

B. Data Accuracy Testing with Naïve Bayes
a) Testing training data
Accuracy testing was performed on the training data to obtain an accuracy value. The result can be seen in the table as a confusion matrix produced by RapidMiner software. The training data were obtained previously from expert psychologists, so the interests and talents of the children are known.
[Table 7: Training table, testing table, and distance table]
The accuracy obtained by RapidMiner with the Naive Bayes model on the training data is 46.67%. Based on the ROC curve, this accuracy falls in the "failure" diagnostic level.
b) Testing testing data
Accuracy testing was performed on the testing data to obtain an accuracy value. The result can be seen in the table as a confusion matrix produced by RapidMiner software. The testing data were obtained from the results of a questionnaire filled in by children aged 10-18 years. The accuracy obtained by RapidMiner with the Naive Bayes model on the testing data is 40.00%. Based on the ROC curve, this accuracy falls in the "poor classification" diagnostic level.
c) Testing training data and testing data
Accuracy testing was performed on the combined training and testing data to obtain an accuracy value. The combined data set amounted to 400 records, consisting of 300 training records and 100 testing records. It combines the data already held by psychologists with the results of questionnaires filled in by children aged 10-18 years. The following table shows the confusion matrix produced by RapidMiner software. The accuracy obtained by RapidMiner with the Naive Bayes model on the combined training and testing data is 51.67%. Based on the ROC curve, this accuracy falls in the "failure" diagnostic level.
After the testing and evaluation process, the more accurate model for measuring children's interests and talents is identified. The models are evaluated on the training data and on the testing data.
The accuracy values obtained with RapidMiner for the K-Nearest Neighbor and Naive Bayes models on the training data, the testing data, and the combined training and testing data are compared in Table 10. On the training data, Naive Bayes scores higher than K-Nearest Neighbor, at 46.67%. On the testing data, K-Nearest Neighbor scores higher than Naive Bayes, at 43.33%. On the combined training and testing data, K-Nearest Neighbor again scores higher than Naive Bayes, at 55.00%. However, the accuracy used for a system is the accuracy on the testing data.
From Table 10, the testing-data accuracy of the K-Nearest Neighbor algorithm (43.33%) is higher than that of the Naive Bayes algorithm (40.00%), and the combined training and testing accuracy of K-Nearest Neighbor (55.00%) is higher than that of Naive Bayes (51.67%). It was therefore concluded that, for this study, the K-Nearest Neighbor algorithm is more accurate than the Naive Bayes algorithm for classifying children's interests and talents.
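The comparison can be reproduced from the accuracy values reported in the text. The short sketch below only tabulates those six figures and picks the higher one per scenario; the dictionary layout and the helper name `best_model` are illustrative choices, not part of the study.

```python
# Accuracy values (%) as reported in the text for each evaluation scenario.
reported = {
    "training data": {"K-Nearest Neighbor": 42.22, "Naive Bayes": 46.67},
    "testing data": {"K-Nearest Neighbor": 43.33, "Naive Bayes": 40.00},
    "training & testing data": {"K-Nearest Neighbor": 55.00, "Naive Bayes": 51.67},
}


def best_model(scores):
    """Return the model name with the highest accuracy in one scenario."""
    return max(scores, key=scores.get)


winners = {scenario: best_model(scores) for scenario, scores in reported.items()}

for scenario, scores in reported.items():
    knn = scores["K-Nearest Neighbor"]
    nb = scores["Naive Bayes"]
    print(f"{scenario:<24} K-NN {knn:5.2f}%  NB {nb:5.2f}%  higher: {winners[scenario]}")
```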

IV. Conclusion
This study determined the interests and talents of children aged 10-18 years from testing data consisting of 100 records, using the K-Nearest Neighbor and Naive Bayes algorithms with pre-existing training data as the reference, and identified the more accurate algorithm. The accuracy of each algorithm was read from the confusion matrix produced by RapidMiner software on the training data, the testing data, and the combination of training and testing data. This study concluded that the K-Nearest Neighbor algorithm is better than Naive Bayes in terms of classification accuracy.
Suggestions for further research on the comparative analysis of K-NN and Naive Bayes in determining children's interests and talents are as follows. This research can be extended by comparing against other classification algorithms, so that the best algorithm for determining children's interests and talents can be identified. It can also be extended to determine the interests and talents of children in early childhood.