Journal Classification Using Cosine Similarity Method on Title and Abstract with Frequency-Based Stopword Removal 

Piska Dwi Nurfadila(1*), Aji Prasetya Wibawa(2), Ilham Ari Elbaith Zaeni(3), Andrew Nafalski(4)


(1) Electrical Engineering Department, Universitas Negeri Malang
(2) Scopus ID : 56012410400; Dept Electrical Engineering, State University of Malang, Malang
(3) Electrical Engineering Department, Universitas Negeri Malang
(4) University of South Australia
(*) Corresponding Author

Abstract


Economic journal articles have previously been classified using the Vector Space Model (VSM) approach and the Cosine Similarity method. The results of those studies are considered less than optimal because Stopword Removal was carried out using a dictionary of basic words (tuning), so the omitted words were limited to basic words only. This study shows that frequency-based Stopword Removal improves the accuracy of the Cosine Similarity method. The rationale is that a term with a certain frequency is assumed to be insignificant and to give less relevant results. The performance of the Cosine Similarity method with frequency-based Stopword Removal was tested using K-fold Cross Validation. The method achieved an accuracy of 64.28%, a precision of 64.76%, and a recall of 65.26%. The execution time after pre-processing was 0.05033 seconds.
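
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the frequency criterion, the term weighting, or the number of folds, so the sketch makes these assumptions: a term is treated as a stopword when it occurs in more than a chosen share of the training documents (`max_doc_ratio`, a hypothetical parameter), vectors hold raw term counts, and a document receives the label of its most similar training document. All function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def frequency_stopwords(docs, max_doc_ratio=0.8):
    """Return terms found in more than max_doc_ratio of the documents.

    The threshold is an assumption; the abstract only says that terms
    "with a certain frequency" are treated as insignificant.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t for t, c in doc_freq.items() if c / len(docs) > max_doc_ratio}


def tf_vector(text, stopwords):
    """Raw term-count vector of a text with stopwords removed."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labeled_docs, stopwords):
    """Assign the label of the most similar training document."""
    query = tf_vector(text, stopwords)
    scored = ((cosine(query, tf_vector(doc, stopwords)), label)
              for doc, label in labeled_docs)
    return max(scored, key=lambda pair: pair[0])[1]


def k_fold_accuracy(labeled_docs, k, max_doc_ratio=0.8):
    """Accuracy averaged over k folds; stopwords are rebuilt per training split."""
    folds = [labeled_docs[i::k] for i in range(k)]
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        stopwords = frequency_stopwords([d for d, _ in train], max_doc_ratio)
        correct += sum(classify(d, train, stopwords) == label
                       for d, label in test_fold)
    return correct / len(labeled_docs)


# Toy data, invented for illustration only.
train = [
    ("the study of inflation and monetary policy", "macro"),
    ("the study of consumer demand and market price", "micro"),
    ("the study of unemployment and fiscal policy", "macro"),
    ("the study of firm cost and market competition", "micro"),
]
stopwords = frequency_stopwords([doc for doc, _ in train])
label = classify("the effect of monetary policy and inflation", train, stopwords)
```

Deriving the stopword list from the training split inside each fold keeps the test documents from influencing preprocessing. Whether the study did this is not stated in the abstract.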


DOI: https://doi.org/10.29099/ijair.v3i2.99


International Journal Of Artificial Intelligence Research

Organized by: Departemen Teknik Informatika STMIK Dharma Wacana
Published by: STMIK Dharma Wacana

IJAIR is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.