Journal Classification Using Cosine Similarity Method on Title and Abstract with Frequency-Based Stopword Removal

Piska Dwi Nurfadila; Aji Prasetya Wibawa; Ilham Ari Elbaith Zaeni; Andrew Nafalski

doi:10.29099/ijair.v3i2.99



(1) Piska Dwi Nurfadila
(Electrical Engineering Department, Universitas Negeri Malang, Indonesia)

(2)* Aji Prasetya Wibawa
(Scopus ID: 56012410400; Dept. of Electrical Engineering, State University of Malang, Malang, Indonesia)

(3) Ilham Ari Elbaith Zaeni
(Electrical Engineering Department, Universitas Negeri Malang, Indonesia)

(4) Andrew Nafalski
(University of South Australia, Australia)

* Corresponding author

Abstract

Economic journal articles have previously been classified using the Vector Space Model (VSM) approach and the Cosine Similarity method. The results of those earlier studies are considered suboptimal because Stopword Removal relied on a dictionary of basic words (tuning), so the removed words were limited to basic words only. This study shows that frequency-based Stopword Removal improves the accuracy of the Cosine Similarity method. The rationale is that a term occurring with a certain frequency is assumed to be insignificant and, if kept, would make the results less relevant. The performance of the Cosine Similarity method with frequency-based Stopword Removal was tested using K-fold Cross Validation. The method achieved an accuracy of 64.28%, a precision of 64.76%, and a recall of 65.26%. The execution time after pre-processing was 0.05033 seconds.
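The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the document-frequency threshold `max_doc_ratio`, the use of raw term counts instead of a weighted scheme, the nearest-document labeling rule, and the toy documents and labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def frequency_stopwords(docs, max_doc_ratio=0.8):
    """Return terms that occur in more than max_doc_ratio of the documents.

    The threshold is a hypothetical choice; the abstract only states that
    terms with "a certain frequency" are treated as insignificant.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_ratio * len(docs)
    return {term for term, df in doc_freq.items() if df > limit}


def term_vector(tokens, stopwords):
    """Build a raw term-count vector after removing stopwords."""
    return Counter(t for t in tokens if t not in stopwords)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query_tokens, labelled_docs, stopwords):
    """Assign the label of the most similar labelled document (assumed rule)."""
    query = term_vector(query_tokens, stopwords)
    best_label, best_score = None, -1.0
    for tokens, label in labelled_docs:
        score = cosine_similarity(query, term_vector(tokens, stopwords))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy corpus; the texts and the labels "macro" and "labor" are invented.
train = [
    ("the study of inflation and the monetary policy".split(), "macro"),
    ("the analysis of the labor market and wages".split(), "labor"),
    ("the effect of interest rates on the inflation".split(), "macro"),
    ("the wages of workers in the labor union".split(), "labor"),
]
stopwords = frequency_stopwords([tokens for tokens, _ in train], max_doc_ratio=0.6)
label, score = classify("the inflation of the monetary base".split(), train, stopwords)
print(sorted(stopwords), label, round(score, 3))
```

In this toy corpus only "the" and "of" appear in more than 60% of the documents, so they are dropped before the vectors are compared. A dictionary-based list would remove only the words it already contains, whereas the frequency rule adapts to whatever terms dominate the collection.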




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


The International Journal of Artificial Intelligence Research

Organized by: Prodi Teknik Informatika Fakultas Teknologi Bisnis dan Sains
Published by: Universitas Dharma Wacana
Jl. Kenanga No. 03 Mulyojati 16C Metro Barat Kota Metro Lampung

Email: jurnal.ijair@gmail.com
