Performance Improvement of Random Forest Algorithm for Malware Detection on Imbalanced Dataset using Random Under-Sampling Method

Fauzi Adi Rafrastara; Catur Supriyanto; Cinantya Paramita; Yani Parti Astuti; Foez Ahmed

doi:10.30591/jpit.v8i2.5207

Abstract

Handling an imbalanced dataset poses its own challenges. An inappropriate step during pre-processing of imbalanced data can degrade the prediction results. The accuracy score may look high while recall and specificity suffer, because the predictions are dominated by the majority class. In malware detection, false negatives are critical, since a missed malware sample can be fatal. Prediction errors, especially false negatives, must therefore be minimized. The first step in handling an imbalanced dataset in this situation is to balance the classes. One popular balancing method, Random Under-Sampling (RUS), is used in this study. Random Forest is then applied to classify each file as goodware or malware. Three evaluation metrics are used to assess the model: classification accuracy, recall, and specificity. Finally, the performance of Random Forest is compared with three other methods, namely kNN, Naïve Bayes, and Logistic Regression. The results show that Random Forest achieved the best performance among the evaluated methods, with scores of 98.1% for accuracy, 98.0% for recall, and 98.2% for specificity.
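The two ideas in the abstract, balancing classes by random under-sampling and judging a classifier by recall and specificity rather than accuracy alone, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `random_under_sample` and `evaluate`, the toy labels, and the choice of malware as the positive class (label 1) are assumptions made for illustration, and no Random Forest is trained here.

```python
import random
from collections import Counter


def random_under_sample(samples, labels, seed=0):
    """Randomly discard samples from larger classes until every class
    has as many samples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws `target` items without replacement
        balanced.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, recall, and specificity from confusion-matrix counts.
    Assumes binary labels with `positive` marking malware."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy data (assumed): 90 goodware (0) and 10 malware (1) samples.
labels = [0] * 90 + [1] * 10
samples = list(range(100))

# A classifier that always predicts the majority class looks accurate
# but misses every malware sample.
majority_only = [0] * 100
print(evaluate(labels, majority_only))
# {'accuracy': 0.9, 'recall': 0.0, 'specificity': 1.0}

# After under-sampling, both classes have 10 samples each.
_, balanced_labels = random_under_sample(samples, labels, seed=1)
print(Counter(balanced_labels))
```

The majority-only predictor shows why accuracy by itself can mislead on imbalanced data: 90% accuracy coexists with zero recall, meaning every malware sample would be a false negative.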

Keywords

Random forest, imbalanced dataset, random under-sampling, malware, classification.
