Generative AI vs SMOTE: A Case Study of Text Data Balancing in Sentiment Analysis

Dyah Sulistyowati Rahayu; Iman Paryudi; Erin Divayaning; Afni Puspita Zahra; Arsya Yan Duribta

doi:10.30591/jpit.v11i2.10157

Abstract

Imbalanced data remains a major challenge in sentiment analysis: positive reviews often dominate, which biases classifiers toward the majority class and weakens recognition of minority classes. This study addresses the imbalance by using a large language model (LLM) to generate synthetic negative reviews and compares the results with the traditional SMOTE method. Data are collected through web scraping and preprocessed with standard text-cleaning steps: tokenization, stopword removal, and stemming. The LLM then produces additional negative samples, while SMOTE serves as the baseline. Classification uses a Support Vector Machine (SVM) on TF-IDF features, and performance is evaluated with accuracy, precision, recall, and F1-score. The findings show that the LLM-generated synthetic data are highly similar to the original distribution, as confirmed by the Kolmogorov-Smirnov test and the Wasserstein distance. Furthermore, the SVM trained on LLM-augmented data achieved higher accuracy and more balanced performance than the SVM trained on SMOTE-augmented data, particularly on minority classes. In conclusion, LLM-based augmentation provides a more effective and more natural approach to balancing text data in sentiment analysis and offers a significant improvement in classification quality. Future research may explore integrating LLMs with other generative models to extend the approach to numerical and multimodal datasets.
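
To make the SMOTE baseline concrete, the following is a minimal sketch of its core interpolation step in plain Python. It is not the authors' implementation: the function name `smote_sample`, the parameter defaults, and the toy TF-IDF-like vectors are assumptions made purely for illustration. Each synthetic sample is a point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_sample(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy TF-IDF-like vectors for four negative reviews (made-up values, illustration only).
minority = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.0, 0.3],
    [0.6, 0.3, 0.2],
]

synthetic = smote_sample(minority)
for row in synthetic:
    print([round(v, 3) for v in row])
```

Because the new points exist only in feature space, they do not correspond to any readable review text, whereas an LLM generates new sentences that are then vectorized like the original reviews.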

Keywords

Generative AI; Imbalanced Data; LLM; Sentiment Analysis; SMOTE
