Combination of Instance Selection and Data Augmentation Techniques for Imbalanced Data Classification
Subject Areas: electrical and computer engineering
Parastoo Mohaghegh 1, Samira Noferesti 2*, Mehri Rajaei 3
1 - Faculty of Electrical and Computer Engineering, University of Sistan and Baluchestan
2 -
3 - University of Sistan and Baluchestan
Keywords: Instance selection, data augmentation, classification, imbalanced data, data mining, machine learning
Abstract:
In the era of big data, automatic data analysis techniques such as data mining are widely used to support decision-making. Among these techniques, classification is a common method for decision-making and prediction. Classification algorithms usually work well on balanced datasets. However, a key challenge is correctly predicting the labels of new samples after learning from an imbalanced dataset. In such datasets, the uneven distribution of samples across classes causes examples of the minority class to be ignored during learning, even though this class is the more important one in some prediction problems. To address this issue, this paper presents an efficient method for balancing imbalanced datasets, which improves the accuracy with which machine learning algorithms predict the class labels of new samples. According to the evaluations, the proposed method outperforms other methods on two criteria commonly used to evaluate classification on imbalanced datasets, namely "Balanced Accuracy" and "Specificity".
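The two evaluation criteria named in the abstract can be illustrated with a minimal sketch for the binary case. This is not the authors' evaluation code: the function name `binary_metrics`, the choice of the minority class as the positive label, and the toy labels are all assumptions made for illustration. It uses the standard definitions, where specificity is TN / (TN + FP) and balanced accuracy is the mean of sensitivity and specificity.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and balanced accuracy.

    Hypothetical helper for illustration; `positive` marks the class treated
    as positive (assumed here to be the minority class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy imbalanced example (assumed data): 2 minority samples, 8 majority samples,
# and a classifier that always predicts the majority class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case plain accuracy is 0.8 even though every minority sample is misclassified, while balanced accuracy drops to 0.5, which is why balanced accuracy is a common choice for imbalanced datasets.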