Using Clustering and a Hybrid Approach to Impute Missing Numeric Values

Subject areas: Electrical and Computer Engineering

1 - Shahid Rajaee Teacher Training University

Received: 1396/09/08 Accepted: 1396/09/08 Published: 1396/09/07

Keywords: regression, missing values, nearest neighbors, correlation

Abstract (translated from Persian):

Estimating missing values is an important step in data preprocessing. This paper presents a two-step approach for imputing missing numeric values. In the first step, the data are clustered; in the second step, the missing values within each cluster are estimated using a hybrid of weighted k-nearest neighbors and linear regression. The correlation between attributes in each cluster is used to decide which imputation method to apply. Imputation quality is measured with the mean squared error criterion. The effect of different parameters on the error of the estimated values is examined. The performance of the proposed method is also evaluated on five datasets. Finally, the proposed method is compared with four methods: mean imputation, estimation with a multilayer perceptron (MLP) neural network, imputation with fuzzy c-means clustering, and class-based k-clusters nearest neighbor imputation (CKNNI). The results show that the estimation error of the proposed method is lower than that of the other compared methods.

English abstract:

Estimating missing values is an important step in data preprocessing. In this paper, a two-step approach is proposed for imputing missing numeric values. In the first step, the data are clustered. In the second step, the missing values in each cluster are estimated using a combination of weighted k-nearest neighbors and linear regression. A correlation measure between attributes is used to select the appropriate imputation method in each cluster. The quality of the estimated values is evaluated using the root mean squared error (RMSE) criterion. The effect of different input parameters on the error of the estimated values is investigated. Moreover, the performance of the proposed method is evaluated on five datasets. Finally, the proposed method is compared with four other estimation methods: mean imputation, multilayer perceptron (MLP) based estimation, fuzzy c-means (FCM) based approximation, and class-based k-clusters nearest neighbor imputation (CKNNI). Experimental results show that, in most cases, the proposed method yields lower error than the compared methods.
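The two-step idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm, the correlation threshold, or the kNN weighting scheme, so the sketch assumes that cluster labels are supplied by some earlier clustering step, that a hypothetical threshold of 0.7 on the absolute Pearson correlation selects linear regression over weighted kNN, and that neighbors are weighted by inverse Euclidean distance. All function names (`impute_cluster`, `impute_by_cluster`, and so on) are made up for illustration, and missing entries are represented as `None`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a * x + b (xs must not be constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def weighted_knn(row, target, complete, k):
    """Inverse-distance-weighted mean of `target` over the k nearest complete rows."""
    feats = [j for j, v in enumerate(row) if v is not None and j != target]

    def dist(other):
        return math.sqrt(sum((row[j] - other[j]) ** 2 for j in feats))

    nearest = sorted(complete, key=dist)[:k]
    weights = [1.0 / (dist(r) + 1e-9) for r in nearest]  # small term avoids division by zero
    return sum(w * r[target] for w, r in zip(weights, nearest)) / sum(weights)

def impute_cluster(rows, k=3, threshold=0.7):
    """Fill None entries of one cluster; threshold and k are assumed values."""
    complete = [r for r in rows if None not in r]
    filled = [list(r) for r in rows]
    if not complete:
        return filled  # nothing to learn from in this cluster
    for orig, out in zip(rows, filled):
        for t, v in enumerate(orig):
            if v is not None:
                continue
            # Find the observed attribute most correlated with the missing one.
            best_j, best_c = None, 0.0
            for j, u in enumerate(orig):
                if j == t or u is None:
                    continue
                c = pearson([r[j] for r in complete], [r[t] for r in complete])
                if abs(c) > abs(best_c):
                    best_j, best_c = j, c
            if best_j is not None and abs(best_c) >= threshold:
                # Strong linear relation: predict with simple linear regression.
                a, b = fit_line([r[best_j] for r in complete], [r[t] for r in complete])
                out[t] = a * orig[best_j] + b
            else:
                # Weak relation: fall back to weighted k nearest neighbors.
                out[t] = weighted_knn(orig, t, complete, k)
    return filled

def impute_by_cluster(rows, labels, k=3, threshold=0.7):
    """Apply impute_cluster separately to each cluster, preserving row order."""
    out = [None] * len(rows)
    for label in set(labels):
        idx = [i for i, l in enumerate(labels) if l == label]
        for i, filled in zip(idx, impute_cluster([rows[i] for i in idx], k, threshold)):
            out[i] = filled
    return out

def rmse(true_vals, est_vals):
    """Root mean squared error between true and estimated values."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_vals, est_vals)) / len(true_vals))

rows = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, None]]
print(impute_by_cluster(rows, [0, 0, 0, 0])[3])  # second attribute follows y = 2x + 1
```

In the small example, the two attributes are perfectly correlated within the single cluster, so the sketch takes the regression branch. A cluster whose attributes are weakly correlated would instead be filled from its nearest complete rows. The `rmse` helper corresponds to the evaluation criterion named in the English abstract and would be applied to values that were artificially removed and then re-estimated.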

