Automatic Error Detection in Databases Based on Clustering and Nearest Neighbor
Subject areas: Electrical and Computer Engineering
Mahdieh Ataeian 1, Negin Daneshpour 2 *
1 - Shahid Rajaee Teacher Training University
2 - Shahid Rajaee Teacher Training University
Keywords: data correction, automatic error detection, clustering, k-means
Abstract:
Data quality affects organizational decision making: decisions based on poor-quality data impose high costs on an organization. Data quality has several dimensions, of which accuracy is among the most important. Correcting data first requires detecting errors, and given the large volume of data, an automatic system is needed to carry out this process without user intervention. This paper proposes an automatic error detection approach based on k-means clustering. First, the data are clustered for each attribute; then, for each data item in a cluster, a method similar to k-nearest neighbor is used to identify errors. The proposed method can detect multiple errors in a single record and can detect errors in fields of different data types. Experiments show that, on average, the approach detects 91% of the errors present in the data. The approach is also compared with a rule-based error detection method that, like the proposed approach, is automatic and handles different data types; the results show that the proposed approach performs on average 25% better at detecting errors.
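The two-stage idea in the abstract (cluster first, then compare each value with its nearest neighbors inside the cluster) can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given here. It is a minimal numeric-only illustration under stated assumptions: records are clustered on every attribute except the one being checked, neighbors are taken from the same cluster, and a value is flagged when it deviates from the neighbors' median by more than a tolerance. The names `kmeans`, `flag_suspects`, `k_clusters`, `k_neighbors`, and `tolerance` are hypothetical, as is the toy data.

```python
import math
import random
from statistics import median


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over equal-length numeric tuples; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def flag_suspects(records, target, k_clusters=2, k_neighbors=3, tolerance=3.0):
    """Return indices of records whose `target` column disagrees with nearby records in the same cluster."""
    # Assumption: cluster on all attributes except the one under inspection.
    context = [tuple(v for j, v in enumerate(r) if j != target) for r in records]
    labels = kmeans(context, k_clusters)
    suspects = []
    for i, rec in enumerate(records):
        peers = [j for j in range(len(records)) if j != i and labels[j] == labels[i]]
        peers.sort(key=lambda j: math.dist(context[i], context[j]))
        nearest = [records[j][target] for j in peers[:k_neighbors]]
        if not nearest:
            continue
        center = median(nearest)
        # Median absolute deviation of the neighbors; fall back to 1.0 if they all agree.
        spread = median(abs(v - center) for v in nearest) or 1.0
        if abs(rec[target] - center) > tolerance * spread:
            suspects.append(i)
    return suspects


# Hypothetical toy data: (x, y) pairs in two groups; the y value 99 is an injected error.
records = [(1, 10), (2, 11), (3, 99), (4, 13), (5, 14),
           (101, 50), (102, 51), (103, 52), (104, 53), (105, 54)]
suspects = flag_suspects(records, target=1)
print(suspects)  # → [2]
```

Running the check once per attribute, with `target` set to each column in turn, would allow several fields of the same record to be flagged, which mirrors the abstract's claim about multiple errors per record. Categorical attributes would need a different distance and a vote instead of a median; that extension is left out of this sketch.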