A Non-Parametric Proximity-Based Method for Outlier Detection
Subject area: Electrical and Computer Engineering
Yahya Salehi 1, Negin Daneshpour 2*
1 - Shahid Rajaee Teacher Training University
2 - Shahid Rajaee University
Keywords: non-parametric, outlier detection, proximity-based
Abstract:
Outlier detection is an active research area in data mining and machine learning and an important step in data preprocessing. This paper presents NPOD, a non-parametric method for proximity-based outlier detection. The proposed approach combines distance-based and density-based techniques and can detect both global and local outliers. It requires none of the usual parameters: the neighborhood radius, the threshold on the number of points within that radius, or the number of nearest neighbors. To identify outliers, NPOD introduces a new scoring scheme. Experimental results on UCI datasets show that, despite being parameter-free, the algorithm performs comparably with previous methods and in some cases achieves the best performance.
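To make the general idea of a parameter-free, proximity-based outlier score concrete, the following is a minimal Python sketch. It is not the authors' NPOD scoring function; it only illustrates one way such a score could be built, assuming an adaptively grown neighborhood in the spirit of natural-neighbor search, with points then ranked by their mean distance to that neighborhood. The function name natural_neighbor_outlier_scores and the safeguard parameter max_k are hypothetical names introduced here for illustration.

import numpy as np

def natural_neighbor_outlier_scores(X, max_k=50):
    # Illustrative sketch only, not the NPOD method from the paper.
    # Grows the neighborhood size until every point is the neighbor of at
    # least one other point, or the count of "unreached" points stops
    # changing; then scores each point by its mean distance to the
    # resulting neighborhood.
    n = X.shape[0]
    # Pairwise Euclidean distances (brute force, fine for small UCI-sized data).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    order = np.argsort(dist, axis=1)         # order[i, r] = (r+1)-th nearest neighbor of i

    in_degree = np.zeros(n, dtype=int)       # how many points list i as a neighbor
    prev_unreached = n
    for r in range(min(max_k, n - 1)):
        np.add.at(in_degree, order[:, r], 1)
        unreached = int((in_degree == 0).sum())
        if unreached == 0 or unreached == prev_unreached:
            break                             # neighborhood search has stabilized
        prev_unreached = unreached
    k = r + 1                                 # adaptively chosen neighborhood size

    # Score: mean distance to the k adaptively found nearest neighbors.
    knn_dist = np.take_along_axis(dist, order[:, :k], axis=1)
    return knn_dist.mean(axis=1)

# Example: the highest scores flag the most likely outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])   # one obvious outlier
scores = natural_neighbor_outlier_scores(X)
print(np.argsort(scores)[-3:])   # indices of the three most outlying points

Because the neighborhood size is selected by the stabilization criterion rather than supplied by the user, the sketch avoids a fixed k, radius, or count threshold, which is the property the abstract claims for NPOD; the actual scoring rule of the paper may differ substantially.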