An Approximate Binary Tree Construction Approach for Speeding Up Nearest-Neighbor Search in Big Data
Subject areas: Electrical and Computer Engineering
Hossein Kalateh 1, Negin Daneshpour 2 *
1 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
2 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
Keywords: overlapping buffer, big data, binary decision tree, nearest-neighbor classification
Abstract:
Given the ever-increasing rate at which information is produced and the need to turn that information into knowledge, traditional machine learning methods are no longer adequate. When classifying big data with these methods, especially with inherently lazy classifiers such as k-nearest neighbors (KNN), classification becomes very slow. Nearest-neighbor classification is nonetheless popular because of its simplicity and practical accuracy. The proposed method sorts the training feature vectors into a binary search tree in order to speed up nearest-neighbor classification of big data. At each tree node, it approximately finds the two locally farthest points; these two points then serve as pivots for splitting the data in the current node into two groups. Each point in the node is assigned to the left or right child of the current node according to which pivot it is more similar to. Experiments on several data sets from the UCI repository show that the proposed method achieves good accuracy with a low execution time.
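To make the tree construction and query procedure concrete, the following is a minimal Python sketch of this idea. It is an illustration under stated assumptions, not the paper's implementation: the greedy two-scan furthest-pair heuristic, plain Euclidean distance, the LEAF_SIZE threshold, and all names are assumptions of this sketch, and the overlapping-buffer refinement mentioned in the keywords is omitted.

```python
# Sketch of a furthest-pair-based binary search tree for approximate
# nearest-neighbor classification. Assumptions (not from the paper):
# Euclidean distance, greedy 2-scan furthest-pair heuristic, LEAF_SIZE=32.
import numpy as np
from collections import Counter

LEAF_SIZE = 32  # assumed stopping threshold for leaf nodes

class Node:
    def __init__(self, points, labels):
        self.points, self.labels = points, labels
        self.p1 = self.p2 = None       # split pivots (internal nodes only)
        self.left = self.right = None

def approx_furthest_pair(points):
    """Greedy approximation of the two locally farthest points:
    start anywhere, jump to the farthest point, then jump again."""
    d = np.linalg.norm(points - points[0], axis=1)
    p1 = points[np.argmax(d)]
    d = np.linalg.norm(points - p1, axis=1)
    p2 = points[np.argmax(d)]
    return p1, p2

def build(points, labels):
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    node = Node(points, labels)
    if len(points) <= LEAF_SIZE:
        return node
    p1, p2 = approx_furthest_pair(points)
    d1 = np.linalg.norm(points - p1, axis=1)
    d2 = np.linalg.norm(points - p2, axis=1)
    mask = d1 <= d2                     # closer to p1 -> left child
    if mask.all() or (~mask).all():     # degenerate split: stop here
        return node
    node.p1, node.p2 = p1, p2
    node.left = build(points[mask], labels[mask])
    node.right = build(points[~mask], labels[~mask])
    return node

def classify(node, q, k=3):
    # Descend to the leaf whose pivot is nearer the query, then run
    # exact KNN only on that leaf's small subset of training points.
    while node.left is not None:
        if np.linalg.norm(q - node.p1) <= np.linalg.norm(q - node.p2):
            node = node.left
        else:
            node = node.right
    d = np.linalg.norm(node.points - q, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(node.labels[nearest].tolist()).most_common(1)[0][0]
```

Given training arrays X (n samples by d features) and y, `tree = build(X, y)` constructs the index and `classify(tree, q, k=3)` descends to a single leaf and votes among that leaf's k nearest points. The descent is what makes the search approximate: a true nearest neighbor lying just across a split boundary can be missed, which is presumably the failure mode the overlapping buffer in the keywords is meant to mitigate.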