Web Robot Detection Using Fuzzy Rough Set Theory
Subject Areas: electrical and computer engineering
S. Rahimi 1, J. Hamidzadeh 2 *
1 -
2 - Sadjad University of Technology
Keywords: web log file pre-processing, web robot detection, web visitors' session identification, fuzzy rough set theory
Abstract :
Web robots are software programs that traverse the internet autonomously. Their main task is to fetch information and send it to the origin server. Because they consume considerable network bandwidth and reduce server performance, detecting web robots has become an important problem. In this paper, fuzzy rough set theory is used for web robot detection. The proposed method consists of four phases. In the first phase, user sessions are identified using fuzzy rough set clustering. In the second phase, a vector of 10 features is extracted for each session. In the third phase, the identified sessions are labeled using a heuristic method. In the fourth phase, these labels are refined using fuzzy rough set classification. The performance of the proposed method is evaluated on a real-world dataset. The experimental results are compared with those of state-of-the-art methods and show the superiority of the proposed method in terms of F-measure.
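To make the fourth phase more concrete, the following is a minimal sketch of one common way to classify with fuzzy rough sets: a nearest-neighbor-style rule that scores each class by the average of its fuzzy rough lower and upper approximation memberships. It is not the authors' algorithm, which the abstract does not specify. The similarity relation, the Łukasiewicz implicator and t-norm, the two-feature toy sessions, and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Fuzzy similarity in [0, 1]: mean over features of 1 - |a_i - b_i|.

    Assumption: every feature has already been scaled to [0, 1].
    """
    return sum(1.0 - abs(x - y) for x, y in zip(a, b)) / len(a)


def lower_upper(
    x: Sequence[float],
    train: List[Sequence[float]],
    labels: List[str],
    cls: str,
) -> Tuple[float, float]:
    """Membership of x in the fuzzy rough lower/upper approximation of cls."""
    lower, upper = 1.0, 0.0
    for y, label in zip(train, labels):
        r = similarity(x, y)
        member = 1.0 if label == cls else 0.0
        # Lukasiewicz implicator: I(r, m) = min(1, 1 - r + m)
        lower = min(lower, min(1.0, 1.0 - r + member))
        # Lukasiewicz t-norm: T(r, m) = max(0, r + m - 1)
        upper = max(upper, max(0.0, r + member - 1.0))
    return lower, upper


def relabel(
    x: Sequence[float],
    train: List[Sequence[float]],
    labels: List[str],
) -> str:
    """Return the class with the highest mean of lower and upper membership."""
    best_cls, best_score = "", -1.0
    for cls in sorted(set(labels)):  # sorted for deterministic tie-breaking
        lo, up = lower_upper(x, train, labels, cls)
        score = (lo + up) / 2.0
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Hypothetical sessions described by two scaled features (not the paper's 10).
train_sessions = [[0.9, 0.8], [0.95, 0.9], [0.1, 0.2], [0.2, 0.1]]
heuristic_labels = ["robot", "robot", "human", "human"]

print(relabel([0.85, 0.9], train_sessions, heuristic_labels))
print(relabel([0.15, 0.15], train_sessions, heuristic_labels))
```

In this sketch the heuristic labels from the third phase play the role of training labels, and a session whose features sit close to sessions of the other class would receive that class instead. A real system would use the full feature vector and a similarity relation tuned to the data.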