توزیع مؤثر اسناد برای ایجاد توازن بار بین سرورها با استفاده از شمارش رخداد کلمات در سابقه پرس‌وجوها

محورهای موضوعی : مهندسی برق و کامپیوتر

سیده ریحانه تراب جهرمی ¹ , سجاد ظریف زاده ^{2
*}

1 - دانشگاه یزد
2 - دانشگاه یزد

تاریخ دریافت : 1397/10/05 تاریخ پذیرش : 1398/05/06 تاریخ انتشار : 1398/08/26

کلید واژه: توازن بارتوزیع سندسابقه پرس‌وجوموتور جستجو,

چکیده مقاله :

هدف اصلی موتورهای جستجو، یافتن مرتبط‌ترین نتایج نسبت به پرس‌وجوی کاربر در سریع‌ترین زمان ممکن است. صفحات خزش‌شده توسط موتور جستجو بین سرورهای متعددی توزیع می‌شوند تا در هنگام جستجو بتوان از قدرت بازیابی و پردازش موازی آنها برای تولید سریع‌تر پاسخ استفاده نمود. با توجه به تعداد بسیار زیاد صفحات وب، موتورهای جستجو سیاست‌های مختلفی را برای توزیع مناسب اسناد بین سرورها انتخاب می‌کنند. در این مقاله، روش جدیدی برای توزیع اسناد پیشنهاد می‌شود که هدف آن ایجاد توازن بار کاری بین سرورها برای کاهش زمان پاسخ‌گویی موتور جستجو می‌باشد. ایده اصلی، استفاده از پرس‌وجوهای قبلی کاربران است بدین ترتیب که به هر کلمه از کلمات موجود در سابقه پرس‌وجو بر حسب تعداد رخداد روزانه آن، وزنی نسبت داده می‌شود. سپس هر سند با توجه به مجموع وزن کلمات داخل آن، وزن‌دهی می‌شود که این وزن ارتباط مستقیمی با احتمال انتخاب آن سند به عنوان پاسخ یک پرس‌وجو دارد. در نهایت، اسناد به نحوی بین سرورها توزیع می‌شوند که وزن اسناد داخل هر یک از سرورها برابر باشد. نتایج ارزیابی با استفاده از داده واقعی نشان می‌دهند که روش پیشنهادی قادر است توازن بار سرورها را مخصوصاً در زمان اوج ورود پرس‌وجوها بیش از 20% نسبت به روش‌های گذشته بهبود بخشد.

چکیده انگلیسی:

The main goal of web search engines is to find the most relevant results with respect to the user query in a shortest possible time. To do so, the crawled documents have to be partitioned between several servers in order to use their aggregate retrieval and processing power. The search engines use different policies for efficient partitioning of documents. In this paper, we propose a new document partitioning method that intends to balance the load between servers to reduce the response time of queries. The idea is to weigh each term based on its daily frequency in log of past queries. We then assign a weight to each document via summing the weight of its substituent terms. The weight of a document approximates the likelihood of its presence in future search results. Finally, the documents are partitioned between servers in a way that the sum of document weights in each server becomes roughly equal. Our evaluation results show that the proposed method is able to balance the load by about 20% better than former algorithms, especially in the peak of search engine traffic.

منابع و مأخذ:

[1] B. B. Cambazoglu, E. Kayaaslan, S. Jonassen, and C. Aykanat, "A term-based inverted index partitioning model for efficient distributed query processing," ACM Trans. Web, vol. 7, no. 3, pp. 15-23, Sept. 2013.
[2] Y. C. Ma, C. P. Chung, and T. F. Chen, "Load and storage balanced posting file partitioning for parallel information retrieval," J. Syst. Softw., vol. 84, no. 5, pp. 864-884, May 2011.
[3] A. Kulkarni, A. S. Tigelaar, D. Hiemstra, and J. Callan, "Shard ranking and cut off estimation for topically partitioned collections," in Proc. of the 21st ACM Int. Conf. on Information and Knowledge Management, pp. 555-564, Hawaii, USA, 1-2 Nov. 2012.
[4] A. Moffat, W. Webber, J. Zobel, and R. Baeza-Yates, "A pipelined architecture for distributed text query evaluation," Inf. Retr. Boston., vol. 10, no. 3, pp. 205-231, Jun. 2007.
[5] A. Kulkarni, J. Teevan, K. M. Svore, and S. T. Dumais, "Understanding temporal query dynamics," in Proc. of the 4th ACM Int. Conf. on Web Search and Data Mining, WSDM'11, pp. 167-176, Hong Kong, China, 9-12 Feb. 2011.
[6] K. M. Risvik and R. Michelsen, "Search engines and web dynamics," Comput. Networks, vol. 39, no. 3, pp. 289-302, Jun. 2002.
[7] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks ISDN Syst., vol. 30, no. 1-7, pp. 107-117, Apr. 1998.
[8] A. Moffat, W. Webber, and J. Zobel, "Load balancing for term-distributed parallel retrieval," in Proc. of the 29th Annual ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pp. 348-355, Seattle, USA, 6-11Aug. 2006.
[9] D. Puppin, F. Silvestri, R. Perego, and R. Baeza-Yates, "Tuning the capacity of search engines: load-driven routing and incremental caching to reduce and balance the load," ACM Trans. Inf. Syst., vol. 28, no. 2, pp. 1-36, May 2010.
[10] C. Lucchese, S. Orlando, R. Perego, and F. Silvestri, "Mining query logs to optimize index partitioning in parallel web search engines," in Proc. of the 2nd Int. Conf. on Scalable Information Systems, pp. 43-52, Suzhou, China, 6-8 Jun. 2007.
[11] A. Kulkarni and J. Callan, "Selective search: efficient and effective search of large textual collections," ACM Trans. Inf. Syst., vol. 33, no. 4, pp. 1-33, Apr. 2015.
[12] A. Kulkarni, A. S. Tigelaar, D. Hiemstra, and J. Callan, "Shard ranking and cutoff estimation for topically partitioned collections," in Proc. of the 21st ACM International Conf. on Information and Knowledge Management, pp. 555-564, Hawaii, USA, 1-2 Nov. 2012.
[13] Y. Kim, J. Callan, J. S. Culpepper, and A. Moffat, "Load-balancing in distributed selective search," in Proc. of the 39th ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, pp. 905-908, Pisa, Italy, 17-21 Jul. 2016.
[14] Z. Dai, C. Xiong, and J. Callan, "Query-biased partitioning for selective search," in Proc. of the 25th ACM International Conf. on Information and Knowledge Management, pp. 1119-1128, New York, USA, 24-28 Oct. 2016.
[15] Y. Kim, J. Callan, J. S. Culpepper, and A. Moffat, "Efficient distributed selective search," Information Retrieval J., vol. 20, no. 3, pp. 221-252, Jun. 2017.
[16] A. Kane and F. W. Tompa, "Small-term distribution for disk-based search," in Proc. of the ACM Symposium on Document Engineering, New York, USA, 4-7 Sept. 2017.
[17] S. Jonassen and S. E. Bratsberg, "Impact of the query model and system settings on performance of distributed inverted indexes," in Proc. of the 22nd Norwegian Informatics Conf., NIK'09, pp. 143-154, Oslo, Norway, 25-27 Nov. 2009.
[18] B. Ribeiro-Neto, E. S. Moura, M. S. Neubert, and N. Ziviani, "Efficient distributed algorithms to build inverted files," in Proc. of the 22nd Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 105-112, Berkeley, USA, 15-19 Aug. 1999.
[19] Z. Dai, C. Xiong, and J. Callan, "Query-biased partitioning for selective search," in Proc. of the 25th ACM Int. on Conf. on Information and Knowledge Management, pp. 1119-1128, New York, USA, 24-28 Oct. 2016.
[20] H. Patel, Inverted Index Partitioning Strategies for a Distributed Search Engine, University of Waterloo, 2010.
[21] Z. Gyongyi and H. Garcia-Molina, Web Spam Taxonomy, Stanford University, 2005.
[22] J. Zhou and T. Yang, "Selective early request termination for busy internet services," in Proc. of the 15th Int. Conf. on World Wide Web-WWW'06, pp. 605-614, Edinburgh, Scotland, 23-26 May 2006.
[23] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," J. Doc., vol. 28, no. 1, pp. 11-21, Jan. 1972.
[24] Apache Lucene, Welcome to Apache Lucene, Apache Foundation, 2018. [Online]. Available: http://lucene.apache.org/.

مقالات مرتبط

یک رهیافت فرااکتشافی چندهدفه برای بهبود پوشش و اتصال در شبکه‌های حسگر بی‌سیم
تاریخ چاپ : 1405/02/22
رویکرد ارزیابی هیجان نوین جهت مراقبت از سرطان مبتنی بر مدل‌های زبانی بزرگ
تاریخ چاپ : 1405/02/22
ارائه روشی برای مدیریت منابع در شبکه‌های Fog-DSDN با بهره‌گیری از معماری میکروسرویس و شبکه‌های ESN
تاریخ چاپ : 1405/02/22
چارچوب ترکیبی سبک‌وزن برای امنیت اینترنت اشیا با استفاده از جنگل تصادفی بهینه و انتخاب ویژگی تطبیقی در معماری لبه-ابری
تاریخ چاپ : 1405/02/22
یک چارچوب یادگیری نیمه‌نظارتی جهت دسته‌بندی دقیق موارد آزمون با بهره‌گیری از تعبیه‌های زبانی و ویژگی‌های معنایی متن
تاریخ چاپ : 1405/02/22
تکنیک هوشمند مبتنی بر الگوریتم چتر دریایی برای زمان‌بندی وظایف بر اساس اولویت در شبکه‌های IoT/Fog
تاریخ چاپ : 1405/02/22

اشتراک گذاری

آدرس مقاله

توزیع مؤثر اسناد برای ایجاد توازن بار بین سرورها با استفاده از شمارش رخداد کلمات در سابقه پرس‌وجوها