The main goal of a focused web crawler is to retrieve as many relevant pages as possible. However, most crawlers use the PageRank algorithm to order the pages in the crawler frontier. Because PageRank suffers from the “rich get richer” phenomenon, which favors pages that are already heavily linked, focused crawlers often fail to retrieve hidden relevant pages. This paper presents a novel approach for retrieving hidden, relevant pages by combining rank and semantic similarity information. The model is validated by crawling the real web on different topics, and the results are promising.
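The abstract does not state how rank and semantic similarity are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a rank value already normalized to [0, 1], a bag-of-words cosine similarity between the topic and the text around a link, and a linear blend controlled by a weight `alpha`. The names `Frontier`, `combined_score`, and `alpha`, as well as the example URLs, are all hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(rank: float, similarity: float, alpha: float = 0.5) -> float:
    """Assumed linear blend: alpha weights rank, (1 - alpha) weights similarity."""
    return alpha * rank + (1.0 - alpha) * similarity


class Frontier:
    """Crawler frontier ordered by the blended score (highest first)."""

    def __init__(self, topic: str, alpha: float = 0.5):
        self.topic = topic
        self.alpha = alpha
        self._heap = []  # min-heap of (-score, url), so the best score pops first

    def push(self, url: str, rank: float, context: str) -> None:
        similarity = cosine_similarity(self.topic, context)
        score = combined_score(rank, similarity, self.alpha)
        heapq.heappush(self._heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]


if __name__ == "__main__":
    frontier = Frontier("solar panel installation", alpha=0.5)
    # A highly ranked but off-topic page, and a poorly ranked but on-topic one.
    frontier.push("http://popular.example/", 0.9, "celebrity news gossip")
    frontier.push("http://hidden.example/", 0.1, "solar panel installation guide")
    print(frontier.pop())
```

With these made-up inputs the low-rank, on-topic page is popped first, whereas setting `alpha=1.0` (rank only) would pop the popular page first. That difference is the behavior the abstract attributes to rank-only ordering.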
K. Pavani and Dr. Sajeev G. P., “A Novel Web Crawling Method for Vertical Search Engines”, in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017.