Friday, August 24, 2007

Crawler Basics & Challenges

The first step in indexing HTML pages is to fetch all the pages to be indexed using a crawler. One way to collect URLs is to scan fetched pages for hyperlinks (outlinks) to pages that have not yet been indexed; indeed, crawling in this fashion might never halt, because new outlinks keep turning up. It is quite simple to write a basic crawler, but a great deal of engineering goes into an industry-strength crawler. The central function of an industry-strength crawler is to fetch many pages at the same time, so as to overlap the delays involved in the following steps (a concurrent-fetch sketch follows the list):
  1. Resolving the hostname in the URL to an IP address using DNS
  2. Connecting a socket to the server and sending the request
  3. Receiving the requested page
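To make the overlap concrete, here is a minimal sketch (not from the original post) that fetches several pages at once with a Python thread pool; the seed URLs and worker count are arbitrary placeholders.

# Minimal sketch: overlap DNS lookup, connect, and download latency by
# fetching several pages at once with a thread pool. The seed URLs and
# worker count below are arbitrary examples.
import concurrent.futures
import urllib.request

SEEDS = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
]

def fetch(url, timeout=10):
    # DNS resolution, connecting, and receiving all happen inside urlopen/read.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return url, resp.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, u) for u in SEEDS]
    for future in concurrent.futures.as_completed(futures):
        try:
            url, body = future.result()
            print(url, len(body), "bytes")
        except Exception as exc:
            print("fetch failed:", exc)

While one worker is waiting on the network, the others keep the available bandwidth busy, which is exactly the overlap described above.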
Here are a few concerns we will face while engineering a large-scale crawler:
  1. Since a single page fetch may involve several seconds of network latency, it is essential to fetch many pages at the same time to utilize the available network bandwidth.
  2. Many simultaneous fetches are possible only if the DNS lookup is streamlined to be highly concurrent, possibly replicated over a few DNS servers (a DNS-caching sketch follows this list).
  3. Multiprocessing and multithreading introduce overheads. These can be reduced by explicitly coding the state of each fetch context in a data structure and by using asynchronous sockets, which do not block the process or thread using them (a non-blocking-socket sketch follows this list).
  4. Redundant URL fetches must be eliminated and spider traps avoided (a frontier sketch follows this list).
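For concern 2, one common approach (sketched below with assumed host names) is to resolve hostnames in parallel and cache the answers, so that page-fetching threads rarely block on a DNS lookup.

# Sketch for concern 2: pre-resolve hostnames in parallel and cache the
# answers. Host names here are illustrative only, and the simple dict cache
# is not hardened for heavy concurrent use.
import concurrent.futures
import socket

dns_cache = {}

def resolve(host):
    # Resolve a hostname once and remember the first IPv4 address.
    if host not in dns_cache:
        infos = socket.getaddrinfo(host, 80, socket.AF_INET, socket.SOCK_STREAM)
        dns_cache[host] = infos[0][4][0]   # sockaddr = (address, port)
    return host, dns_cache[host]

hosts = ["example.com", "example.org", "example.net"]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for host, addr in pool.map(resolve, hosts):
        print(host, "->", addr)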
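For concern 3, the idea is to keep the state of every in-flight fetch in a small data structure and drive many non-blocking sockets from a single loop, rather than dedicating a thread or process to each page. A rough sketch using Python's selectors module, with illustrative hosts, HTTP/1.0 for brevity, and error handling omitted:

# Sketch for concern 3: one loop drives many non-blocking sockets; the state
# of each fetch lives in a Fetch object instead of a thread's stack.
# Note: the DNS lookup inside connect_ex is still synchronous here; the
# DNS-caching sketch above is one way to take it off this path.
import selectors
import socket

class Fetch:
    # Explicit state for one in-flight page fetch.
    def __init__(self, host, path="/"):
        self.host = host
        self.request = ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode()
        self.response = b""

sel = selectors.DefaultSelector()

for host in ["example.com", "example.org"]:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    s.connect_ex((host, 80))                       # returns immediately
    sel.register(s, selectors.EVENT_WRITE, Fetch(host))

while sel.get_map():
    ready = sel.select(timeout=5)
    if not ready:                                  # nothing became ready: give up
        break
    for key, mask in ready:
        sock, fetch = key.fileobj, key.data
        if mask & selectors.EVENT_WRITE:           # connected: send the request
            sock.sendall(fetch.request)
            sel.modify(sock, selectors.EVENT_READ, fetch)
        elif mask & selectors.EVENT_READ:          # response bytes available
            chunk = sock.recv(4096)
            if chunk:
                fetch.response += chunk
            else:                                  # server closed: fetch finished
                print(fetch.host, len(fetch.response), "bytes")
                sel.unregister(sock)
                sock.close()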
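For concern 4, a simple frontier can normalize each discovered URL, skip URLs it has already seen, and refuse suspiciously long or deeply nested URLs, a common symptom of spider traps. The thresholds and URLs below are arbitrary examples:

# Sketch for concern 4: avoid redundant fetches with a seen-set over
# normalized URLs, and guard against spider traps with crude length and
# depth limits.
from urllib.parse import urldefrag, urljoin, urlparse

seen = set()
frontier = []

MAX_URL_LENGTH = 256     # arbitrary guard against ever-growing trap URLs
MAX_PATH_DEPTH = 10      # arbitrary guard against endlessly nested paths

def enqueue(base_url, link):
    # Resolve a discovered link and queue it only if it is new and sane.
    url, _ = urldefrag(urljoin(base_url, link))    # absolute URL, fragment dropped
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return
    if len(url) > MAX_URL_LENGTH or parsed.path.count("/") > MAX_PATH_DEPTH:
        return                                     # likely a spider trap
    if url in seen:
        return                                     # redundant fetch avoided
    seen.add(url)
    frontier.append(url)

enqueue("http://example.com/a/", "../b/page.html#section2")
print(frontier)   # ['http://example.com/b/page.html']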
