The first step in indexing HTML pages is to fetch the pages to be indexed using a crawler. One way to collect URLs is to scan fetched pages for hyperlinks (outlinks) to pages that have not yet been indexed; because new outlinks are discovered continually, crawling in this fashion might never halt. It is quite simple to write a basic crawler, but a great deal of engineering goes into an industry-strength crawler. The central function of an industry-strength crawler is to fetch many pages at the same time, so as to overlap the delays involved in the steps below (a minimal sketch of such overlapped fetching follows the list):
- Resolving the hostname in the URL to an IP address using DNS
- Connecting a socket to the server and sending the request
- Receiving the requested page
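As a rough illustration of how these delays can be overlapped, the sketch below fetches several pages concurrently from a small thread pool, so that one fetch's DNS lookup or network wait does not idle the others. The seed URLs, pool size, and timeout are illustrative assumptions, not values prescribed by the text.

```python
# A minimal sketch of overlapping many fetches with a thread pool; the URLs,
# pool size, and timeout below are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch one page; DNS lookup, connect, and receive all happen inside urlopen."""
    with urlopen(url, timeout=timeout) as resp:
        return url, resp.read()

seed_urls = ["https://example.com/", "https://example.org/"]  # placeholder seeds

# Each worker thread blocks on its own DNS/connect/receive delays, so many
# fetches proceed in parallel and the available bandwidth stays in use.
with ThreadPoolExecutor(max_workers=50) as pool:
    futures = [pool.submit(fetch, u) for u in seed_urls]
    for fut in as_completed(futures):
        try:
            url, body = fut.result()
            print(url, len(body), "bytes")
        except Exception as exc:
            print("fetch failed:", exc)
```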
Here are a few concerns that arise while engineering a large-scale crawler:
- Since a single page fetch may involve several seconds of network latency, it is essential to fetch many pages at the same time to utilize the available network bandwidth.
- Many simultaneous fetches are possible only if DNS lookup is streamlined to be highly concurrent, possibly replicated over a few DNS servers (see the caching-resolver sketch after this list).
- Multiprocessing and multithreading impose significant overheads. These can be avoided by explicitly encoding the state of each fetch context in a data structure and by using asynchronous sockets, which do not block the process or thread using them (see the non-blocking socket sketch after this list).
- The crawler must eliminate redundant URL fetches and avoid spider traps, i.e., sites that generate an endless sequence of URLs (see the duplicate-elimination sketch after this list).
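To illustrate the DNS concern, here is a minimal sketch of a caching, concurrent resolver: lookups run on worker threads, and resolved addresses are remembered in a shared in-memory cache so repeated hostnames are not resolved twice. The hostnames, pool size, and cache policy (no TTL handling) are assumptions made for the example.

```python
# A minimal sketch of a caching, concurrent DNS resolver (hostnames and pool
# size are illustrative assumptions); a production crawler would also honor TTLs.
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

_dns_cache = {}
_dns_lock = threading.Lock()

def resolve(hostname):
    """Return the IP address for hostname, consulting a shared in-memory cache first."""
    with _dns_lock:
        if hostname in _dns_cache:
            return _dns_cache[hostname]
    ip = socket.gethostbyname(hostname)   # blocking lookup, done off the main thread
    with _dns_lock:
        _dns_cache[hostname] = ip
    return ip

hostnames = ["example.com", "example.org", "example.net"]  # placeholder hosts
with ThreadPoolExecutor(max_workers=20) as pool:
    for host, ip in zip(hostnames, pool.map(resolve, hostnames)):
        print(host, "->", ip)
```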
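The point about avoiding per-fetch processes or threads can be sketched with non-blocking sockets and an explicit state record per connection. The single-threaded event loop below is a simplified illustration; the hosts, port, and HTTP/1.0 request format are assumptions of the example rather than details from the text, and hostname resolution here is still blocking.

```python
# A minimal sketch (not production code) of driving several HTTP/1.0 fetches
# from a single thread with non-blocking sockets: each connection's progress
# lives in an explicit state dict rather than in a blocked thread.
import selectors
import socket

sel = selectors.DefaultSelector()

def start_fetch(host, path="/"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    sock.connect_ex((host, 80))            # returns immediately; connect finishes later
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    state = {"host": host, "request": request, "response": b""}
    sel.register(sock, selectors.EVENT_WRITE, state)

def pump():
    """Drive all registered fetches until they complete or stall."""
    while sel.get_map():
        ready = sel.select(timeout=5)
        if not ready:                      # nothing progressed: give up in this sketch
            break
        for key, events in ready:
            sock, state = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:
                sock.send(state["request"])        # short request: assume one send suffices
                sel.modify(sock, selectors.EVENT_READ, state)
            elif events & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    state["response"] += chunk     # accumulate the page in the state dict
                else:                              # peer closed: this fetch is done
                    sel.unregister(sock)
                    sock.close()
                    print(state["host"], len(state["response"]), "bytes")

for h in ["example.com", "example.org"]:           # placeholder hosts
    start_fetch(h)
pump()
```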
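Finally, duplicate elimination and spider-trap avoidance can be sketched as a check applied before a URL joins the work queue: URLs are canonicalized so different spellings of the same page collapse to one key, and crude length and depth limits reject the endless URL sequences a trap can generate. The specific normalization rules and limits below are illustrative assumptions.

```python
# A minimal sketch of duplicate elimination and simple spider-trap defenses;
# the canonicalization rules and depth/length limits are illustrative assumptions.
from urllib.parse import urlsplit, urlunsplit

MAX_DEPTH = 15          # refuse paths nested suspiciously deep
MAX_URL_LEN = 256       # refuse absurdly long URLs

seen = set()            # canonical forms of every URL already enqueued

def canonicalize(url):
    """Normalize a URL so trivially different spellings map to one key."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                 # drop the default port
    return urlunsplit((scheme, netloc, path or "/", query, ""))

def should_enqueue(url):
    """Return True only for new URLs that do not look like spider traps."""
    if len(url) > MAX_URL_LEN:
        return False
    canon = canonicalize(url)
    if canon.count("/") - 2 > MAX_DEPTH:     # crude depth check on the path
        return False
    if canon in seen:                        # redundant fetch: already scheduled
        return False
    seen.add(canon)
    return True

print(should_enqueue("HTTP://Example.COM:80/index.html#top"))   # True: first time seen
print(should_enqueue("http://example.com/index.html"))          # False: same page, different spelling
```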