A lot has changed with the web crawler since I last posted an update. Even though the updates have been scarce and few the project has had nightly revision changes even if I am swamped with other projects; there is always time for the crawler. In this post I will be talking about four main updates; hardware, storage, threading, and an intelligent queue.
The main upgrade was to the hardware of the server. Upon discovering that the key efficiency was consistently less than 95% I had to think of a way to either make it more efficient or find a way to increase the buffer size. The solution to this was backing up the database and moving it off of the server. Once this was done I nuked mysql from the server and made that box a dedicated apache box. Then came the fun of ordering a new mysql server. The new box contains 2x 2TB hdd that are currently in raid 1. The box also has 32GB of memory; which in theory should get rid of my key efficiency issues. Now it was time to move the database onto the new box and get things rolling again.
I realized that how I was storing the information was unrealistic; a colleague of mine recommended that I store the pages in a file system rather than in a database but this didn’t really appeal to me. I would rather parse the page line by line and store a linked page than move to a file system. The solution for now is to hash the pages and make sure that I was not storing the same page twice. The page nid is then linked back to the url that was crawled. This seams to have made everything much more efficient especially since while crawling websites with user generated content there are a lot of 404 pages or pages that display the same thing as another page that I may have crawled in the past. With twitter alone there were roughly 176,000 duplicate pages that were stored in the database and 780,000+ pages from amazon that were duplicates.
Probably the best thing that I added to the crawler was the pseudo threading process. Because the application is PHP based the crawling process starts on a heartbeat that occurs every 15 seconds. This heart beat kicks off multiple crawl scripts and thus the process begins. Before I had the threading enabled I had to set the heartbeat to about 5 minutes. This is because if the crawl took longer than expected and the database selects and inserts queued up the system would snowball and everything would cease to work efficiently and properly. Keep in mind that there wasn’t new mysql database at this time.
I created a table that handles instances of each crawl and allows handling of how many tasks are currently running. These threads can be stopped, started, and locked which allows for a simple way to turn off the crawler or if testing only turn on one thread at a time. Each of the threads contains information for which url it is currently crawling which destroys the chance of a duplicate crawl happening at the same time.
Because there is a finite number of simultaneous page crawls happening at one time I can be sure that the even if there is a queue on the mysql server it will never get out of hand. Also; any snowballing will be handled and thus it is impossible.
The Intelligent Queue
There is now a proper queue that is being used by the system now! This queue has some rules to it that allow for better crawling of the web. First; it can not exceed 30,000 urls. This is because I want selects from the table to be swift and I don’t want the queue to become to large. Another rule is that the queue can only be populated from the queue generator class. This class gets candidates based on some set rules that can be configured and changed. Currently it is favouring uncrawled home pages of websites, so websites with a Uri of ‘/’. After that is favours websites with a host that has a very high combined link back rate. Currently the highest website in link backs is twitter… No surprise there. The second highest is tumblr.
This process of generating the queue may not be the most efficient way of crawling the best content on the web and this is something that I have monitored and am aware of. Because it isn’t the best content on the web; mainly social media websites I put some limitations on adding x amount of the same host to the queue. After this was done I hard coded wikipedia’s host nid into the queue generation to make sure that I was getting some sane and pretty reliable content to crawl.