Recently, after working out some kinks and rewriting the text storage engine, I decided to push the launch dates back to smooth out some little quirks and quarks… The news page is also being overhauled: rather than create news only for the technology folk, I have decided to make news for everyone and keep a really cool/fun technology section for those of us nerds with downtime.
The current large issue with news is that I would like to use APIs to pull the news feed content; however, some of the more reputable journals and blogs on the internet do not offer one. I would have loved to pull content from the BBC but have found that there is no public API for the news page, which makes sense since it is their revenue stream. I have decided that another day I will try to hack something together and see what I can dig up as far as an XML or JSON version of the feed goes. There must be something, and with further inspection of their mobile app/site I should be able to find it.
Technology news page
As it currently stands the page could use a fair amount of work, but I am content for the time being. The next step is to add a proper API caching layer, write the general/world news page, then call it a day and move on to the more technical task of searching algorithms and searching algorithms and searching algorithms…
In between wanting to no longer live because of the searching algorithms I will add the search auto-complete functionality since that is pretty important I suppose…
Recently I have been doing a lot of work with APIs. So far I have tied into YouTube, Flickr, Wordnik, isoHunt and a few other rather big ones. I love putting together code and having it just work. I also love working with external services and having proper, clean documentation laid out in front of me.
The reason I am creating this post is primarily the lack of standards I have seen when it comes to documentation. So far the best documentation has been from YouTube and Wordnik and the worst was from isoHunt, which was really just a forum post with an example call.
Flickr had some issues for me when it came to documentation because, in my eyes, a web service should list four main things on each call's page.
- URL and method required to connect
- Example request
- Example response
- Parameters with flags for optional or required
All of these things are rather simple and are the core requirements when it comes to documenting a REST service call. All of the above should be displayed. It also helps to show the response types (XML, JSON or JSONP) when displaying examples.
I have run across two rather annoying things when working with these various APIs. First, Yahoo needs to show on each API method's page that you can request JSON by adding a parameter to the query. The second issue that I came across was the YouTube object variable names.
The YouTube API, in its most recent version, uses dollar signs in the variable names… This is a huge pain when working in PHP because a dollar sign in the middle of a property name breaks the property access and causes a parse error. The solution to get around this is rather simple, but still, it should not be necessary to call an object property in a rather ‘hacky’ manner.
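A minimal sketch of that workaround follows. The payload and its keys (`media$group`, `media$title`, `$t`) are made up for illustration and are not copied from a real response; the point is only the curly-brace property syntax.

```php
<?php
// Illustrative payload only; the keys are assumed for this example.
$json = '{"entry": {"media$group": {"media$title": {"$t": "Example video"}}}}';
$feed = json_decode($json);

// Writing $feed->entry->media$group is a syntax error, because PHP reads
// $group as the start of a new variable. Curly braces let us quote the name.
$title = $feed->entry->{'media$group'}->{'media$title'}->{'$t'};

echo $title, "\n";
```

Another option is to decode into arrays with `json_decode($json, true)` and index by string key, which avoids the property syntax altogether.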
The only other complaint that I have, which you should correct in the next version of your API, is that variables need to be consistently named. This was an annoyance when working with the Isohunt API. All of the variables started with a lower-case first character except for one. I don’t care if your variables have been concatenated with an underscore or use camel case, just make them consistent please.
die("EOR: End of rant.");
I forgot to mention this in the last post and seeing as it is a pretty big update that was added I figured that I should mention it. I have created an API to allow specified people access to the crawler and the database that drives the search technology. This will allow selected people to perform searches, add items to the queue, get host information by searching and so much more.
Because this API is pretty locked down and others do not have access to it, I am offering a limited version that I am working on to the outside world. It is not currently finished, but if you would like to take a look at it send an email to caleb [dot] jonasson at gmail and make the subject “lirkrawler api access”. It will help if you tell me why you are interested in getting your hands on the API.
Note that if people start abusing the system or performing large database hits with insane limits I am just going to turn off their access since this project is currently on a home server.
A lot has changed with the web crawler since I last posted an update. Even though the updates here have been few and far between, the project has had nightly revisions even when I am swamped with other projects; there is always time for the crawler. In this post I will be talking about four main updates: hardware, storage, threading, and an intelligent queue.
The main upgrade was to the hardware of the server. Upon discovering that the key efficiency was consistently less than 95%, I had to think of a way to either make it more efficient or increase the buffer size. The solution was to back up the database and move it off of the server. Once this was done I nuked MySQL from the server and made that box a dedicated Apache box. Then came the fun of ordering a new MySQL server. The new box contains 2x 2TB HDDs that are currently in RAID 1. The box also has 32GB of memory, which in theory should get rid of my key efficiency issues. Now it was time to move the database onto the new box and get things rolling again.
I realized that how I was storing the information was unrealistic; a colleague of mine recommended that I store the pages in a file system rather than in a database, but this didn’t really appeal to me. I would rather parse the page line by line and store a linked page than move to a file system. The solution for now is to hash each page and make sure that I am not storing the same page twice. The page nid is then linked back to the url that was crawled. This seems to have made everything much more efficient, especially since websites with user-generated content have a lot of 404 pages or pages that display the same thing as another page that I may have crawled in the past. With Twitter alone there were roughly 176,000 duplicate pages stored in the database, and 780,000+ pages from Amazon were duplicates.
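The hashing idea can be sketched roughly as below. This is not the crawler's actual code: the two arrays stand in for the page and url tables, `storePage` is a hypothetical helper, and SHA-1 is an assumed choice of hash.

```php
<?php
// Hypothetical in-memory stand-ins for the page and url tables.
$pages = [];      // content hash => page nid
$urlToPage = [];  // url => page nid

function storePage(array &$pages, array &$urlToPage, string $url, string $html): int
{
    $hash = sha1($html);                   // identical content gives an identical hash
    if (!isset($pages[$hash])) {
        $pages[$hash] = count($pages) + 1; // first time this content is seen
    }
    $urlToPage[$url] = $pages[$hash];      // link the url back to the stored page
    return $pages[$hash];
}

// Two different urls that serve the same "not found" body share one stored page.
$a = storePage($pages, $urlToPage, 'http://example.com/missing-1', '<h1>Not found</h1>');
$b = storePage($pages, $urlToPage, 'http://example.com/missing-2', '<h1>Not found</h1>');
$c = storePage($pages, $urlToPage, 'http://example.com/', '<h1>Home</h1>');
```

In a database the same effect would likely come from a unique index on the hash column, so a duplicate page is caught before it is inserted.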
Probably the best thing that I added to the crawler was the pseudo-threading process. Because the application is PHP based, the crawling process starts on a heartbeat that occurs every 15 seconds. This heartbeat kicks off multiple crawl scripts and thus the process begins. Before I had the threading enabled I had to set the heartbeat to about 5 minutes. This is because if a crawl took longer than expected and the database selects and inserts queued up, the system would snowball and everything would cease to work efficiently and properly. Keep in mind that the new MySQL server did not exist at this time.
I created a table that tracks each crawl instance and controls how many tasks are running at once. These threads can be stopped, started, and locked, which gives me a simple way to turn off the crawler or, when testing, turn on only one thread at a time. Each thread records which url it is currently crawling, which removes the chance of a duplicate crawl happening at the same time.
Because there is a finite number of simultaneous page crawls happening at one time, I can be sure that even if there is a queue on the MySQL server it will never get out of hand. The cap on running threads also means the snowballing described above can no longer happen.
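A rough sketch of how such a thread table might behave, using an in-memory class in place of the real table. `CrawlSlots` and its method names are hypothetical, and the limit of two slots is only for the demonstration.

```php
<?php
// Hypothetical in-memory version of the crawl-instance table.
final class CrawlSlots
{
    /** @var array<string,int> url => time the crawl started */
    private $running = [];
    private $locked = false;
    private $maxSlots;

    public function __construct(int $maxSlots)
    {
        $this->maxSlots = $maxSlots;
    }

    public function lock(): void   { $this->locked = true; }
    public function unlock(): void { $this->locked = false; }

    // Returns true if the caller may start crawling $url now.
    public function claim(string $url): bool
    {
        if ($this->locked                                 // crawler switched off
            || isset($this->running[$url])                // url already being crawled
            || count($this->running) >= $this->maxSlots   // all slots are busy
        ) {
            return false;
        }
        $this->running[$url] = time();
        return true;
    }

    public function release(string $url): void
    {
        unset($this->running[$url]);
    }
}

$slots  = new CrawlSlots(2);
$first  = $slots->claim('http://example.com/a'); // free slot: allowed
$dup    = $slots->claim('http://example.com/a'); // same url: refused
$second = $slots->claim('http://example.com/b'); // second slot: allowed
$full   = $slots->claim('http://example.com/c'); // no slots left: refused
$slots->release('http://example.com/a');
$retry  = $slots->claim('http://example.com/c'); // slot freed: allowed
```

With a real table shared between separate PHP processes, the claim step would need to be atomic (for example a unique key on the url column), which this single-process sketch does not show.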
The Intelligent Queue
There is now a proper queue being used by the system! This queue has some rules that allow for better crawling of the web. First, it cannot exceed 30,000 urls. This is because I want selects from the table to be swift and I don’t want the queue to become too large. Another rule is that the queue can only be populated from the queue generator class. This class gets candidates based on some set rules that can be configured and changed. Currently it favours uncrawled home pages of websites, so websites with a URI of ‘/’. After that it favours websites with a host that has a very high combined link-back rate. Currently the highest website in link-backs is Twitter… No surprise there. The second highest is Tumblr.
This process of generating the queue may not be the most efficient way of crawling the best content on the web, and this is something that I have monitored and am aware of. Because the top hosts are mainly social media websites rather than the best content on the web, I put a limit on how many urls from the same host can be added to the queue. After this was done I hard-coded Wikipedia’s host nid into the queue generation to make sure that I was getting some sane and pretty reliable content to crawl.
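The queue rules above could be sketched along these lines. `buildQueue`, the candidate array shape, and the per-host limit of 1 are all assumptions made for the example, and the hard-coded Wikipedia host is left out.

```php
<?php
// Hypothetical candidate rows: host, uri, crawled flag, and the host's link-back count.
function buildQueue(array $candidates, int $maxQueue = 30000, int $maxPerHost = 1): array
{
    // Only uncrawled urls are eligible (an assumption of this sketch).
    $candidates = array_values(array_filter($candidates, function (array $c): bool {
        return !$c['crawled'];
    }));

    // Home pages ('/') first, then hosts with the most link-backs.
    usort($candidates, function (array $a, array $b): int {
        $homeA = $a['uri'] === '/' ? 0 : 1;
        $homeB = $b['uri'] === '/' ? 0 : 1;
        return [$homeA, $b['linkbacks']] <=> [$homeB, $a['linkbacks']];
    });

    $queue = [];
    $perHost = [];
    foreach ($candidates as $c) {
        if (count($queue) >= $maxQueue) {
            break;                       // hard cap on the queue size
        }
        $seen = $perHost[$c['host']] ?? 0;
        if ($seen >= $maxPerHost) {
            continue;                    // this host already has its share
        }
        $perHost[$c['host']] = $seen + 1;
        $queue[] = $c['host'] . $c['uri'];
    }
    return $queue;
}

$candidates = [
    ['host' => 'tumblr.com',    'uri' => '/post', 'crawled' => false, 'linkbacks' => 500],
    ['host' => 'twitter.com',   'uri' => '/a',    'crawled' => false, 'linkbacks' => 900],
    ['host' => 'example.org',   'uri' => '/',     'crawled' => true,  'linkbacks' => 3],
    ['host' => 'wikipedia.org', 'uri' => '/',     'crawled' => false, 'linkbacks' => 700],
    ['host' => 'twitter.com',   'uri' => '/',     'crawled' => false, 'linkbacks' => 900],
];

$queue = buildQueue($candidates);
print_r($queue);
```

The per-host limit is what keeps a host with a huge link-back count from filling the whole queue on its own.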
Recently my crawling attempts came to a very swift end. I may have filled up DreamHost’s MySQL server ‘madal’, which caused multiple people’s websites to go down until the issues were resolved. Unfortunately for me, resolving it meant them moving my MySQL data to a temporary location until I contacted customer support.
Turns out that their ‘unlimited’ policy only allows for 3GB of mysql usage rather than the 109GB of indexed websites I had sitting on there. This means that I had to find another solution for my problem.
Solution A: Dump archives of the indexes, dump raw content, dump backups and try to work with virtually no storage.
Solution B: Go and buy a mini form factor computer, pull the dvd drive out and replace it with a second hard disk. Turn the computer into a home server and start the process again with a better code design behind it.
I went with solution B because it allowed me to have a new toy and work with my own system. The server has an i3 running at just over 3.0 GHz, 3TB of total hard disk space and 8GB of RAM. It is hooked up via gigabit and allows me to better work things out and write some of the processing scripts in C++. This way I can constantly crunch information.
I don’t mean to hate on DreamHost. They really are an excellent company and they are pretty quick when it comes to responding to requests for help. The so-called unlimited policy is on the misleading side of things, but for what I pay annually I can’t complain.
stored urls: 1.8m
checked urls: 56m
url count: 27m
I ran into a bit of a snag when my crawler hit a server that returned a straight media file, which caused an unhandled timeout. The solution was simply to create a stream context with a timeout and pass it to file_get_contents.
$context = stream_context_create(array('http' => array('timeout' => 1)));
file_get_contents("http://cjonasson.com/", false, $context);
I added a few counters to the interface but they are taxing the database a little bit too much. They are updating and they do look really good, but they should only update once every 15 seconds rather than 10 or more times per second.
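One way to get there is to cache each counter and only recompute it once it is older than 15 seconds. The sketch below is only an illustration: `cached` is a hypothetical helper, the counter value is made up, and the closure stands in for the real database query.

```php
<?php
// Hypothetical helper: recompute a value at most once every $ttl seconds.
// $now can be passed in explicitly, which makes the behaviour easy to test.
function cached(array &$cache, string $key, int $ttl, callable $compute, ?int $now = null)
{
    $now = $now ?? time();
    if (isset($cache[$key]) && $now - $cache[$key]['at'] < $ttl) {
        return $cache[$key]['value'];   // still fresh, skip the database
    }
    $value = $compute();
    $cache[$key] = ['value' => $value, 'at' => $now];
    return $value;
}

$cache = [];
$queries = 0;
$countUrls = function () use (&$queries) {
    $queries++;          // stands in for a COUNT(*) against the database
    return 1800000;      // made-up counter value
};

$first  = cached($cache, 'stored_urls', 15, $countUrls, 100);
$second = cached($cache, 'stored_urls', 15, $countUrls, 110); // 10 s later: served from cache
$third  = cached($cache, 'stored_urls', 15, $countUrls, 116); // 16 s later: recomputed
```

Since a plain PHP array only lives for one request, the real thing would need to keep the cached value somewhere shared between requests, such as a file or a small table.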
A while back I started to work on the crawler and I am about to start working on it once again. This time I will be optimizing, storing the information and logging the speed of the database transaction to test how the cloud hosting service is handling this many connections.
Thank you for the unlimited bandwidth and transfer, DreamHost, because I will be making good use of your service tonight. I don’t plan on turning off this service and I would like to add some form of multitasking on the server, or at least increase the connections made per second. So far it is indexing 10 sites a second but I think I could easily get that up to 50 with a little bit of work and a little lack of sleep.