Lirkr has recently had a face lift.
Over the past few weeks I have been working on getting the UI rewritten in AngularJS, with Twitter’s Bootstrap as the style foundation. This was important for a few reasons, mainly rapid iteration and mobile support.
Lirkr Mobile Supported News View using AngularJs and Twitter’s Bootstrap
First, it allows better caching of content loaded from the server, which frees up the server to do more with less. Rather than having the server return recent searches, I can simply store them in the application for later reference, and this is done quite easily thanks to Angular.
Secondly, with Angular everything is state based thanks to ui-router, a library that plugs into Angular. Ui-router lets the application determine which state to use based on the URL path. It also supports URL variables and special-case handling for specific URLs, e.g. defaulting a certain directory to a specific state.
A heads-up is warranted in regards to AngularJS. It isn’t the quickest thing to pick up, as it requires you to wrap your head around best practices that are inconsistently documented. There are regular cases where one page will tell you the best practice is one way and the following page will say the complete opposite.
Bootstrap was a good choice when deciding what to include in the client tech stack changes. It allows simple and easy desktop, tablet and mobile support. It’s as simple as adding classes to your containers and letting Bootstrap work its magic.
Bootstrap has some lovely documentation, along with a tool for generating CSS that contains what you want and leaves out what you won’t need. By using this tool you can trim what isn’t needed right out of the gate and get yourself a smaller style sheet.
You can view the changes to the website here.
The second big change is the URL paths. I went from having ‘/news/tech’ to having ‘/#/news/tech’. This was caused by the URLs that are required in order to drive the Angular states. Even with the added vhost changes I couldn’t see a way to keep the old URL scheme.
Recently, after working out some kinks and rewriting the text storage engine, I decided to push the launch date back to smooth some little quirks and quarks out… There has also been an overhaul of the news page, and I have decided that rather than create news only for the technology folk I would make news for everyone and just have a really cool/fun technology section for those of us nerds with downtime.
The current large issue with news is that I would like to use APIs to pull the news feed content; however, not all of the more reputable journals and blogs on the internet offer one. I would have loved to pull content from the BBC but have found that there is no public API for the news page, which makes sense since it is their revenue stream. I have decided that another day I will try to hack something together and see what I can dig up as far as an XML or JSON version of the feed goes. There must be something, and with further inspection of their mobile app/site I should be able to find it.
Technology news page
As it currently stands the page could use a fair amount of work, but I am content for the time being. The next step is to add a proper API caching layer, write the general/world news page, then call it a day and move on to the more technical task of searching algorithms and searching algorithms and searching algorithms…
In between wanting to no longer live because of the searching algorithms I will add the search auto-complete functionality since that is pretty important I suppose…
I forgot to mention this in the last post, and seeing as it is a pretty big update I figured that I should mention it. I have created an API that gives specified people access to the crawler and the database that drives the search technology. It allows them to perform searches, add items to the queue, get host information by searching and so much more.
Because this API is pretty locked down and others do not have access to it, I am working on a limited version to offer to the outside world. It is not currently finished, but if you would like to take a look at it send an email to caleb [dot] jonasson at gmail with the subject “lirkrawler api access”. It will help if you tell me why you are interested in getting your hands on the API.
Note that if people start abusing the system or performing large database hits with insane limits I am just going to turn off their access since this project is currently on a home server.
A lot has changed with the web crawler since I last posted an update. Even though the posts have been few and far between, the project has had nightly revisions even when I am swamped with other projects; there is always time for the crawler. In this post I will be talking about four main updates: hardware, storage, threading, and an intelligent queue.
The main upgrade was to the hardware of the server. Upon discovering that the MySQL key efficiency was consistently less than 95%, I had to either make it more efficient or find a way to increase the buffer size. The solution was to back up the database and move it off of the server. Once this was done I nuked MySQL from the server and made that box a dedicated Apache box. Then came the fun of ordering a new MySQL server. The new box contains 2x 2TB HDDs that are currently in RAID 1. The box also has 32GB of memory, which in theory should get rid of my key efficiency issues. Now it was time to move the database onto the new box and get things rolling again.
I realized that how I was storing the information was unrealistic. A colleague of mine recommended that I store the pages in a file system rather than in a database, but this didn’t really appeal to me. I would rather parse the page line by line and store a linked page than move to a file system. The solution for now is to hash the pages and make sure that I am not storing the same page twice. The page nid is then linked back to the URL that was crawled. This seems to have made everything much more efficient, especially since websites with user generated content have a lot of 404 pages, or pages that display the same thing as another page that I may have crawled in the past. With Twitter alone there were roughly 176,000 duplicate pages stored in the database, and there were 780,000+ duplicate pages from Amazon.
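As a rough illustration of the hashing idea, here is a minimal Python sketch (the crawler itself is PHP, so this is not the real code). The `PageStore` class, its dictionaries and the choice of SHA-256 are all assumptions made for the example; the post does not say which hash function or table layout is actually used.

```python
import hashlib


class PageStore:
    """In-memory stand-in for the page and URL tables (hypothetical layout)."""

    def __init__(self):
        self.hash_to_nid = {}  # content hash -> page nid
        self.bodies = {}       # page nid -> stored page body
        self.url_to_nid = {}   # crawled url -> page nid

    def store(self, url, body):
        """Store the body once per distinct content and link the url to it."""
        # SHA-256 is an assumption; any stable content hash works the same way.
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        nid = self.hash_to_nid.get(digest)
        if nid is None:
            # First time this exact content is seen: keep a single copy.
            nid = len(self.hash_to_nid) + 1
            self.hash_to_nid[digest] = nid
            self.bodies[nid] = body
        # Every crawled url points at the shared page nid.
        self.url_to_nid[url] = nid
        return nid
```

Two URLs that return the same 404 page end up pointing at one stored body, which is where the savings on duplicate-heavy sites would come from.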
Probably the best thing that I added to the crawler was the pseudo threading process. Because the application is PHP based, the crawling process starts on a heartbeat that occurs every 15 seconds. This heartbeat kicks off multiple crawl scripts and thus the process begins. Before I had the threading enabled I had to set the heartbeat to about 5 minutes. This is because if a crawl took longer than expected and the database selects and inserts queued up, the system would snowball and everything would cease to work efficiently and properly. Keep in mind that there wasn’t a new MySQL server at this time.
I created a table that tracks each crawl instance and lets me control how many tasks are currently running. These threads can be stopped, started, and locked, which gives me a simple way to turn off the crawler or, when testing, to turn on only one thread at a time. Each thread records which URL it is currently crawling, which eliminates the chance of the same URL being crawled twice at the same time.
Because there is a finite number of simultaneous page crawls happening at one time, I can be sure that even if there is a queue on the MySQL server it will never get out of hand. With the number of running crawls capped, the snowballing described above can no longer happen.
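To make the slot idea concrete, here is a small Python sketch of a capped set of crawl instances driven by a heartbeat. It is only an illustration under assumptions: `CrawlSlots`, `heartbeat` and the field names are hypothetical, an in-memory list stands in for the MySQL table, and the real crawler is written in PHP.

```python
class CrawlSlots:
    """In-memory stand-in for the crawl-instance table (hypothetical layout)."""

    def __init__(self, max_slots):
        # One row per pseudo thread; the row count is the concurrency cap.
        self.slots = [
            {"running": False, "locked": False, "url": None}
            for _ in range(max_slots)
        ]

    def claim(self, url):
        """Return the index of the slot now crawling url, or None if refused."""
        # Refuse a url that another slot is already crawling.
        if any(s["running"] and s["url"] == url for s in self.slots):
            return None
        for index, slot in enumerate(self.slots):
            if not slot["running"] and not slot["locked"]:
                slot["running"] = True
                slot["url"] = url
                return index
        return None  # every slot is busy or locked

    def release(self, index):
        """Mark a slot as finished so the next heartbeat can reuse it."""
        self.slots[index]["running"] = False
        self.slots[index]["url"] = None

    def set_locked(self, index, locked):
        """Locked slots are skipped, e.g. to test with a single thread."""
        self.slots[index]["locked"] = locked


def heartbeat(slots, pending_urls):
    """Start as many crawls as free slots allow; return (slot, url) pairs."""
    started = []
    for url in pending_urls:
        index = slots.claim(url)
        if index is not None:
            started.append((index, url))
    return started
```

However often the heartbeat fires, it can never start more crawls than there are unlocked slots, so slow crawls delay new work instead of piling up.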
The Intelligent Queue
There is now a proper queue being used by the system! This queue has some rules that allow for better crawling of the web. First, it cannot exceed 30,000 URLs. This is because I want selects from the table to be swift and I don’t want the queue to become too large. Another rule is that the queue can only be populated from the queue generator class. This class gets candidates based on some set rules that can be configured and changed. Currently it favours uncrawled home pages of websites, so URLs with a URI of ‘/’. After that it favours websites with a host that has a very high combined link back rate. Currently the highest website in link backs is Twitter… No surprise there. The second highest is Tumblr.
This process of generating the queue may not be the most efficient way of crawling the best content on the web, and this is something that I have monitored and am aware of. Because the top candidates aren’t the best content on the web (they are mainly social media websites), I limited how many URLs from the same host can be added to the queue. After this was done I hard coded Wikipedia’s host nid into the queue generation to make sure that I was getting some sane and pretty reliable content to crawl.
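The rules above could be sketched roughly like this in Python. Everything named here (`Candidate`, `build_queue`, `per_host_limit`, `pinned_hosts`) is hypothetical, and ranking pinned hosts ahead of everything else is my assumption about what hard coding a host into the generator might look like; only the 30,000 cap, the home page preference, the link back ordering and the per-host limit come from the description.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    host: str
    uri: str
    crawled: bool
    host_link_backs: int  # combined link backs for the host


def build_queue(candidates, max_size=30000, per_host_limit=2, pinned_hosts=()):
    """Pick queue entries by rank, capped in total and per host."""

    def rank(candidate):
        # Tuples sort ascending and False sorts before True, so each
        # element is phrased as "is NOT preferred".
        return (
            candidate.host not in pinned_hosts,  # assumption: pinned first
            not (candidate.uri == "/" and not candidate.crawled),
            -candidate.host_link_backs,
        )

    queue = []
    per_host = Counter()
    for candidate in sorted(candidates, key=rank):
        if len(queue) >= max_size:
            break
        if per_host[candidate.host] >= per_host_limit:
            continue  # this host already has its share of the queue
        queue.append(candidate)
        per_host[candidate.host] += 1
    return queue
```

The per-host limit is what keeps a single heavily linked host from filling the whole queue, while the pinned hosts guarantee that some known-good content is always in the mix.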