Category Archives: Projects


Lirkr AngularJS & Bootstrap Update

Lirkr has recently had a face lift.

Over the past few weeks I have been working on rewriting the UI in AngularJS, with Twitter's Bootstrap as the style foundation. This was important for a few reasons, mainly rapid iteration and mobile support.

lirkr mobile view

Lirkr Mobile Supported News View using AngularJS and Twitter's Bootstrap


Angular

First, it allows better caching of content loaded from the server, which frees the server up to do more with less. Rather than repeatedly requesting recent searches, I can simply store them in the application for later reference, and Angular makes this quite easy.
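As a rough illustration (the module, factory, and field names here are mine, not Lirkr's), a tiny Angular factory is enough to keep recent search results around for the life of the page:

```javascript
// A sketch, not the actual Lirkr code: cache recent search results client-side
// so repeat lookups never have to hit the server.
angular.module('lirkr', []).factory('recentSearches', function () {
  var cache = {}; // term -> results already fetched this session

  return {
    has: function (term) { return cache.hasOwnProperty(term); },
    get: function (term) { return cache[term]; },
    put: function (term, results) { cache[term] = results; }
  };
});

// Usage inside a controller or service (hypothetical '/search' endpoint):
//   if (recentSearches.has(q)) { show(recentSearches.get(q)); }
//   else { $http.get('/search', { params: { q: q } })
//            .success(function (r) { recentSearches.put(q, r); show(r); }); }
```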

Second, with Angular everything is state based thanks to ui-router, a routing library that plugs into Angular. ui-router lets the application decide which state to activate based on the URL path. It also supports URL parameters and special-case handling for specific URLs, e.g. defaulting a certain directory to a specific state.
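A minimal sketch of what that configuration looks like, assuming hypothetical state names, templates, and controllers rather than Lirkr's real ones:

```javascript
// ui-router maps URL paths to states, supports URL parameters, and provides a
// default route for anything that doesn't match.
angular.module('lirkr', ['ui.router'])
  .config(function ($stateProvider, $urlRouterProvider) {
    // Any URL that doesn't match a state falls back to the news front page.
    $urlRouterProvider.otherwise('/news/top');

    $stateProvider
      .state('news', {
        url: '/news/:section',             // ':section' arrives as $stateParams.section
        templateUrl: 'partials/news.html',
        controller: 'NewsCtrl'
      })
      .state('search', {
        url: '/search?q',                  // query-string parameter
        templateUrl: 'partials/search.html',
        controller: 'SearchCtrl'
      });
  });
```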

A word of warning about AngularJS: it isn't the quickest thing to pick up, since it requires you to wrap your head around best practices that the documentation itself is flaky about. There are regular cases where one page will tell you the best practice is one way and the following page will say the complete opposite.

Bootstrap

Bootstrap was a good choice when deciding what to include in the client tech stack changes. It allows simple and easy desktop, tablet and mobile support. It's as simple as adding classes to your containers and letting Bootstrap work its magic.
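For example (illustrative only, assuming Bootstrap 3's grid and a made-up directive rather than Lirkr's actual markup), the same template stacks into a single column on phones and splits 8/4 on desktops:

```javascript
// The directive and its content are invented; the grid classes are Bootstrap's.
angular.module('lirkr', []).directive('newsRow', function () {
  return {
    restrict: 'E',
    template:
      '<div class="container">' +
        '<div class="row">' +
          '<div class="col-xs-12 col-md-8">Main story</div>' +    // full width on phones, 8/12 on desktop
          '<div class="col-xs-12 col-md-4">Related links</div>' + // full width on phones, 4/12 on desktop
        '</div>' +
      '</div>'
  };
});
```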

Bootstrap also has lovely documentation, along with a customizer that generates CSS containing only what you want and leaves out what you won't need. By using this tool you can trim the excess right out of the gate and get yourself a smaller stylesheet.

Summary

You can view the changes to the website here.

There are some significant changes. The first big one is the lack of search engine optimization; there isn't any. The website loads into states, and the states are all rendered by JavaScript, so unless Google executes that JavaScript when crawling and meta tags are added to the site, nothing will be indexed.

The second big change is the URL paths. I went from '/news/tech' to '/#/news/tech'. The hash is required by the URLs that drive the Angular states, and even with the added vhost changes I couldn't see a way to keep the old URL scheme.

Startup Weekend 2013 Lessons

Startup Weekend Kamloops

Bring a decent machine to work on.

I attended the event with a Surface Pro loaded with Ubuntu and a wi-fi dongle. Not an ideal setup when attempting to create a prototype to show the judges and the crowd. It didn't help that whenever the power was plugged in, the touch pad would send the cursor jumping across the screen even when my finger wasn't moving. I learned later that pushing down on the top left corner of the device corrected the issue, leading me to believe the device wasn't properly grounded. On top of all of that I was working in an un-configured copy of vim for everything, and the keyboard in use lacked an insert key, making pasting a rather annoying procedure that involved pushing down the left side of the tablet while right clicking to paste.

Lord of the Flies, the sequel…

When you hop into a group of people where everyone is a stranger to everyone else, two things happen. The first is an immediate and unnecessary power struggle between some of the people in the group. I found that a good cure for this is reminding people that there is a group leader, then asking what the group leader thinks of the idea. Side note: it's amazing how much buy-in you can get by asking people the right questions. Never demand; simply let them come to understand your reasoning by asking the right questions.

Come prepared.

If you want to have a really good time at a Startup Weekend, make sure you do your download preparation. Things like pulling your repo code ahead of time and making sure all of your Android SDKs are up to date will make things magical when you don't know what the internet will be like. The first three or so hours were spent just getting everything in line for one of the developers on the team; that could have been easily avoided with some quick prep.

You aren't allowed to pre-code something before the event, but you are allowed to hack something together with existing frameworks and libraries.

Book the next Monday off

I quickly learned that spending 30-plus hours of solid development time in a two-and-a-half-day window is really not as enjoyable as doing the same thing at home. The added stress of making sure that what you contribute is done in a timely manner and that you aren't letting anyone down can take its toll when it's 6:30 in the morning and you still haven't figured out how to efficiently query a database of latitude and longitude coordinates within an N kilometre range.
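For the curious, this is roughly the shape of that problem. A naive sketch (not what we shipped) that filters points to those within N km of an origin using the haversine formula; a real query would pre-filter with a bounding box or spatial index before doing this math per row:

```javascript
// Great-circle distance between two lat/lon points, in kilometres.
function haversineKm(lat1, lon1, lat2, lon2) {
  var R = 6371; // mean Earth radius in km
  var toRad = function (d) { return d * Math.PI / 180; };
  var dLat = toRad(lat2 - lat1);
  var dLon = toRad(lon2 - lon1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Keep only the points within nKm of the origin.
function withinRadius(points, origin, nKm) {
  return points.filter(function (p) {
    return haversineKm(origin.lat, origin.lon, p.lat, p.lon) <= nKm;
  });
}
```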

Also, because I live a fair distance away and was sleep deprived, I ended up missing the after-party, which would have been great to attend.

Commend efforts not just results

It is amazing what people can accomplish in a few days, and even when the results are less than ideal it is important to make sure they feel good about themselves. Startup Weekend, in my opinion, is all about learning new things, helping people out, and connecting with people you would not normally meet. It is important to give back and help out creative communities, and part of that involves recognizing people's feats and encouraging them to continue.

On a final note

Fake everything. Fake the presentation, fake the product, pad the numbers and make it look good.

We created an actual application: it worked on the back end, connected to the servers, and handled the information coming back over the wire. We created a prototype, not an MVP. A minimum viable product should be as simple as possible and exist purely for demonstration purposes. It doesn't need to work, it just needs to look like it does.


Lirkr is coming along… finally

Recently, after working out some kinks and rewriting the text storage engine, I decided to push the launch dates back to smooth out some little quirks and quarks… There has also been an overhaul of the news page, and I have decided that rather than create news just for the technology folk, I will make news for everyone and keep a really cool/fun technology section for those of us nerds with downtime.

The current large issue with news is that I would like to use APIs to pull the news feed content; however, not all of the more reputable journals and blogs on the internet offer one. I would have loved to pull content from the BBC, but found there is no public API for the news pages, which makes sense since that is their revenue stream. Another day I will try to hack something together and see what I can dig up as far as an XML or JSON version of the feed goes. There must be something, and with further inspection of their mobile app/site I should be able to find it.
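When that day comes, the hack will probably look something like this sketch: pull an RSS/Atom-style XML feed and flatten it to JSON. The feed URL below is a placeholder, not a real BBC endpoint, and in practice this would run server-side rather than in the browser:

```javascript
// Parse an XML feed string into plain objects.
function parseFeed(xmlText) {
  var doc = new DOMParser().parseFromString(xmlText, 'application/xml');
  return Array.prototype.map.call(doc.querySelectorAll('item'), function (item) {
    var text = function (tag) {
      var el = item.querySelector(tag);
      return el ? el.textContent : '';
    };
    return { title: text('title'), link: text('link'), published: text('pubDate') };
  });
}

// Hypothetical usage:
// fetch('http://example.com/news/feed.xml')
//   .then(function (res) { return res.text(); })
//   .then(function (xml) { console.log(parseFeed(xml)); });
```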

lirkr technology news screen cap

Technology news page


As it currently stands the page could use a fair amount of work, but I am content for the time being. The next step is to add a proper API caching layer, write the general/world news page, then call it a day and move on to the more technical task of searching algorithms and searching algorithms and searching algorithms…
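The caching layer doesn't exist yet; a bare-bones sketch of the idea (with made-up names) is to cache responses per key and evict them after a time-to-live so the news doesn't go stale:

```javascript
// Tiny in-memory cache with expiry.
function makeCache(ttlMs) {
  var entries = {}; // key -> { value, expires }
  return {
    get: function (key) {
      var e = entries[key];
      if (!e || Date.now() > e.expires) { delete entries[key]; return null; }
      return e.value;
    },
    set: function (key, value) {
      entries[key] = { value: value, expires: Date.now() + ttlMs };
    }
  };
}

// Usage: cache news API responses for five minutes.
var newsCache = makeCache(5 * 60 * 1000);
```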

In between wanting to no longer live because of the searching algorithms, I will add the search auto-complete functionality, since that is pretty important I suppose…
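A hedged sketch of how that auto-complete might work in Angular: debounce keystrokes, then ask the server for suggestions. The '/api/suggest' endpoint and controller name are hypothetical, not an existing Lirkr API:

```javascript
angular.module('lirkr', []).controller('SearchCtrl', function ($scope, $http, $timeout) {
  var pending;
  $scope.query = '';
  $scope.suggestions = [];

  $scope.onType = function () {        // wire to ng-change on the search input
    if (pending) { $timeout.cancel(pending); }
    pending = $timeout(function () {   // wait 250ms of quiet before querying
      $http.get('/api/suggest', { params: { q: $scope.query } })
        .success(function (data) { $scope.suggestions = data; });
    }, 250);
  };
});
```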


Particles: Isometrics

Currently the only real work that has been going on is updates to the crawler and getting all of the resources figured out, but I have set aside a little bit of time to work on Particles. Particles is a JavaScript engine that can easily be added into a PhoneGap build.

I have gone on in previous posts about the project and the intentions I have for it. If you would like to read more, then I suggest you go here: Shameless project plug

I have started working on an isometric view that will allow scaling and, in the future, panning. All of this requires me to build an overhead grid that the isometric view and renderer will plug into.
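The math behind that grid is the classic 2:1 isometric projection; here is a small sketch (function and tile-size names are mine, not the engine's) of the grid-to-screen transform with a zoom factor, plus the inverse for picking tiles:

```javascript
var TILE_W = 64, TILE_H = 32; // illustrative tile dimensions

// Convert a grid cell to screen coordinates; panning just shifts the offset.
function gridToScreen(col, row, zoom, offset) {
  return {
    x: (col - row) * (TILE_W / 2) * zoom + offset.x,
    y: (col + row) * (TILE_H / 2) * zoom + offset.y
  };
}

// Inverse transform, handy for working out which tile was clicked.
function screenToGrid(x, y, zoom, offset) {
  var sx = (x - offset.x) / zoom, sy = (y - offset.y) / zoom;
  return {
    col: Math.floor((sx / (TILE_W / 2) + sy / (TILE_H / 2)) / 2),
    row: Math.floor((sy / (TILE_H / 2) - sx / (TILE_W / 2)) / 2)
  };
}
```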

If you are interested in the project then I suggest you fork a copy or get in contact with me. All of the information can be found on the repository wiki pages.

Forgotten Crawler Updates: API

I forgot to mention this in the last post, and seeing as it is a pretty big update, I figured I should cover it here. I have created an API that gives specified people access to the crawler and the database that drives the search technology. It allows selected people to perform searches, add items to the queue, look up host information, and much more.

Because this API is pretty locked down and others do not have access to it, I am working on a limited version to offer to the outside world. It is not currently finished, but if you would like to take a look at it, send an email to caleb [dot] jonasson at gmail and make the subject "lirkrawler api access". It will help if you tell me why you are interested in getting your hands on the API.

Note that if people start abusing the system or performing large database hits with insane limits I am just going to turn off their access since this project is currently on a home server.

Web Crawler Update: Hardware, Storage, Threading and The Intelligent Queue

A lot has changed with the web crawler since I last posted an update. Even though the posts have been few and far between, the project has had nightly revision changes even when I am swamped with other projects; there is always time for the crawler. In this post I will be talking about four main updates: hardware, storage, threading, and an intelligent queue.

Hardware

The main upgrade was to the server hardware. Upon discovering that the key efficiency was consistently less than 95%, I had to either make the database more efficient or find a way to increase the buffer size. The solution was to back up the database and move it off the server. Once this was done I nuked MySQL from the server and made that box a dedicated Apache box. Then came the fun of ordering a new MySQL server. The new box contains two 2TB HDDs, currently in RAID 1, and 32GB of memory, which in theory should get rid of my key efficiency issues. Then it was time to move the database onto the new box and get things rolling again.
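For context, the "key efficiency" being watched here is MySQL's key cache hit rate, usually derived from the Key_reads and Key_read_requests status counters; the numbers below are made up purely to show the arithmetic:

```javascript
// Illustrative values only.
var keyReadRequests = 1000000; // index block reads requested
var keyReads        = 60000;   // requests that had to go to disk
var keyEfficiency   = 1 - keyReads / keyReadRequests;
console.log((keyEfficiency * 100).toFixed(1) + '% key efficiency'); // 94.0% -> below the 95% target
```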

Storage

I realized that how I was storing the information was unrealistic. A colleague of mine recommended that I store the pages in a file system rather than in a database, but this didn't really appeal to me; I would rather parse the page line by line and store a linked page than move to a file system. The solution for now is to hash the pages and make sure I am not storing the same page twice. The page nid is then linked back to the URL that was crawled. This seems to have made everything much more efficient, especially since crawling websites with user-generated content produces a lot of 404 pages or pages that display the same thing as another page I may have crawled in the past. With Twitter alone there were roughly 176,000 duplicate pages stored in the database, and 780,000+ duplicate pages from Amazon.
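A compressed sketch of that dedupe-by-hash idea (the real crawler is PHP; this is JavaScript/Node for consistency with the other examples, with an in-memory map standing in for the database):

```javascript
var crypto = require('crypto');

var pagesByHash = {}; // content hash -> page id, standing in for a DB lookup
var nextPageId = 1;

function storePage(url, html) {
  var hash = crypto.createHash('sha1').update(html).digest('hex');
  var pageId = pagesByHash[hash];
  if (!pageId) {                 // first time we've seen this exact content
    pageId = nextPageId++;
    pagesByHash[hash] = pageId;  // "INSERT INTO pages ..." in the real system
  }
  // Either way, link the crawled URL back to the (possibly shared) page id.
  return { url: url, pageId: pageId };
}
```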

Threading

Probably the best thing I added to the crawler was the pseudo-threading process. Because the application is PHP based, the crawling process starts on a heartbeat that occurs every 15 seconds. This heartbeat kicks off multiple crawl scripts and the process begins. Before I had the threading enabled I had to set the heartbeat to about 5 minutes, because if a crawl took longer than expected and the database selects and inserts queued up, the system would snowball and everything would cease to work efficiently and properly. Keep in mind that there wasn't a new MySQL database at this time.

I created a table that tracks instances of each crawl and records how many tasks are currently running. These threads can be stopped, started, and locked, which allows for a simple way to turn off the crawler or, when testing, to run only one thread at a time. Each thread records which URL it is currently crawling, which eliminates the chance of the same URL being crawled twice at the same time.

Because there is a finite number of simultaneous page crawls happening at any one time, I can be sure that even if a queue builds up on the MySQL server it will never get out of hand; any snowballing is contained before it can start.
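A toy version of that bounded heartbeat (JavaScript here; the real crawler is PHP, and the thread count lives in a MySQL table rather than a variable): new crawl workers only start while slots are free, so load can never pile up:

```javascript
var MAX_THREADS = 4;      // illustrative cap on simultaneous crawls
var activeThreads = 0;    // the real system tracks this in the threads table

function crawlNext(done) {
  // ...pull a URL from the queue, fetch it, parse it, store the results...
  setTimeout(done, 1000); // stand-in for the actual crawl work
}

setInterval(function heartbeat() {
  while (activeThreads < MAX_THREADS) {
    activeThreads++;
    crawlNext(function () { activeThreads--; });
  }
}, 15000); // one heartbeat every 15 seconds
```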

The Intelligent Queue

There is now a proper queue being used by the system! This queue has some rules that allow for better crawling of the web. First, it cannot exceed 30,000 URLs, because I want selects from the table to be swift and I don't want the queue to become too large. Another rule is that the queue can only be populated by the queue generator class. This class picks candidates based on a set of rules that can be configured and changed. Currently it favours uncrawled home pages of websites, i.e. URLs with a URI of '/'. After that it favours hosts with a very high combined link-back count. Currently the host with the most link backs is Twitter… no surprise there. The second highest is Tumblr.

This process of generating the queue may not be the most efficient way of reaching the best content on the web, and that is something I have monitored and am aware of. Because the most linked-to hosts are mainly social media websites rather than the best content on the web, I put a limit on how many URLs from the same host can be added to the queue. After that I hard-coded Wikipedia's host nid into the queue generation to make sure I was getting some sane and fairly reliable content to crawl.
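A rough sketch of the queue-generation rules described above (the real generator lives in PHP/MySQL; the field names and the per-host cap value are illustrative):

```javascript
var QUEUE_LIMIT = 30000;  // hard cap on queue size
var PER_HOST_CAP = 100;   // cap on URLs from any single host (value made up)

function score(candidate) {
  // Uncrawled home pages ('/') come first, then hosts with high link-back counts.
  var base = candidate.hostLinkBacks || 0;
  return (!candidate.crawled && candidate.uri === '/') ? base + 1e9 : base;
}

function generateQueue(candidates) {
  var perHost = {};
  return candidates
    .slice()                                              // don't mutate the input
    .sort(function (a, b) { return score(b) - score(a); })
    .filter(function (c) {
      perHost[c.host] = (perHost[c.host] || 0) + 1;
      return perHost[c.host] <= PER_HOST_CAP;             // limit any single host
    })
    .slice(0, QUEUE_LIMIT);
}
```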

Shameless Project Plug

GitHub is where it is at these days, and so is the mobile application marketplace (or so I hear). I have been working diligently, 5 minutes a day, on a project called Particle. What was going to be a particle engine written in JavaScript turned into something much more than just a particle engine. It became a beast. A monster. A JavaScript engine like all others.

The idea is to create an application framework for mobile application development using JavaScript. I would like this project to tie into the almighty Adobe-Apache Cordova project, not officially of course, but with enough configurable hooks that it can easily be used with PhoneGap.

Features

Particles will allow rapid development and prototyping of canvas-based applications. Some of the features are:

  • Error logging through the console, stored in the application as a stack so different levels can be displayed.
  • Handlers that allow different renderers for the canvas, e.g. isometric, 2D, faux 3D, etc.
  • Easy set-up for browser or mobile applications.
  • Configurable resource loading.
  • Application state handling.
  • Dynamic canvas resizing (sketched below).
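To give a feel for that last feature without inventing the Particles API, here is a minimal, framework-free sketch of dynamic canvas resizing: keep the canvas matched to the window and re-render on resize (the element id and render body are placeholders):

```javascript
var canvas = document.getElementById('stage'); // assumes a <canvas id="stage"> on the page
var ctx = canvas.getContext('2d');

function render() {
  // Whatever the active application state's renderer would draw.
  ctx.fillStyle = '#202020';
  ctx.fillRect(0, 0, canvas.width, canvas.height);
}

function resize() {
  canvas.width = window.innerWidth;
  canvas.height = window.innerHeight;
  render(); // redraw at the new size
}

window.addEventListener('resize', resize);
resize();
```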

Code Repository

All of the code for the application is stored on GitHub, which can be accessed here. If you would like to take part or use the code base, feel free to fork it, although I am also looking for people to work on the original code base.

Particles Code Repo

Sumsumsum.com updates

If you are following along with the recent posts about what has been going on around here, then you will know that I have been on a bit of a JavaScript kick for the past few months. Because of this kick I have decided to start writing some lessons for those interested in learning JavaScript.

I plan on posting these lessons not only to http://sumsumsum.com but also to http://codewithdesign.com. If you are interested in reading over these articles, I will provide the links.

Along with the updates to sum3 I have been writing more articles for code with design.

30 Day Project

New years resolution time!

  • Work out more.
  • Make meals from home to take to the office.
  • Make a new web/proof of concept/mobile application in 30 days, every 30 days. (Should spark some new things to write about.)
  • Index 1 billion websites.
  • Work with Perl again.

This article is more about the 30-day projects every thirty days, and this is the first day of the project cycle. I am actually going to pretend that it is the 3rd day of the project just so I can work on something new every month. I can then have a better range of articles to work with and categorize them based on when I started development for each project. I also plan on giving my projects better names in the articles rather than calling something an engine and just referring to it as an HTML5 project…

Anyway, I just wanted to post an update, and I might as well bring up the new project. I have just finished the OAuth script and tied it in with Twitter to allow user access via their Twitter accounts. I have also registered a new domain for the project: http://epitrakk.com/ The idea is that a user will be able to record how far along they are in a series, or check off the episodes they have already seen. Simple concept, and I should be able to OD on Monster and have something operational in the next week or two.


A Dreamhost Issue leads to a new server

Oh Dreamhost…

Recently my crawling attempts came to a very swift end. I may have filled up Dreamhost's MySQL server 'madal', which caused multiple people's websites to go down until the issues were resolved. Unfortunately for me, the resolution meant them moving my MySQL data to a temporary location until I contacted customer support.

Turns out that their 'unlimited' policy only allows for 3GB of MySQL usage rather than the 109GB of indexed websites I had sitting on there. This meant I had to find another solution to my problem.

Solution A: Dump archives of the indexes, dump raw content, dump backups and try to work with virtually no storage.

Solution B: Go and buy a small form factor computer, pull the DVD drive out, and replace it with a second hard disk. Turn the computer into a home server and start the process again with a better code design behind it.

New Server

I went with solution B because it allowed me to have a new toy and work on my own system. The server has an i3 putting out just over 3.0GHz, 3TB of HDD space in total, and 8GB of RAM. It is hooked up via gigabit Ethernet, which lets me better work things out and write some of the processing scripts in C++ so I can constantly crunch information.

I don't mean to hate on Dreamhost. They really are an excellent company, and they are pretty prompt when it comes to answering requests for help. The so-called unlimited policy is on the ridiculous side of things, but for what I pay annually I can't complain.

Indexing stats:

Stored URLs: 1.8m
Checked URLs: 56m
URL count: 27m