New Year's resolution time!
Work out more.
Make meals from home to take to the office.
Make a new web/proof of concept/mobile application in 30 days every 30 days. (should spark some new things to write about)
Index 1 billion websites.
Work with Perl again.
This article is mostly about the 30-day projects every thirty days, and today is the first day of this project cycle. I am actually going to pretend it is the 3rd day of the project, just so I can start something new every month. That way I have a better range of articles to work with and can categorize them based on when I started development on each project. I also plan on giving my projects proper names in the articles, rather than calling something an engine or just referring to it as an html5 project…
Anyway, I just wanted to post an update, and I might as well introduce the new project. I have just finished the OAuth script and tied it in with Twitter to allow users to sign in through their Twitter accounts. I have also registered a new domain for the project: http://epitrakk.com/. The idea is that a user will be able to record how far along they are in a series, or check off the episodes they have already seen. It's a simple concept, and I should be able to overdose on Monster and have something operational in the next week or two.
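For the curious, Twitter's API at this point uses OAuth 1.0a, and the fiddly part of any OAuth script is the HMAC-SHA1 request signing. Here is a minimal, stdlib-only sketch of that step — all names are illustrative, and this is not the actual epitrakk code:

```python
import base64
import hashlib
import hmac
import urllib.parse

def oauth1_signature(method, url, params, consumer_secret, token_secret=""):
    """Compute an OAuth 1.0a HMAC-SHA1 signature for a request.

    `params` should contain the oauth_* parameters plus any query/body
    parameters; `token_secret` is empty during the request-token step.
    """
    # 1. Percent-encode every key and value, then sort the pairs.
    encoded = sorted(
        (urllib.parse.quote(k, safe=""), urllib.parse.quote(str(v), safe=""))
        for k, v in params.items()
    )
    param_string = "&".join(f"{k}={v}" for k, v in encoded)

    # 2. Signature base string: METHOD & encoded URL & encoded params.
    base = "&".join((
        method.upper(),
        urllib.parse.quote(url, safe=""),
        urllib.parse.quote(param_string, safe=""),
    ))

    # 3. Signing key: consumer secret and token secret joined by "&".
    key = (
        urllib.parse.quote(consumer_secret, safe="")
        + "&"
        + urllib.parse.quote(token_secret, safe="")
    )

    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()
```

The resulting string goes into the `oauth_signature` parameter of the Authorization header; libraries will do all of this for you, but it helps to know what they are signing when Twitter rejects a request.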
Recently my crawling attempts came to a very swift end. I managed to fill up Dreamhost's MySQL server, 'madal', which took multiple people's websites down until the issues were resolved. Unfortunately for me, resolving them meant moving my MySQL data to a temporary location until I contacted customer support.
It turns out that their 'unlimited' policy only allows for 3GB of MySQL usage, rather than the 109GB of indexed websites I had sitting on there. That meant I had to find another solution to my problem.
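If you want to catch this before your host does, per-schema usage can be read straight out of `information_schema`. A quick sketch — the query is standard MySQL, but the helper function and the 3GB threshold are just my illustration of the quota I hit:

```python
# MySQL reports each table's footprint in information_schema.tables;
# summing data_length + index_length per schema gives the number a
# host's quota is measured against. Run via any MySQL client:
SIZE_QUERY = """
SELECT table_schema,
       SUM(data_length + index_length) AS bytes_used
FROM information_schema.tables
GROUP BY table_schema;
"""

def over_quota(rows, quota_gb=3.0):
    """Given (schema, bytes_used) rows, return the schemas over quota."""
    limit = quota_gb * 1024 ** 3
    return [schema for schema, used in rows if used > limit]
```

Running something like this on a cron job would have warned me long before I hit 109GB and took 'madal' down with me.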
Solution A: Dump archives of the indexes, dump raw content, dump backups and try to work with virtually no storage.
Solution B: Go and buy a mini form factor computer, pull out the DVD drive and replace it with a second hard disk, turn the machine into a home server, and start the process again with a better code design behind it.
I went with solution B because it gave me a new toy and let me work on my own system. The server has an i3 putting out just over 3.0GHz, 3TB of HDD storage in total, and 8GB of RAM. It's hooked up via gigabit, which lets me better work things out and write some of the processing scripts in C++. This way I can constantly crunch information.
I don't mean to hate on Dreamhost. They really are an excellent company, and they are quick to respond when you come to them with a problem. The so-called unlimited policy is misleading, but for what I pay annually I can't complain.
Stored URLs: 1.8m
Checked URLs: 56m
URL count: 27m