Recently I have been working on indexing everything. I built a scraper a week ago and have since accumulated over half a million domains, along with content such as titles, paragraphs, tags, etc.
Today’s job is to parse this information better and create a better structure for tracking changes in content through revisions. I plan on storing a revision number, a url_nid, and an md5 of the page’s content.
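As a rough sketch of that plan, assuming Python and an in-memory dictionary standing in for whatever database the scraper actually writes to, the revision bookkeeping could look something like this. The names `Revision`, `content_md5`, and `record_revision` are made up for illustration; only the three stored fields (revision number, url_nid, md5) come from the plan above.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Revision:
    url_nid: int   # identifier of the URL this revision belongs to
    revision: int  # revision counter, starting at 1 for each url_nid
    md5: str       # hex MD5 digest of the page's content


def content_md5(content: str) -> str:
    """Return the hex MD5 digest of the page content (UTF-8 encoded)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def record_revision(history: dict, url_nid: int, content: str) -> Revision:
    """Store a new revision only if the content hash differs from the latest one.

    `history` is a hypothetical in-memory stand-in for a database table:
    it maps url_nid -> list of Revision objects, oldest first.
    """
    digest = content_md5(content)
    revisions = history.setdefault(url_nid, [])
    if revisions and revisions[-1].md5 == digest:
        # Content unchanged since the last scrape: keep the existing revision.
        return revisions[-1]
    new_revision = Revision(url_nid=url_nid, revision=len(revisions) + 1, md5=digest)
    revisions.append(new_revision)
    return new_revision


# Example: scraping the same URL three times, with one real change.
history = {}
first = record_revision(history, 42, "<p>hello</p>")
second = record_revision(history, 42, "<p>hello</p>")
third = record_revision(history, 42, "<p>hello world</p>")
```

Comparing against only the most recent hash means a page that flips back to older content still gets a new revision number, which keeps the history in scrape order. MD5 is used here purely as a change detector, not for anything security related.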
This image should give you a quick understanding of what the server load was a week ago and what it has been doing for the past week. It is pretty obvious when the scraper has been turned on and off…
This method should give me a decent understanding of how the content changes over time, and a better look at the information once it is plugged into a search engine that I built a few years ago.