I ran into a bit of a snag when my crawler hit a server that served up a raw media file, which caused an unhandled timeout. The fix was simply to pass in a stream context with a timeout:
// Give HTTP streams a 1-second timeout so a slow or streaming response can't hang the crawler.
$context = stream_context_create(array('http' => array('timeout' => 1)));
file_get_contents("http://cjonasson.com/", false, $context);
I added a few counters to the interface, but they are taxing the database a little too much. They update and look really good, but they should only refresh once every 15 seconds rather than 10 or more times per second.
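One way to get that 15-second refresh without touching the page code much is to cache the counter values server-side. This is just a sketch of how I might do it; `get_counters_from_db()` is a stand-in for the real query, and the cache path is arbitrary:

```php
<?php
// Hypothetical sketch: serve counter values from a 15-second file cache so
// the page can poll freely without the expensive query running every time.
function cached_counters($cacheFile = '/tmp/counters.json', $ttl = 15) {
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        // Cache is still fresh, skip the database entirely.
        return json_decode(file_get_contents($cacheFile), true);
    }
    $counters = get_counters_from_db(); // the heavy query, now at most once per $ttl seconds
    file_put_contents($cacheFile, json_encode($counters));
    return $counters;
}
```

With that in place the interface can keep polling as often as it likes and the database only sees one counter query every 15 seconds.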
A while back I started work on the crawler, and I am about to pick it up again. This time I will be optimizing it, storing the information, and logging the speed of each database transaction to test how the cloud hosting service handles this many connections.
Thank you for the unlimited bandwidth and transfer, DreamHost, because I will be making good use of your service tonight. I don't plan on turning this service off, and I would like to add some form of multitasking on the server, or at least increase the number of connections made per second. So far it is indexing 10 sites a second, but I think I could easily get that up to 50 with a little bit of work and a little lack of sleep.
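The simplest way I know to get that kind of parallelism out of PHP without real threads is `curl_multi`, which runs several fetches concurrently instead of one `file_get_contents` at a time. A rough sketch of the idea, with placeholder URLs standing in for whatever the crawler's queue hands back:

```php
<?php
// Hypothetical sketch: fetch a batch of pages in parallel with curl_multi.
$urls = array('http://example.com/a', 'http://example.com/b');

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 1); // same 1-second cap as the stream context above
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all the transfers until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse and index $html here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

Batching 20 to 50 URLs per pass like this should be enough to find out whether the database, and not the fetching, becomes the bottleneck.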