Tumbleweed and that damn CommonCrawl

I hate starting a post by apologising for not updating my blog, so I won’t do that!

I have been a bit busy with the new job I started 3 weeks ago, so most of my side projects have been paused! However, work on Burf.co has gone two steps forward, a couple to the left and then a couple of steps backwards, largely thanks to the awesome CommonCrawl.org, which has crawled a huge part of the Internet and opened it up for anyone to use. They have petabytes of web data freely available, and there are some really cool examples of how to use it, most involving a huge amount of cloud power! I did ponder for quite a while how I would ever store that much data. Then I found an interesting Java project that scans the CommonCrawl index for interesting file types (https://github.com/centic9/CommonCrawlDocumentDownload).
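To give a feel for what "scanning the index" means, here is a minimal sketch (not the project above, just an illustration) that queries the public CommonCrawl index server for a URL pattern and prints the raw JSON lines it returns. The crawl ID and domain pattern are placeholders I picked for the example; swap in whichever crawl and site you care about.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Query the CommonCrawl CDX index server for a URL pattern and dump the
// first few JSON result lines. Each line describes one captured page and
// where its record lives inside a WARC file (filename, offset, length).
public class IndexQuery {
    public static void main(String[] args) throws Exception {
        String crawl = "CC-MAIN-2019-35";                        // example crawl id
        String pattern = URLEncoder.encode("bbc.co.uk/*", StandardCharsets.UTF_8);
        String url = "https://index.commoncrawl.org/" + crawl
                + "-index?url=" + pattern + "&output=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        response.body().lines().limit(10).forEach(System.out::println);
    }
}
```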

I took this project, hacked it about a bit and changed it so that it would only return URLs whose MIME type is HTML and whose response status was 200. That gave me around 50 million URLs to play with, each with a file pointer to the actual web page data. Because this data is compressed, it’s far quicker to download it from CommonCrawl than to scrape the website itself. CommonCrawl also respects robots.txt, which is far more than I have ever done :). So far the end result is that I can get around 5 million pages of data a day (from my home internet) compared to around 500k on a good day before! That’s a pretty good increase!
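The "file pointer" part is what makes this fast: each entry in the index tells you which WARC file a page lives in, plus a byte offset and length, and each record is its own gzip member, so you can pull just that slice with an HTTP Range request and decompress it on its own. Below is a rough sketch of that idea, assuming the index JSON fields (mime, status, filename, offset, length) and the data.commoncrawl.org download host; the sample index line is made up purely for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

// Take one index line, keep it only if it is an HTML page that returned 200,
// then fetch just that record from the WARC file via an HTTP Range request
// and gunzip it.
public class FetchRecord {

    private static final Pattern FIELD = Pattern.compile("\"(\\w+)\":\\s*\"([^\"]*)\"");

    // Pull the simple string fields out of one JSON index line.
    static Map<String, String> parseIndexLine(String line) {
        Map<String, String> fields = new HashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) fields.put(m.group(1), m.group(2));
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Made-up example line; in practice feed in real lines from the index query.
        String indexLine = args.length > 0 ? args[0]
                : "{\"url\": \"http://example.com/\", \"mime\": \"text/html\", "
                + "\"status\": \"200\", \"filename\": \"crawl-data/CC-MAIN-2019-35/"
                + "segments/.../warc/....warc.gz\", \"offset\": \"1234\", \"length\": \"5678\"}";

        Map<String, String> rec = parseIndexLine(indexLine);

        // The same filter applied to the index: successful HTML captures only.
        if (!"text/html".equals(rec.get("mime")) || !"200".equals(rec.get("status"))) {
            System.out.println("Skipping non-HTML or non-200 record");
            return;
        }

        long offset = Long.parseLong(rec.get("offset"));
        long length = Long.parseLong(rec.get("length"));

        // Each record is its own gzip member, so a byte-range request for
        // offset..offset+length-1 can be decompressed independently.
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://data.commoncrawl.org/" + rec.get("filename")))
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .GET().build();

        HttpResponse<java.io.InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());

        if (response.statusCode() != 206) {        // expect a partial-content reply
            System.out.println("Unexpected response: " + response.statusCode());
            return;
        }

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(response.body()), StandardCharsets.UTF_8))) {
            // The decompressed record is WARC headers + HTTP headers + the HTML itself.
            reader.lines().limit(20).forEach(System.out::println);
        }
    }
}
```

Downloading a few kilobytes of gzipped record per page, instead of making a live request to every site, is where the jump from ~500k to ~5 million pages a day comes from.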
