The answer is 2,828,752,948

Well, I never thought I would get there, and it took a few attempts, but I managed to load the CommonCrawl text corpus of the Internet into MongoDB. This is around 13TB of text, which is definitely more than I can read in a day!
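For anyone curious, the core of the loading job looks roughly like the sketch below. This is not the exact code I ran, just a minimal version assuming the warcio and pymongo libraries, CommonCrawl WET files on local disk, and placeholder database and collection names.

```python
import gzip

from pymongo import MongoClient
from warcio.archiveiterator import ArchiveIterator

client = MongoClient("mongodb://localhost:27017")
pages = client["commoncrawl"]["pages"]  # placeholder db/collection names

def load_wet_file(path, batch_size=1000):
    """Stream one CommonCrawl WET file into MongoDB in batches."""
    batch = []
    with gzip.open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files hold the extracted plain text as 'conversion' records
            if record.rec_type != "conversion":
                continue
            batch.append({
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "text": record.content_stream().read().decode("utf-8", "replace"),
            })
            if len(batch) >= batch_size:
                pages.insert_many(batch)
                batch = []
    if batch:
        pages.insert_many(batch)
```

Batching the inserts rather than writing one document at a time is what kept the job from taking forever at this scale.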

The next task is to work out the language of each page (I have a few ways to do that) and then ignore anything that is not English. I am not really sure how I would analyse text data in a language I do not understand, hence why I plan to skip it.
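One option is a library like langdetect. Something along these lines would tag every stored page with a language code, so everything that isn't "en" can simply be ignored later. This is a sketch, not my final approach, and it assumes the same pages collection as above with a lang field I'm inventing here.

```python
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def tag_languages(pages):
    """Tag each stored page with a detected language code, e.g. 'en'."""
    for page in pages.find({"lang": {"$exists": False}}, {"text": 1}):
        try:
            # a sample of the text is enough for a reliable guess
            lang = detect(page["text"][:2000])
        except LangDetectException:
            lang = "unknown"  # empty or unrecognisable text
        pages.update_one({"_id": page["_id"]}, {"$set": {"lang": lang}})
```

After that, the English-only subset is just pages.find({"lang": "en"}).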

After that I need to try and work out what each page is actually about, then chuck it into Elasticsearch.
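The indexing side is fairly standard. Here's roughly how it might look with the official Python client and its bulk helper; the index name and field layout are placeholders, not a final schema.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # placeholder host

def index_english_pages(pages, index_name="burf-pages"):
    """Bulk-index the English pages from MongoDB into Elasticsearch."""
    def actions():
        for page in pages.find({"lang": "en"}, {"url": 1, "text": 1}):
            yield {
                "_index": index_name,
                "_id": str(page["_id"]),   # reuse the Mongo id so re-runs overwrite
                "url": page["url"],
                "text": page["text"],
            }
    bulk(es, actions())
```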

File Search

CommonCrawl also contains millions of links (63,270,007) to files like PDFs, Docs, and images. I have started processing this data to see what useful information I can extract.
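The first pass is just working out what kinds of files are in there. Assuming the links sit in a MongoDB collection with a url field (again, a placeholder layout), a quick tally by extension looks something like this:

```python
from collections import Counter
from urllib.parse import urlparse

FILE_EXTENSIONS = {".pdf", ".doc", ".docx", ".jpg", ".jpeg", ".png", ".gif"}

def count_file_types(links):
    """Count how many of the stored links point at each file type."""
    counts = Counter()
    for link in links.find({}, {"url": 1}):
        path = urlparse(link["url"]).path.lower()
        ext = path[path.rfind("."):] if "." in path else ""
        if ext in FILE_EXTENSIONS:
            counts[ext] += 1
    return counts
```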

Burf.co Website

Shocking, I know. I think I need to hire someone to do a good job of it! Watch this space.
