Well, I never thought I would get there, and it took a few attempts, but I managed to stick the CommonCrawl text corpus of the Internet into MongoDB. This is around 13TB of text, which is definitely more than I can read in a day!
The next task is to work out the language of each page (I have a few ways to do that), then ignore anything that is not English. I am not really sure how I would analyse text data in a language I do not understand, hence the plan to skip it.
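For the curious, here is a minimal sketch of what that language-tagging pass might look like, using the langdetect library over documents in MongoDB. The database, collection, and field names are assumptions for illustration, not my actual schema.

```python
# Tag each stored page with a detected language so non-English pages can be skipped later.
# The "commoncrawl"/"pages" names and the "text"/"lang" fields are hypothetical.
from langdetect import detect  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["commoncrawl"]["pages"]

# Only look at pages that have not been tagged yet.
for page in pages.find({"lang": {"$exists": False}}, {"text": 1}):
    try:
        # A prefix of the text is usually enough to identify the language.
        lang = detect(page["text"][:2000])
    except LangDetectException:
        lang = "unknown"  # empty or non-linguistic content
    pages.update_one({"_id": page["_id"]}, {"$set": {"lang": lang}})
```

Later passes can then simply filter on `{"lang": "en"}` and ignore everything else.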
After that I need to try to work out what each page is actually about, then chuck it into ElasticSearch.
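Something along these lines is what I have in mind for that step. This is a rough sketch against the Elasticsearch 8.x Python client; the index name, field names, and the classify_topic() helper are placeholders for whatever classifier I end up using.

```python
# Push English pages from MongoDB into an Elasticsearch index, with a topic label attached.
from elasticsearch import Elasticsearch  # pip install elasticsearch
from pymongo import MongoClient

es = Elasticsearch("http://localhost:9200")
pages = MongoClient("mongodb://localhost:27017")["commoncrawl"]["pages"]

def classify_topic(text: str) -> str:
    """Placeholder for whatever topic/context classifier gets chosen."""
    return "unknown"

for page in pages.find({"lang": "en"}, {"url": 1, "text": 1}):
    doc = {
        "url": page.get("url"),
        "text": page["text"],
        "topic": classify_topic(page["text"]),
    }
    # Reuse the MongoDB _id so re-runs update rather than duplicate documents.
    es.index(index="commoncrawl-pages", id=str(page["_id"]), document=doc)
```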
CommonCrawl also contains millions (63,270,007) of links to files such as PDFs, Docs, and images. I have started processing this data to see what useful information I can extract.
Shocking, I know. I think I need to hire someone to do a good job of it! Watch this space.