The answer is 2,828,752,948

Well, I never thought I would get there, and it took a few attempts, but I managed to stick the CommonCrawl text corpus of the Internet into MongoDB. This is around 13TB of text, which is definitely more than I can read in a day!

The next task is to work out the language of each page (I have a few ways to do that), then ignore anything that is not English. I am not really sure how I would analyse text data in a language I do not understand, hence why I plan to skip it.
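The post doesn't say which detection method will be used, but one lightweight way to filter for English is a stopword heuristic: count how many common English function words appear. A minimal sketch (the `looksEnglish` function and its threshold are illustrative, not from the original):

```javascript
// Minimal stopword heuristic for guessing whether text is English.
// A real pipeline would use a proper language-detection library;
// this just measures the share of common English function words.
const ENGLISH_STOPWORDS = new Set([
  "the", "and", "of", "to", "a", "in", "is", "it", "that", "for", "was", "on"
]);

function looksEnglish(text, threshold = 0.15) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length === 0) return false;
  const hits = words.filter(w => ENGLISH_STOPWORDS.has(w)).length;
  return hits / words.length >= threshold;
}
```

This is crude (short pages and other Germanic languages can fool it), but it is cheap enough to run across millions of pages before anything heavier.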

After that, I need to try and work out what each page is about, then chuck it into Elasticsearch.

File Search

CommonCrawl also contains millions (63,270,007) of links to files like PDFs, Docs, and images. I have started processing this data to see what useful information I can extract.

Burf.co Website

Shocking, I know. I think I need to hire someone to do a good job of it! Watch this space.

This week's update: Bye Bye MongoDB

So it is exciting times! I have made some progress with TRTLExchange; however, due to things outside of my control, it has been slower than expected.  So I have turned my spare time to Burf.co, my new search engine project, and while there is no website for it yet (there will be by the weekend), the actual search technology (code) has come along leaps and bounds.  Overnight it managed to index over 500,000 pages, which for a single server was pretty cool.  It did get up to 1.3 million pages, but MongoDB has, erm, shit the bed (many, many times).  This could be a hardware limit (hard drive speed) or some performance tuning I need to do; however, it gets to the point where I can't even insert more records without timeouts.  This concerns me quite a bit, as I have an HP Blade Server on the way to up the crawling rate by a factor of 8.  I am going to try and give it one last go today; however, it has taken 12 hours to delete the data from the DB (I did remove instead of drop 🙁 ).  It has been a very interesting learning curve learning MongoDB.  I think unless some magic happens, I am going to try out Postgres next.
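For anyone wondering why the remove-instead-of-drop mistake cost 12 hours, here is the difference in the (legacy) mongo shell, sketched on a hypothetical `pages` collection:

```javascript
// remove() deletes matching documents one by one, updating every
// index as it goes -- hours of work on millions of documents.
db.pages.remove({});

// drop() discards the whole collection and its indexes in one
// operation -- near-instant, the right call for a full wipe.
db.pages.drop();
```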

On the Swift front, I did start building the frontend for Burf. First I was going to do this in VueJS; however, I have now found that Swift's server-side framework Perfect supports templating via Mustache.  I think I will make faster progress writing it all in Swift than switching back and forth.   I still want to continue learning VueJS on the side (it was used for TRTLExchange), as JavaScript is such a good thing to know nowadays.

Writing this blog post has also reminded me that I was trying to learn Kotlin about a month ago (facepalm).  Damn!

 

Experimenting with MongoKitten

As mentioned in my previous post, I have started looking at server-side Swift with the aim of building a search engine (Burf.co).  To store my crawled data, I decided to try MongoDB as it supports full-text search out of the box.  The original Burf.com used Equinox (made by Compsoft) and later used Microsoft Indexing Service.  This time round, I wanted to be a little more scalable.  Now, there are probably better DB solutions for what I plan to do, but MongoDB seemed really simple to get up and running with.  Later on, I should be able to switch out the database layer if needed.

MongoKitten

Now that I had decided to use Swift and MongoDB, I needed to find a framework that connects them. My friend (who knows his stuff) recommended MongoKitten!  I got up and running with it fairly quickly, even though I don't know MongoDB too well. Life was good; however, there were a few things I did struggle with:

Contains

So, searching a field for a partial string requires you to use a regex (it seems).

Mongo:

db.users.findOne({"username": {$regex: ".*eBay.*"}});

MongoKitten:

let query: Query = [
    "url": RegularExpression(pattern: ".*\(eBay).*")
]

let matchingEntities: CollectionSlice<Document> = try pages.find(query)

Sorting results on $meta textScore

MongoDB allows you to set up full-text searching across your data; it can be across an entire record or just certain fields (name, address, etc.).  When you perform a full-text search, MongoDB returns the relevant records with a relevance score ($meta textScore).  MongoDB lets you change how it calculates these scores by allowing you to adjust the weight each field receives, e.g. name is more important than address.
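For reference, field weights are set when the text index is created. A sketch in the mongo shell, using hypothetical `title` and `body` fields on the `pages` collection:

```javascript
// Create a text index where a match in "title" counts 10x more
// than the same match in "body" when textScore is calculated.
db.pages.createIndex(
  { title: "text", body: "text" },
  { weights: { title: 10, body: 1 } }
);
```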

Mongo:

db.pages.find({$text: {$search: "ebay"}}, {score: {$meta: "textScore"}}).sort({score: {$meta: "textScore"}})

MongoKitten:

let query: Query = [
    "$text": ["$search": str],
    "lang": ["$eq": "en"]
]

let projection: Projection = [
    "_id": .excluded,
    "url": "url",
    "title": "title",
    "score": ["$meta": "textScore"]
]

let sort: Sort = [
    "score": .custom([
        "$meta": "textScore"
    ])
]

let matchingEntities: CollectionSlice<Document> = try pages.find(query, sortedBy: sort, projecting: projection, readConcern: nil, collation: nil, skipping: 0, limitedTo: Settings.searchResultLimit)

Getting Help

I found the best way to get help was to contact the creator of MongoKitten (Joannis) via Slack; he is pretty busy but super helpful!