To Lang=“en” or not to Lang?

So work has been progressing nicely and the index is up to 2 million+ pages. However, one of my aims was to only index English content. I thought that by looking for the HTML “lang” attribute, it would be really easy to do! Even if there were a few different versions like en-gb or en-us, I thought it would still be an easy task. I also thought that all popular/mainstream/big sites would use this attribute!

So 2 million English pages later and, well… I have over 300 different Lang=”*en*” variations! Plus major sites like Wikipedia don’t even use the attribute!!
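For what it's worth, here is a rough sketch of how the declared language could be read and the many variations collapsed down to their first part (so en-gb, en-us, EN_us and friends all become "en"). This is not my crawler's actual code; the names LangAttrParser, declared_lang and is_declared_english are made up for illustration, and it only uses Python's built-in html.parser. It assumes the value sits on the html element, and it can't help with pages that don't declare a language at all.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Remembers the lang attribute of the first <html> tag it sees."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value


def declared_lang(html_text):
    """Return the raw lang value from the <html> tag, or None if missing."""
    parser = LangAttrParser()
    parser.feed(html_text)
    return parser.lang


def is_declared_english(html_text):
    """True/False if a language is declared, None if the page says nothing."""
    lang = declared_lang(html_text)
    if lang is None:
        return None
    # Keep only the primary part: "en-GB", "EN_us" and "en" all give "en".
    primary = lang.strip().lower().replace("_", "-").split("-")[0]
    return primary == "en"
```

The None result is the important bit: a page with no declaration isn't "not English", it's "unknown", and those pages still need another way of checking.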

I guess it’s back to the drawing board!! I now need to use some sort of word-matching algorithm to look at the page content and then work out whether it’s English! A simplistic way of doing this could be to search for very common English words (the, a, at, there, etc.) and see whether they exist or not.
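A minimal sketch of that common-word idea might look like the following. Everything here is an assumption for illustration: the word list is a tiny hand-picked sample, the function names are made up, and the 0.15 threshold is a guess that would need tuning against real pages.

```python
import re

# A small, hand-picked sample of very common English words (an assumption;
# a real list would be longer).
COMMON_ENGLISH_WORDS = frozenset({
    "the", "a", "at", "there", "and", "of", "to", "in",
    "is", "it", "that", "for", "on", "with", "was",
})


def english_word_ratio(text):
    """Fraction of the words in text that appear in the common-word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in COMMON_ENGLISH_WORDS)
    return hits / len(words)


def looks_english(text, threshold=0.15):
    """Guess whether text is English; the threshold is an untested guess."""
    return english_word_ratio(text) >= threshold
```

For example, looks_english("The cat sat on the mat and looked at the door.") comes out True, while the same sentence in French ("Le chat est assis sur le tapis et regarde la porte.") comes out False. Very short pages, or pages mixing languages, would probably fool it.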

Update coming soon 🙂

