Text_LanguageDetect

Tuesday, April 25, 2006

Status

I've set up a new home for the demo of the package: http://languagedetect.org. You gotta love cheap domain names.

I've compiled many more language definitions using Wikipedia as sample text. I've been told that the existing Russian definition is of poor quality, so I would like to figure out how to tell whether the new version is more accurate.

The new language definitions include East Asian languages. These don't fare nearly as well as their Western counterparts, but they more or less work, with the exception of Chinese. There are simply so many characters in the Chinese writing system that you will rarely come across the same trigram more than once. This is a problem, because trigram-based scoring depends on seeing the same trigrams again and again. I need to come up with some new way to score East Asian languages in order for this to work.
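To make the sparsity problem concrete, here is a rough Python sketch (not the package's own code, which is PHP) that counts overlapping character trigrams and reports how many trigram occurrences belong to a trigram seen more than once. The function names trigram_counts and repeat_ratio are made up for this illustration.

```python
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def repeat_ratio(text):
    """Fraction of trigram occurrences whose trigram appears more than once.

    A value near 0 means almost every trigram in the sample is unique,
    which leaves a frequency-based comparison with very little to match on.
    """
    counts = trigram_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / total
```

Running repeat_ratio over an English sample and a Chinese sample of similar length should show the gap, though the exact numbers will depend on the text chosen.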

Simply treating the presence of any Chinese glyphs as solid proof of Chinese is not enough, because Chinese writing, especially on the internet, will often contain scraps of other languages mixed in. Of course the opposite is true as well: samples of writing in other languages will sometimes contain Chinese characters. So there needs to be some kind of scoring rule for deciding what is what.
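One possible shape for such a rule, sketched in Python purely as an illustration: measure what share of the letters in a sample are Han ideographs and compare it to a cutoff. Everything here is an assumption rather than a settled design. The names han_fraction and looks_chinese are hypothetical, the 0.5 threshold is an untested guess, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is checked. Japanese text also uses these characters, so a real rule would need more than this.

```python
def han_fraction(text):
    """Share of alphabetic characters that fall in U+4E00..U+9FFF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    han = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return han / len(letters)


def looks_chinese(text, threshold=0.5):
    """Guess 'Chinese' only when Han characters dominate the sample.

    The threshold is a placeholder and would need tuning on real samples.
    """
    return han_fraction(text) >= threshold
```

With this rule, a Chinese sentence that mentions "PHP" still scores as Chinese, while an English sentence quoting a couple of Chinese characters does not.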

I also have a project or two for which character encoding detection would be useful, so I would like to start working on that soon.

There are a couple of bug fixes in CVS, and I may make a release of them before I start adding stuff. This is the most likely thing to happen soon.