Text_LanguageDetect

Tuesday, April 25, 2006

Status

I've set up a new home for the demo of the package: http://languagedetect.org. You gotta love cheap domain names.

I've compiled many more language definitions using Wikipedia as sample text. I've been told that the existing Russian definition is of poor quality, so I would like to figure out how to tell whether the new version is more accurate.

The new language definitions include East Asian languages. These don't fare nearly as well as their Western counterparts, but they more or less work, with the exception of Chinese. There are simply so many characters in the Chinese writing system that it's unlikely you'll ever come across the same trigram combination more than once. This is a problem. I need to come up with some new way to score East Asian languages in order for this to work.
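To make the sparsity problem concrete, here is a rough sketch in Python (not the package's code; the function names are made up for this example). It counts overlapping character trigrams in a sample and reports what fraction of the distinct trigrams show up more than once. The expectation, which would need checking against real samples, is that this fraction is much lower for Chinese text than for text in an alphabetic script.

```python
from collections import Counter


def trigram_counts(text):
    """Count overlapping character trigrams in the text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def repeat_ratio(text):
    """Return the fraction of distinct trigrams that occur more than once."""
    counts = trigram_counts(text)
    if not counts:
        return 0.0
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)
```

A low ratio means the sample gives a trigram profile very little to match against, which is the situation described above.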

Simply treating the presence of any Chinese glyphs as solid proof of Chinese is not enough, because Chinese writing, especially on the internet, will often contain scraps of other languages mixed in. Of course the opposite is true as well: samples of writing in other languages will sometimes contain Chinese characters. So there needs to be some kind of scoring rule for deciding what is what.
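One possible scoring rule, sketched in Python purely as an illustration (this is an assumption, not something the package does): measure what share of the non-whitespace characters fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF) and compare it to a threshold. The names `han_ratio` and `looks_chinese` and the 0.5 threshold are hypothetical. Note that Japanese also uses characters from this block, so a rule like this cannot separate the two on its own.

```python
def han_ratio(text):
    """Share of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for ch in chars if '\u4e00' <= ch <= '\u9fff')
    return han / len(chars)


def looks_chinese(text, threshold=0.5):
    """Guess whether the text is predominantly written in Han characters.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return han_ratio(text) >= threshold
```

With a rule like this, a Chinese page with a few English words still scores high, while an English page quoting a single Chinese character scores low.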

I also have a project or two for which character encoding detection would be useful, so I would like to start working on that soon.

There are a couple of bug fixes in CVS, and I may make a release of them before I start adding stuff. This is the most likely thing to happen soon.

Tuesday, February 14, 2006

Stuck in encoding

So I started working on encoding detection... but I ran into some conceptual problems. Specifically, some languages can be written in multiple scripts, which in turn may be written in different encodings. For that matter, virtually any language can be written in any phonetic script. What to do about that?

Thursday, February 02, 2006

encoding detection

I've figured out how I want to do encoding detection... I've decided to take the incremental approach and support single-byte encodings only at first.

Hopefully I'll have some time to work on this over the weekend.

Friday, January 27, 2006

work

I've been busy lately. No work on LanguageDetect has happened in the past few days.

Wednesday, January 18, 2006

Satisfied

I made some speed improvements and a basic speed tester. For longer test strings, the new version should now be only 5-10% slower than the old version (but still slower). However, since this class has been shown to be very accurate even for very short strings, one may speed up detection by truncating test strings to, say, 300K at most, with virtually no loss in accuracy.
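The truncation idea and a bare-bones speed test could look something like this in Python. This is a sketch under assumptions, not the package's speed tester: `truncate_sample` and `time_call` are hypothetical helpers, the limit is counted in characters, and whatever detection function you pass in is up to you.

```python
import time


def truncate_sample(text, limit):
    """Cut the sample down to at most `limit` characters before detection."""
    return text[:limit]


def time_call(func, arg, repeats=5):
    """Return the best wall-clock time (in seconds) of func(arg) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best
```

Timing the same detection function on the full text and on `truncate_sample(text, limit)` would show how much the cap actually saves for a given limit.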

Should post the release to PEAR in the next hour.

Tuesday, January 17, 2006

Delay

I've decided to delay the release until I can figure out why the package is slower for longer text samples.

The new version works much faster for short strings but is somewhat slower for longer ones.

In the meantime, anyone interested can check it out from CVS.

Monday, January 16, 2006

Finished

All remaining bugs have been fixed for the next release, and all of the unit tests new and old pass successfully. All that remains now are minor cleanup tasks, especially in the inline documentation. Expect a release tomorrow.