Text_LanguageDetect

Tuesday, April 25, 2006

Status

I've set up a new home for the demo of the package: http://languagedetect.org. You gotta love cheap domain names.

I've compiled many more language definitions using Wikipedia as sample text. I've been told that the existing Russian definition is of poor quality, so I would like to figure out how to tell whether the new version is more accurate.

The new language definitions include East Asian languages. These don't fare nearly as well as their Western counterparts, but they more or less work, with the exception of Chinese. There are simply so many characters in the Chinese writing system that any given trigram is unlikely to turn up more than once, which leaves very little overlap between a sample and the language definition to score against. This is a problem. I need to come up with some new way to score East Asian languages in order for this to work.
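To make the sparsity problem concrete, here is a minimal sketch in Python (not the package's own PHP code) that measures how often character trigrams repeat within a sample. The function names `char_trigrams` and `repeated_fraction` are made up for illustration, and the two sample strings are arbitrary; the point is only that the measure can be compared across scripts.

```python
from collections import Counter


def char_trigrams(text):
    """Return every overlapping three-character window in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def repeated_fraction(text):
    """Fraction of trigram occurrences whose trigram appears more than once."""
    grams = char_trigrams(text)
    if not grams:
        return 0.0
    counts = Counter(grams)
    # Sum the occurrences of every trigram that was seen at least twice.
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(grams)


if __name__ == "__main__":
    # Arbitrary sample sentences, chosen only for illustration.
    english = "the cat and the hat and the bat sat on the mat"
    chinese = "我今天早上去市场买了一些新鲜的蔬菜和水果"
    print("English:", repeated_fraction(english))
    print("Chinese:", repeated_fraction(chinese))
```

Running this over larger samples of each language would give a rough sense of how much less trigram overlap there is to work with in Chinese text.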

Simply treating the presence of any Chinese glyphs as solid proof of Chinese is not enough, because Chinese writing, especially on the internet, will often contain scraps of other languages mixed in. Of course the opposite is true as well: samples of writing in other languages will sometimes contain Chinese characters. So there needs to be some kind of scoring rule for deciding what is what.
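One possible shape for such a rule, sketched in Python rather than the package's PHP, is to score a sample by the share of its letters that are Chinese characters and compare that against a cutoff. Everything here is an assumption for illustration: the names `is_han`, `han_ratio` and `looks_chinese`, the 0.5 threshold, and the choice to check only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). It is not how the package scores anything.

```python
def is_han(ch):
    """True if ch is in the basic CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def han_ratio(text):
    """Share of alphabetic characters in text that are Han ideographs."""
    # Ignore digits, punctuation and whitespace so they don't dilute the score.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_han(ch)) / len(letters)


def looks_chinese(text, threshold=0.5):
    """Hypothetical rule: call it Chinese if enough of the letters are Han."""
    return han_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_chinese("今天天气很好 ok"))  # mostly Han with a scrap of Latin
    print(looks_chinese("Hello 世界"))       # mostly Latin with a scrap of Han
```

A ratio like this tolerates scraps of other languages in either direction, but it cannot tell Chinese apart from Japanese text that is heavy on kanji, so it would have to be combined with some other signal.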

I also have a project or two for which character encoding detection would be useful, so I would like to start working on that soon.

There are a couple of bug fixes in CVS, and I may make a release of them before I start adding stuff. This is the most likely thing to happen soon.

Comments:

Anonymous said...

Hi!
1st: thanks for this nice PEAR package, good job.
2nd: is there a way to get the languages as standard codes (for example: cz, fr, ru) instead of strings in English (czech, french, russian)? The names are not standard, which makes them hard to compare with other data :( (especially for my app, which doesn't speak English).
If not, it would be pretty useful if you added some functions like these:
getLanguageCode($language) <<< these two would be static
getLanguageString($code) <<<
public function getLanguageCodes()
detectSimpleCode() << this one would be VERY useful :)
...

Personally, I would be happy to do it myself, but I don't understand how your package works (the structure of the .dat files looks quite hard to understand without documentation) ... maybe if you explained it to me I could try to do something?

Anyway, thanks and keep up the good work ;)

1:09 PM  
Christian Weiske said...

See https://pear.php.net/bugs/bug.php?id=19221 for the ISO 639 two- and three-letter code request. I'm implementing it now.

4:30 AM  
