Text_LanguageDetect

Friday, January 27, 2006

Work

I've been busy lately. No work has been done on LanguageDetect in the past few days.

Wednesday, January 18, 2006

Satisfied

I made some speed improvements and a basic speed tester. For longer test strings, the new version should be only 5-10% slower than the old one (still slower, though). However, since this class has been shown to be very accurate even for very short strings, detection can be sped up by truncating test strings to, say, 300K at most, with virtually no loss in accuracy.
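Just to illustrate the idea, here's a rough, untested sketch -- assuming the detect() call from the current API, and mb_substr() from the mbstring extension for a character-safe cut (the filename is made up, of course):

<?php
require_once 'Text/LanguageDetect.php';

// Sketch only: cap very long samples before running detection.
// mb_substr() cuts by characters rather than bytes, so multi-byte
// UTF-8 sequences don't get split down the middle.
$detector = new Text_LanguageDetect();

$sample = file_get_contents('big_document.txt');   // some long document (made-up filename)
$sample = mb_substr($sample, 0, 300000, 'UTF-8');  // keep at most ~300K characters

$scores = $detector->detect($sample, 3);           // top 3 candidate languages
print_r($scores);
?>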

Should post the release to PEAR in the next hour.

Tuesday, January 17, 2006

Delay

I've decided to delay the release until I can figure out why the package is slower for longer text samples.

The new version works much faster for short strings but is somewhat slower for longer ones.

Anyone who's interested can check it out from CVS in the meantime.

Monday, January 16, 2006

Finished

All remaining bugs have been fixed for the next release, and all of the unit tests, new and old, pass successfully. All that remains now is minor cleanup, especially in the inline documentation. Expect a release tomorrow.

Friday, January 13, 2006

Tuesday

I cleaned up one more bug...

I could rush and fix the unicode problem I've been having, but I don't want to put out a release right before I'm off in the woods for the 3-day weekend, in case there's some killer bug.

Another day

Looks like there's at least one calculation being done wrong in the new unicode stuff, based on the new tests I devised, so that's going to hold things up for at least another day.

Thursday, January 12, 2006

Parser

I've decided on Parser for the new class. This means that a new release of the package will be coming soon.

I'm trying to resist the temptation of adding new languages.

Wednesday, January 11, 2006

Unix

Sometimes piping long strings of unix commands together is fun.

For instance, on the new language data files:

wc * | sort +2 -r | grep -v total | grep -v zh.txt | head -30

So far, from downloading random wikipedia pages, Tamil comes in on top with the most characters, but Japanese has the most lines. Of course, that's probably just an artifact of wc not knowing about UTF-8: it counts bytes, not characters, so multi-byte scripts end up with inflated counts.
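(The same byte-versus-character gap is easy to see in PHP, by the way: strlen() counts bytes while mb_strlen() counts characters. A quick illustration, nothing to do with the package itself:)

<?php
// Byte count vs. character count for a UTF-8 string.
$word = 'தமிழ்';   // "Tamil" in Tamil script: 5 characters, 15 bytes in UTF-8

echo strlen($word), "\n";             // 15 -- bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // 5  -- characters
?>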

What I'm working on now

The next version is finished -- I even got it to pass all of the dozens of regression tests, even though I completely rewrote the parser. I guess the only thing holding me up now is the naming of an object.

Whenever the detector wants to slice and dice a piece of text, it instantiates this new object. Should it be called a Sample object (as in a sample of text) or a Parser object? Each name implies different things for future development, I think, in terms of how the object should be used or subclassed.
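For what it's worth, here's the rough shape of what I have in mind -- a hypothetical sketch only, with names and method signatures that are by no means final (I'll call it Parser here just for concreteness):

<?php
// Hypothetical sketch -- names are not final. The detector hands the raw
// string to this object, which does the slicing and dicing (here, a naive
// byte-based trigram count just to show the shape of the interface).
class Text_LanguageDetect_Parser
{
    var $_string;
    var $_trigram_freq = array();

    function Text_LanguageDetect_Parser($string)
    {
        $this->_string = $string;
    }

    function analyze()
    {
        $len = strlen($this->_string);
        for ($i = 0; $i < $len - 2; $i++) {
            $trigram = substr($this->_string, $i, 3);
            if (!isset($this->_trigram_freq[$trigram])) {
                $this->_trigram_freq[$trigram] = 0;
            }
            $this->_trigram_freq[$trigram]++;
        }
        arsort($this->_trigram_freq);   // most frequent trigrams first
    }

    function getTrigramFreq()
    {
        return $this->_trigram_freq;
    }
}

// Inside the detector it would be used something like:
//   $parser = new Text_LanguageDetect_Parser($sample);
//   $parser->analyze();
//   $freq = $parser->getTrigramFreq();
?>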

Also, I'm downloading more training text from wikipedia. I've found that the other-language wikipedias are mostly worthless auto-generated text if they have fewer than 1000 articles. Too bad; something tickles me inside about being able to detect Yiddish. I don't think new languages will make it into the next release.

And finally, the demo page is failing. I don't run the server it's on, so I don't know who changed what.

Design choices

I decided to use blogspot for this blog because a) it's not going to go out of business like those "free wordpress" sites might, and, more importantly, b) I don't want to have to worry about maintaining it.

This second point may prove to be barking up the wrong tree, though, since I've already spent at least an hour getting the RSS feeds I wanted to show up correctly. First I tried AJAX, and was quickly reminded that AJAX can't retrieve URLs from outside the current domain (so lame).

Then I tried one of those free RSS -> javascript sites. I know these can slow down the loading of your whole page -- and that is exactly what happened.

However, I figured out a way to hack around it. This blogger template uses "pure CSS" to do the layout, and I noticed that the page stops rendering when it reaches the part of the page where the javascript tag that fetches the converted RSS sits. So all I had to do was stick the javascript tag at the very end of the HTML and position it wherever I wanted using CSS. This worked perfectly.

I tinkered a bit with the layout. I moved the left sidebar to the right, and it took me a while to figure out how to float the title, which had been part of the sidebar code, over the main content body.

It begins

This is the start of the Text_LanguageDetect development blog.