Text_LanguageDetect

Wednesday, January 11, 2006

Unix

Sometimes piping long strings of Unix commands together is fun.

For instance, on the new language data files:

wc * | sort -k3,3nr | grep -v total | grep -v zh.txt | head -30

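So far, from downloading random Wikipedia pages, Tamil comes out on top with the most characters, but Japanese has the most lines. Of course, that's probably an artifact of wc: without options it counts bytes, not characters, and knows nothing about UTF-8, so scripts that take several bytes per character look bigger than they are.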
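
To see how much of the ranking is a byte-counting effect, here is a rough sketch that prints both byte and character counts per file. The function names count_bytes and count_chars are made up for this post. The character count assumes the files are valid UTF-8 and counts code points (not what a reader would perceive as letters) by deleting UTF-8 continuation bytes before counting, which avoids relying on the locale that wc -m needs.

```shell
#!/bin/sh
# Byte count: the same figure plain `wc` shows in its third column.
count_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Code point count, independent of the current locale.
# In UTF-8 every continuation byte is in the range 0x80-0xBF (octal 200-277);
# after deleting those, exactly one byte remains per code point.
count_chars() {
    LC_ALL=C tr -d '\200-\277' < "$1" | wc -c | tr -d ' '
}

# Print "bytes chars filename" for each file given on the command line.
for f in "$@"; do
    printf '%s %s %s\n' "$(count_bytes "$f")" "$(count_chars "$f")" "$f"
done
```

Saved as, say, counts.sh, it can be dropped into the same kind of pipeline: sh counts.sh *.txt | sort -k2,2nr | head -30 ranks the files by characters rather than bytes.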
