> > ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
> >
> > Linguini: Language Identification for Multilingual Documents
> > John M. Prager
>
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for <n>.
Yeah, though to be honest, as long as you're on the long-tail
portion of N, I don't think the exact value will matter much.
All you'll do is waste a bit of memory (like 1k).
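
A rough sketch of the trade-off, assuming plain character n-gram
counting. The class and method names are made up for illustration;
this is neither the actual identifier code nor anything from Linguini.
It only shows how the number of distinct n-grams (and so the memory
used by a profile) grows as the maximum n goes up:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class NGramProfile {

    /** Counts every character n-gram of length 1..maxN found in the text. */
    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "the quick brown fox jumps over the lazy dog";
        // Print the profile size for each choice of maximum n.
        for (int maxN = 1; maxN <= 5; maxN++) {
            System.out.println("maxN=" + maxN
                + " distinct n-grams=" + count(sample, maxN).size());
        }
    }
}
```

Running it on your own training text for a few values of maxN gives a
feel for how much extra space a larger n costs.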
> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
>
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.
Yeah. That was my biggest problem. This area had never really been
solved in the OSS world.
--
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412