lucene-java-user mailing list archives

From Kevin Burton
Subject Re: NGram Language Categorization Source
Date Sun, 21 Aug 2005 20:39:43 GMT
> >
> >     Linguini: Language Identification for Multilingual Documents
> >     John M. Prager
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for <n>.

Yeah... though to be honest, as long as you're on the long-tail
portion of N, I don't think the values will matter much.

All you'll do is waste a bit of memory (like 1k).
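
For readers following the thread, here is a minimal sketch of rank-based
n-gram language categorization, to show where a cutoff like N enters.
It is not the code discussed in this thread. It assumes N means the number
of top-ranked n-grams kept per language profile (it could also be read as
the n-gram length, exposed here as n_max). All names (ngrams, build_profile,
out_of_place, classify, SAMPLES) are hypothetical, and the two sample
sentences are placeholders, not real training corpora.

```python
from collections import Counter

def ngrams(text, n_max=3):
    """Yield character n-grams of length 1..n_max from each padded word."""
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def build_profile(text, profile_size=300, n_max=3):
    """Map the profile_size most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(text, n_max))
    ranked = [gram for gram, _ in counts.most_common(profile_size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def classify(text, lang_profiles, profile_size=300, n_max=3):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(text, profile_size, n_max)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Placeholder training text (assumption: real profiles need far larger corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: build_profile(text) for lang, text in SAMPLES.items()}
```

In this sketch, raising profile_size only appends lower-frequency n-grams
to each profile, so the cost of choosing it too large is extra dictionary
entries per language.
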
> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.

Yeah. That was my biggest problem. This area had never really been
solved in the OSS world.

 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
