lucene-java-user mailing list archives

From Kevin Burton <burtona...@gmail.com>
Subject Re: NGram Language Categorization Source
Date Sun, 21 Aug 2005 20:39:43 GMT
> > <ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf>
> >
> >     Linguini: Language Identification for Multilingual Documents
> >     John M. Prager
> 
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for <n>.

Yeah, though to be honest, as long as you're on the long-tail
portion of N, the exact values won't matter much, I think.

All you'll do is waste a bit of memory (around 1k).
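
To make that concrete, here is a rough sketch, not the code discussed in
this thread. It assumes a Cavnar-Trenkle style ranked profile: count
character n-grams for n = 1..max_n, keep only the top_k most frequent, and
compare profiles with an "out-of-place" rank distance. The names
ngram_profile, out_of_place, max_n and top_k are made up for illustration.
Because the profile is capped at top_k entries, raising max_n mostly adds
rare long n-grams that fall off the end of the ranking.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams for n = 1..max_n.

    Assumption: words are lowercased and joined with '_' so that word
    boundaries show up inside the n-grams.
    """
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the ranking is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    for n in (3, 5, 8):
        profile = ngram_profile(sample, max_n=n, top_k=50)
        print(n, len(profile))
```

With a fixed top_k, the stored profile stays the same size no matter how
large max_n gets; only the temporary counting table grows while training.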
 
> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
> 
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.

Yeah. That was my biggest problem. This area had never really been
solved in the OSS world.

-- 
 Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org