lucene-java-user mailing list archives

From Ken Krugler
Subject Re: NGram Language Categorization Source
Date Sat, 20 Aug 2005 15:41:18 GMT
Hi Kevin,

>I know for a fact that a bunch of you have been curious about language
>categorization for a long time now and Java has lacked a solid way to
>solve this problem.
>Anyway.  This new library that I just released should be easy to tie
>into your lucene indexers.  Just use the library on a text (strip the
>HTML) and then create a new field in Lucene called LANG (or something)
>and then create a filter before you search with JUST that language.
>I'd love some help with filling out missing languages if anyone has
>some spare time.  That would help make up for all the hard work I've done
>here (nudge.. nudge)
>I did a full survey of the lang categorization space for Java and I
>think this is basically the only library out there.


Recently I'd posted the following to the Nutch mailing list, since 
the topic of determining web page languages had come up there as well:

>Given the recent discussion regarding charset/language detection on 
>this list, people might find this IBM research paper interesting:
>     Linguini: Language Identification for Multilingual Documents
>     John M. Prager

Prager also uses an n-gram approach, so you might be able to take 
advantage of some of his research into optimal values for <n>.
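For anyone unfamiliar with the general idea, here is a minimal sketch of n-gram language categorization by rank-order profile comparison (the "out-of-place" measure usually credited to Cavnar and Trenkle). It is not the code from Kevin's library and not Linguini's method; the class name NGramSketch, its method names, and the parameters n and size are all hypothetical. The value of n is left as a parameter rather than hard-coded, since it is exactly the thing worth tuning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of any existing library.
final class NGramSketch {

    // Ranked profile: the `size` most frequent character n-grams of `text`,
    // most frequent first. Ties are broken alphabetically so the result is
    // deterministic.
    static List<String> profile(String text, int n, int size) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(size, grams.size())));
    }

    // "Out-of-place" distance: for each n-gram in the document profile, add the
    // difference between its rank there and its rank in the language profile.
    // An n-gram missing from the language profile costs the maximum penalty.
    static int distance(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            total += (r == null) ? lang.size() : Math.abs(r - i);
        }
        return total;
    }

    // Returns the key of the training text whose profile is closest to the
    // document's profile, or null if no training text is given.
    static String classify(String text, Map<String, String> training, int n, int size) {
        List<String> doc = profile(text, n, size);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : training.entrySet()) {
            int d = distance(doc, profile(e.getValue(), n, size));
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The string returned by classify (say "en" or "de", whatever keys the training map uses) is the sort of value that could be stored in the LANG field Kevin describes. How well this works depends heavily on the amount of training text per language and on the choice of n and size, so treat it as a starting point for experiments.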

The code to Linguini doesn't seem to be available (you have to 
purchase some IBM product(s) to get it) so what you've done is great 
for the open source community - thanks!

Also I could post to the Unicode list re training data in multiple 
languages, as that's a good place to find out about multilingual 

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
