lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: NGram Language Categorization Source
Date Sat, 20 Aug 2005 15:41:18 GMT
Hi Kevin,

>I know for a fact that a bunch of you have been curious about language
>categorization for a long time now and Java has lacked a solid way to
>solve this problem.
>
>Anyway.  This new library that I just released should be easy to tie
>into your lucene indexers.  Just use the library on a text (strip the
>HTML) and then create a new field in Lucene called LANG (or soemthing)
>and then create a filter before you search with JUST that language
>code.
>
>I'd love some help with filling out missing languages if anyone has
>some spare time.  That help make up for all the hard work I've done
>here (nudge.. nudge)
>
>I did a full research of the lang categorization space for Java and I
>think this is basically the only library out there.

[snip]

Recently I'd posted the following to the Nutch mailing list, since 
the topic of determining web page languages had come up there as well:

>Given the recent discussion regarding charset/language detection on 
>this list, people might find this IBM reseearch paper interesting:
>
><ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf>ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
>
>     Linguini: Language Identification for Multilingual Documents
>     John M. Prager

Prager also uses an n-gram approach, so you might be able to take 
advantage of some of his research into optimal values for <n>.

The code to Linguini doesn't seem to be available (you have to 
purchase some IBM product(s) to get it) so what you've done is great 
for the open source community - thanks!

Also I could post to the Unicode list re training data in multiple 
languages, as that's a good place to find out about multilingual 
corpora.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message