lucene-java-user mailing list archives

From Tom White
Subject Re: NGram Language Categorization Source
Date Sat, 20 Aug 2005 20:29:12 GMT
Hi Kevin,

On 8/19/05, Kevin Burton wrote:
> Hey lucene guys.
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> Anyway.  This new library that I just released should be easy to tie
> into your lucene indexers.  Just use the library on a text (strip the
> HTML) and then create a new field in Lucene called LANG (or something)
> and then create a filter before you search with JUST that language
> code.
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That would help make up for all the hard work I've done
> here (nudge.. nudge)
> I did a full survey of the lang categorization space for Java and I
> think this is basically the only library out there.

I know of the following existing Java implementations of language
categorization:

* A Nutch implementation:

* A Lucene patch:

* JTextCat, a Java wrapper for libtextcat

* NGramJ, a general n-gram Java library

Of these, the Nutch one is certainly under active development; the
others don't seem to be, as far as I can tell.
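
For anyone who hasn't looked at how n-gram categorizers of this kind
tend to work, here is a minimal sketch of the rank-order profile idea
(Cavnar and Trenkle's "out-of-place" measure). It is not taken from any
of the libraries above: the class name NGramGuesser, the profile size,
the maximum n-gram length, and the tiny training strings are all
assumptions made for illustration. A real profile would be trained on
far more text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration class; not part of any library named in this thread. */
final class NGramGuesser {
    private static final int MAX_N = 3;          // longest n-gram counted (assumed value)
    private static final int PROFILE_SIZE = 300; // n-grams kept per profile (assumed value)

    // language code -> (n-gram -> rank), where rank 0 is the most frequent n-gram
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Builds a ranked profile of the most frequent 1..MAX_N character n-grams. */
    static Map<String, Integer> profile(String text) {
        // Lower-case the text and turn every run of non-letters into one "_" marker.
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < sorted.size() && i < PROFILE_SIZE; i++) {
            ranks.put(sorted.get(i).getKey(), i);
        }
        return ranks;
    }

    /** Stores the profile of a training sample under a language code such as "en". */
    void train(String lang, String sample) {
        profiles.put(lang, profile(sample));
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for misses. */
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            // An n-gram absent from the language profile costs the maximum penalty.
            total += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the code of the closest trained language, or null if nothing was trained. */
    String guess(String text) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            int d = distance(doc, p.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = p.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramGuesser g = new NGramGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        g.train("de", "der hund und die katze sitzen auf der matte und der fuchs springt durch den garten");
        System.out.println(g.guess("the dog and the cat"));
        System.out.println(g.guess("der hund und die katze"));
    }
}
```

The string returned by guess() is the kind of value you would store in
the LANG field Kevin describes and then filter on at search time.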

> Good luck
> ...
> I'm working on a blog post describing how blog search engines like
> Technorati, PubSub, and Feedster could/should use language
> categorization to help deal with the chaos of tagging and full-text
> search. Google has done this for a long time now and Technorati has it
> in beta.

I like your idea of using Wikipedia translations as the training
corpus - it's a good way to get fairly reliable sources for lots of


