lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom White <tom.e.wh...@gmail.com>
Subject Re: NGram Language Categorization Source
Date Sat, 20 Aug 2005 20:29:12 GMT
Hi Kevin,

On 8/19/05, Kevin Burton <burtonator@gmail.com> wrote:
> Hey lucene guys.
> 
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> 
> Anyway.  This new library that I just released should be easy to tie
> into your lucene indexers.  Just use the library on a text (strip the
> HTML) and then create a new field in Lucene called LANG (or soemthing)
> and then create a filter before you search with JUST that language
> code.
> 
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That help make up for all the hard work I've done
> here (nudge.. nudge)
> 
> I did a full research of the lang categorization space for Java and I
> think this is basically the only library out there.

I know of the following existing Java implementations of language
categorization:

* A Nutch implementation:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

* A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

* JTextCat (http://www.jedi.be/JTextCat/index.html),  a Java wrapper
for libtextcat

* NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

Of these, the Nutch one is certainly under active development, the
others don't seem to be as far as I can tell.

> 
> Good luck
> ...
> 
> I'm working on a blog post describing how blog search engines like
> Technorati, PubSub, and Feedster could/should use language
> categorization to help deal with the chaos of tagging and full-text
> search. Google has done this for a long time now and Technorati has it
> in beta.
> 
> http://www.feedblog.org/2005/08/ngram_language_.html
> 

I like your idea of using Wikipedia translations as the training
corpus - it's a good way to get fairly reliable sources for lots of
languages.

Regards,

Tom

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message