lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <>
Subject Re: How international languages are supported in Lucene
Date Tue, 10 Jun 2008 00:46:58 GMT
Aha, I see.  I wasn't referring to character-encoding-based lang ID.  That is probably good
enough if you need to know if the text is English or if it's Chinese or Japanese or Korean
or Russian Cyrillic or Arabic or...

I think there is a bit missing in your statement about training.  You can't just train on,
say "English".  There are all kinds of "English" out there.  The distribution of terms in
WSJ is probably a lot different from what you might find in some corpus of medical texts or
botany tests or contemporary art.

As for not everyone having access to a corpus, I've come to love Wikipedia for exactly this
thing. :)  For an NLP class I took recently we used Croatian Wikipedia while developing an
unsupervised morphological analyzer for highly inflected languages (Croatian being one of
them - 7 cases, a pile of case/number/gender-dependent suffixes...).  The analyzer ended up
matching the state-of-the-art results. :)

Sematext -- -- Lucene - Solr - Nutch

----- Original Message ----
> From: Daniel Noll <>
> To:
> Sent: Tuesday, June 10, 2008 2:09:10 AM
> Subject: Re: How international languages are supported in Lucene
> On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote:
> > Hi Daniel,
> >
> > What makes you say that about language detection?  Wouldn't that depend on
> > the language detection approach or tool one uses and on the type and amount
> > of content one trains language detector on?  And what is the threshold for
> > "reliable enough" that you have in mind?
> I can't come up with a number of course, but I can say for certain that ICU's 
> detector is unusable for detecting languages.  It's barely good enough to 
> correctly identify the charset; if you create a simple test in one charset it 
> often detects it as another.  If you then re-encode the text in that charset, 
> it detects it as being yet another, and so forth.
> If you know of any better [open source] libraries for the same purpose, I'd 
> love to hear of it.
> Additionally, anything the developer or user has to train I consider 
> unreliable also.  If a detector has to be trained, it should be trained by 
> the ones who are distributing it.  Not everyone has a corpus of every 
> language in the world in order to train such a thing. :-/
> Daniel
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message