lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Noll <dan...@nuix.com>
Subject Re: How international languages are supported in Lucene
Date Tue, 10 Jun 2008 00:09:10 GMT
On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote:
> Hi Daniel,
>
> What makes you say that about language detection?  Wouldn't that depend on
> the language detection approach or tool one uses and on the type and amount
> of content one trains language detector on?  And what is the threshold for
> "reliable enough" that you have in mind?

I can't come up with a number of course, but I can say for certain that ICU's 
detector is unusable for detecting languages.  It's barely good enough to 
correctly identify the charset; if you create a simple test in one charset it 
often detects it as another.  If you then re-encode the text in that charset, 
it detects it as being yet another, and so forth.

If you know of any better [open source] libraries for the same purpose, I'd 
love to hear of it.

Additionally, anything the developer or user has to train I consider 
unreliable also.  If a detector has to be trained, it should be trained by 
the ones who are distributing it.  Not everyone has a corpus of every 
language in the world in order to train such a thing. :-/

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message