lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: How international languages are supported in Lucene
Date Mon, 09 Jun 2008 21:52:20 GMT
Hi Daniel,

What makes you say that about language detection?  Wouldn't that depend on the language detection
approach or tool one uses, and on the type and amount of content the language detector is trained
on?  And what threshold for "reliable enough" do you have in mind?


Thanks,
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Daniel Noll <daniel@nuix.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, June 5, 2008 7:36:11 PM
> Subject: Re: How international languages are supported in Lucene
> 
> > But basically consider why this must be so, especially when
> > stemming. Languages are so variable that you'd get wildly
> > different (and inappropriate) results if you tried to analyze them
> > with the same analyzer. Especially when you get different
> > language encodings in the document.
> 
> Well... technically encoding is out of the scope of Lucene since we're passing 
> in a Reader.
> 
> I have to say though, analysing with the most naive analyser possible (the 
> default one with no stop words and no stemming) works well enough.
> 
> Language detection isn't at a point where it's reliable enough to use to 
> determine which analyser to use automatically.
> 
> Daniel
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

