lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <p...@hoplahup.net>
Subject Re: AW: Best practices for multiple languages?
Date Wed, 19 Jan 2011 22:08:41 GMT

Le 19 janv. 2011 à 20:56, Bill Janssen a écrit :

> Paul Libbrecht <paul@hoplahup.net> wrote:
> 
>> So you are only indexing "analyzed" and querying "analyzed". Is that correct?
> 
> Yes, that's correct.  I fall back to StandardAnalyzer if no
> language-specific analyzer is available.

> 
>> Wouldn't it be better to prefer precise matches (a field that is
>> analyzed with StandardAnalyzer for example) but also allow matches are
>> stemmed.
> 
> StandardAnalyzer isn't quite precise, is it?  StandardFilter does some
> kind of English-centric alterations to things.

from here:
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/analysis/standard/StandardTokenizer.html

I can only conclude that it handles correctly the characters variety but does not stemming.
The default constructor of StandardAnalyzer comes with a bunch of stop-words but they are
easily deactivatable.


I think it's quite precise, and certainly a lot more precise than removing the aux of chevaux!

> Perhaps the approach you suggest would be slightly better, but I'd have
> to see numbers on that from some reasonable corpus to be convinced it
> would be worth it.

I am not sure I have these.
I did several changes of this sort and the precision and recall measures went better in particular
in presence of language-indication failure which happened to be very common in our authoring
environment.

paul
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message