lucene-java-user mailing list archives

From Paul Libbrecht <>
Subject Re: AW: Best practices for multiple languages?
Date Wed, 19 Jan 2011 22:08:41 GMT

On 19 Jan 2011, at 20:56, Bill Janssen wrote:

> Paul Libbrecht <> wrote:
>> So you are only indexing "analyzed" and querying "analyzed". Is that correct?
> Yes, that's correct.  I fall back to StandardAnalyzer if no
> language-specific analyzer is available.

>> Wouldn't it be better to prefer precise matches (a field that is
>> analyzed with StandardAnalyzer, for example) but also allow matches
>> that are stemmed?
> StandardAnalyzer isn't quite precise, is it?  StandardFilter does some
> kind of English-centric alterations to things.

from here:

I can only conclude that it correctly handles the variety of characters but does no stemming.
The default constructor of StandardAnalyzer comes with a set of stop words, but they can
easily be disabled.

I think it's quite precise, and certainly a lot more precise than removing the aux of chevaux!
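
To make the idea concrete, here is a minimal sketch in plain Java, not Lucene code. It assumes each
document is indexed twice, once as exact tokens and once as stems, and that an exact match is
boosted above a stem-only match. The class name, the boost values and the deliberately crude
stemmer are all made up for illustration. In a real index I would expect two fields with
different analyzers (for instance through PerFieldAnalyzerWrapper), but that part is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of "exact field + stemmed field" scoring; hypothetical, not Lucene API. */
final class DualFieldSketch {

    // Assumed boosts: a match on the exact form counts more than a match on the stem.
    static final double EXACT_BOOST = 2.0;
    static final double STEM_BOOST = 1.0;

    /** Lower-cases and splits on non-letters, keeping every token (no stop words). */
    static List<String> exactTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Deliberately crude stemmer: strips a trailing "aux", "al" or "s". */
    static String crudeStem(String token) {
        if (token.length() > 4 && token.endsWith("aux")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 4 && token.endsWith("al")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Applies crudeStem to every exact token, like a second, stemmed field. */
    static List<String> stemmedTokens(String text) {
        List<String> stems = new ArrayList<>();
        for (String t : exactTokens(text)) {
            stems.add(crudeStem(t));
        }
        return stems;
    }

    /** Sums the boosts of every query token found in the exact and/or stemmed view. */
    static double score(String query, String document) {
        List<String> exact = exactTokens(document);
        List<String> stems = stemmedTokens(document);
        double total = 0.0;
        for (String q : exactTokens(query)) {
            if (exact.contains(q)) {
                total += EXACT_BOOST;
            }
            if (stems.contains(crudeStem(q))) {
                total += STEM_BOOST;
            }
        }
        return total;
    }
}
```

With this sketch, a query for "cheval" ranks "Le cheval blanc" (exact and stem match) above
"Les chevaux blancs" (stem match only), and both above a text that does not mention horses.
The exact view does not depend on the language at all, so it still matches when the language
indication is wrong and an unsuitable stemmer gets applied.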

> Perhaps the approach you suggest would be slightly better, but I'd have
> to see numbers on that from some reasonable corpus to be convinced it
> would be worth it.

I am not sure I have these.
I made several changes of this sort, and the precision and recall measures improved, in particular
in the presence of language-indication failures, which happened to be very common in our authoring
