lucene-java-user mailing list archives

From oren bochman <orenboch...@gmail.com>
Subject Re: Hunspell stemmer generates multiple tokens
Date Sat, 08 Jun 2013 00:00:46 GMT
Emitting multiple tokens seems to be the more flexible contract.

You might want to match just the stem, both the exact token and the stemmed token,
or just the exact term. So putting both in the index may be expedient, depending on the language.

Also, there are a number of common situations where document text can be stemmed more accurately
than query text. In such cases you might want to boost the stemmed token adaptively.
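
To make the idea concrete, here is a minimal sketch in plain Java with no Lucene
dependency. Everything in it is an assumption for illustration: the class
StemBoostSketch, the hard-coded STEMS map (a toy stand-in, not the Hunspell
algorithm), and the exactBoost/stemBoost parameters are hypothetical names, not
part of any Lucene or elasticsearch API. It only shows how indexing both the
exact token and the stem lets the query side decide how much each kind of match
should count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model: one "field" that holds exact tokens and stems together.
final class StemBoostSketch {

    // Toy stand-in for dictionary lookups; NOT how Hunspell actually stems.
    static final Map<String, String> STEMS = new HashMap<>();
    static {
        STEMS.put("running", "run");
        STEMS.put("runs", "run");
    }

    // Emit the original word first, then its stem if one is known and differs.
    static List<String> analyze(String word) {
        List<String> out = new ArrayList<>();
        out.add(word);
        String stem = STEMS.get(word);
        if (stem != null && !stem.equals(word)) {
            out.add(stem);
        }
        return out;
    }

    // "Index" a document as the set of every token the analyzer emits.
    static Set<String> index(String... words) {
        Set<String> terms = new HashSet<>();
        for (String w : words) {
            terms.addAll(analyze(w));
        }
        return terms;
    }

    // Two optional clauses: the exact query word and, if it has one, its stem.
    // Each clause that matches contributes its own boost to the score.
    static double score(Set<String> doc, String queryWord,
                        double exactBoost, double stemBoost) {
        List<String> tokens = analyze(queryWord);
        double s = 0.0;
        if (doc.contains(tokens.get(0))) {
            s += exactBoost;
        }
        if (tokens.size() > 1 && doc.contains(tokens.get(1))) {
            s += stemBoost;
        }
        return s;
    }

    public static void main(String[] args) {
        Set<String> doc = index("running", "shoes");
        System.out.println(score(doc, "running", 1.0, 0.5)); // exact and stem both match
        System.out.println(score(doc, "runs", 1.0, 0.5));    // only the stem matches
    }
}
```

Setting exactBoost to 0 gives stem-only matching, and setting stemBoost to 0
gives exact-only matching, so the same index serves all three cases. Note that in
this toy model a query word that is already a stem (for example "run") is
indistinguishable from a stemmed token in the index; a real setup would have to
decide how to handle that.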


On Jun 7, 2013, at 16:16, Luca Cavanna <cavannaluca@gmail.com> wrote:

> Hi,
> I just noticed that the HunspellStemmer outputs more than one token: the
> original word plus the stems, as far as I understand.
> 
> This is not quite what I would expect and becomes tricky especially at
> query time. Using for instance elasticsearch to query a stemmed field, a
> boolean query would be generated, containing multiple clauses (one for each
> token generated by the stemmer) instead of just a clause with the stem that
> we expect to find in the index (if we indexed using stemming of course).
> 
> I would like to know if you think this is the correct behaviour and if this
> is something you are aware of. If I look at snowball for example, I see
> that only one token is generated.
> 
> 
> Thanks,
> Luca
