lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term
Date Sun, 28 Sep 2014 21:14:24 GMT
Hi,

How about the coord factor? Does it kick in when the stemmed and original tokens both match?



On Friday, September 26, 2014 1:32 AM, Diego Fernandez <difernan@redhat.com> wrote:
The difference is that when you query the same form that was indexed, you match 2 tokens,
including the less common one (the original form).  When you query a different form, you match
only the more common one (the stem).  So you're really getting the "boost" from both the tiny
difference in TF*IDF and the extra token that you match on.

However, I agree that adding a payload might be a better solution.
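
To make the token-overlap argument concrete, here is a minimal sketch in Python. It only imitates the analysis chain from the wiki example (whitespace tokenizer, keyword repeat, stemmer, duplicate removal) and counts which query tokens also occur in the document; it does not compute Lucene scores. The `TOY_STEMS` table, `analyze` and `matched_terms` are hypothetical names made up for this illustration, and the lookup table is a stand-in for a real stemmer, not the Porter algorithm.

```python
# Toy stand-in for a real stemmer (NOT the Porter algorithm); it covers only
# the words used in this example. Hypothetical, for illustration only.
TOY_STEMS = {"running": "run", "runs": "run"}


def analyze(text):
    """Imitate WhitespaceTokenizer -> KeywordRepeat -> stemmer -> RemoveDuplicates."""
    tokens = []
    for word in text.split():
        original = word                      # keyword-marked copy, left unstemmed
        stemmed = TOY_STEMS.get(word, word)  # second copy goes through the stemmer
        tokens.append(original)
        if stemmed != original:              # drop the duplicate when stemming is a no-op
            tokens.append(stemmed)
    return tokens


def matched_terms(query, document):
    """Return the analyzed query tokens that also occur among the document's tokens."""
    doc_terms = set(analyze(document))
    return [t for t in analyze(query) if t in doc_terms]


doc = "I am running home"
print(matched_terms("running home", doc))  # ['running', 'run', 'home']
print(matched_terms("runs home", doc))     # ['run', 'home']
```

The same-form query matches one more token ("running") than the different-form query. How much that extra match is worth depends on the similarity in use, which this sketch does not model.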

----- Original Message -----
> Hi - but this makes no sense: they are scored as equals, except for tiny
> differences in TF and IDF. What you would need is something like a stemmer
> that preserves the original token and gives a payload < 1 to the stemmed
> token. The same goes for filters like decompounders and accent folders that
> change the meaning of words.
>  
>  
> -----Original message-----
> > From:Diego Fernandez <difernan@redhat.com>
> > Sent: Wednesday 17th September 2014 23:37
> > To: solr-user@lucene.apache.org
> > Subject: Re: How does KeywordRepeatFilterFactory help giving a higher score
> > to an original term vs a stemmed term
> > 
> > I'm not 100% on this, but I imagine this is what happens:
> > 
> > (using -> to mean "tokenized to")
> > 
> > Suppose that you index:
> > 
> > "I am running home" -> "am run running home"
> > 
> > If you then query "running home" -> "run running home", you match both
> > "run" and "running" and thus get a higher score than if you query
> > "runs home" -> "run runs home", which matches only "run".
> > 
> > 
> > ----- Original Message -----
> > > The Solr wiki says: "A repeated question is "how can I have the
> > > original term contribute more to the score than the stemmed
> > > version"? In Solr 4.3, the KeywordRepeatFilterFactory has been
> > > added to assist this functionality."
> > > 
> > > https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
> > > 
> > > (Full section reproduced below.)
> > > I can see how in the example from the wiki reproduced below that both
> > > the stemmed and original term get indexed, but I don't see how the
> > > original term gets more weight than the stemmed term.  Wouldn't this
> > > require a filter that gives terms with the keyword attribute more
> > > weight?
> > > 
> > > What am I missing?
> > > 
> > > Tom
> > > 
> > > 
> > > 
> > > ---------------------------------------------
> > > "A repeated question is "how can I have the original term contribute
> > > more to the score than the stemmed version"? In Solr 4.3, the
> > > KeywordRepeatFilterFactory has been added to assist this
> > > functionality. This filter emits two tokens for each input token, one
> > > of them is marked with the Keyword attribute. Stemmers that respect
> > > keyword attributes will pass through the token so marked without
> > > change. So the effect of this filter would be to index both the
> > > original word and the stemmed version. The 4 stemmers listed above all
> > > respect the keyword attribute.
> > > 
> > > For terms that are not changed by stemming, this will result in
> > > duplicate, identical tokens in the document. This can be alleviated by
> > > adding the RemoveDuplicatesTokenFilterFactory.
> > > 
> > > <fieldType name="text_keyword" class="solr.TextField"
> > > positionIncrementGap="100">
> > >  <analyzer>
> > >    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >    <filter class="solr.KeywordRepeatFilterFactory"/>
> > >    <filter class="solr.PorterStemFilterFactory"/>
> > >    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >  </analyzer>
> > > </fieldType>"
> > > 
> > 
> > --
> > Diego Fernandez - 爱国
> > Software Engineer
> > GSS - Diagnostics




-- 
Diego Fernandez - 爱国
Software Engineer
GSS - Diagnostics

IRC: aiguofer on #gss and #customer-platform 
