lucene-solr-user mailing list archives

From Tom Burton-West <tburt...@umich.edu>
Subject Re: tf and very short text fields
Date Thu, 03 Apr 2014 18:18:07 GMT
Hi Markus and Wunder,

I'm missing the original context, but I don't think BM25 will solve this
particular problem.

The k1 parameter sets how quickly the contribution of tf to the score falls
off with increasing tf. It would be helpful for making sure really long
documents don't get too high a score, but I don't think it would help for
very short documents without defeating the purpose k1 was designed for.
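
To make the k1 behaviour concrete, here is a minimal sketch of the textbook
BM25 tf factor with length normalization left out. The function names are
made up for illustration, and this is not Lucene's BM25Similarity code,
only the formula it is based on.

```python
def bm25_tf_saturation(tf: float, k1: float) -> float:
    """Textbook BM25 tf factor with length normalization ignored (b = 0)."""
    return tf * (k1 + 1.0) / (tf + k1)


def saturation_table(k1_values=(1.2, 2.0), tfs=(1, 2, 5, 20)):
    """Return {k1: [factor for each tf]} so the fall-off can be compared."""
    return {
        k1: [round(bm25_tf_saturation(tf, k1), 3) for tf in tfs]
        for k1 in k1_values
    }


if __name__ == "__main__":
    # Each row flattens out as tf grows and never reaches k1 + 1.
    for k1, row in saturation_table().items():
        print(f"k1={k1}: {row}")
```

A smaller k1 makes the factor level off sooner; a larger k1 lets repeated
terms keep adding to the score for longer.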

For BM25, if you want to turn off length normalization, you set "b" to 0.
However, I don't think that will do what you want: without normalization,
"new york, new york" will still score higher than "new york", because its
tf is twice as large.
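
A rough sketch of that effect, using the textbook BM25 tf factor. The
average field length of 3 tokens is an assumed value picked only for this
example, the function name is hypothetical, and Lucene stores field lengths
in a lossy encoded form, so real scores will differ somewhat.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 tf factor; b = 0 switches length normalization off."""
    length_norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


AVG_LEN = 3.0  # assumed average title length in tokens, for illustration only

# "new york" has tf = 1 and length 2; "new york new york" has tf = 2, length 4.
short_no_norm = bm25_tf(1, 2, AVG_LEN, b=0.0)
long_no_norm = bm25_tf(2, 4, AVG_LEN, b=0.0)
short_norm = bm25_tf(1, 2, AVG_LEN, b=0.75)
long_norm = bm25_tf(2, 4, AVG_LEN, b=0.75)

if __name__ == "__main__":
    print("b=0    ratio long/short:", long_no_norm / short_no_norm)
    print("b=0.75 ratio long/short:", long_norm / short_norm)
```

Under these assumptions the repeated title scores higher in both cases, and
setting b to 0 widens the gap rather than closing it. The ratio stays below
2 because of the k1 saturation, but the ranking problem remains.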

I think the earlier suggestion to "override TFIDFSimilarity and emit 1f in
tf()" is probably the best way to stop using tf counts, assuming that is
really what you want.
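
The idea can be sketched in a few lines. This is a Python analogue, not
Lucene code: the class names are hypothetical, and the sqrt(freq) default is
meant to mirror what Lucene's default TF-IDF similarity does in tf(). In
Lucene itself the override would be written in a Java subclass.

```python
import math


class ClassicTf:
    """Stand-in for a TF-IDF similarity whose tf() grows with the count."""

    def tf(self, freq: float) -> float:
        return math.sqrt(freq)


class BinaryTf(ClassicTf):
    """Override tf() so any matching document contributes the same value."""

    def tf(self, freq: float) -> float:
        return 1.0 if freq > 0 else 0.0


if __name__ == "__main__":
    for sim in (ClassicTf(), BinaryTf()):
        # tf = 1 for "new york", tf = 2 for "new york new york"
        print(type(sim).__name__, sim.tf(1), sim.tf(2))
```

With the binary version the tf part no longer distinguishes the two titles;
any remaining score difference would come from other factors such as the
length norm.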

Tom

On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood <wunder@wunderwood.org> wrote:

> Thanks! We'll try that out and report back. I keep forgetting that I want
> to try BM25, so this is a good excuse.
>
> wunder
>
> On Apr 1, 2014, at 12:30 PM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
>
> > Also, if I remember correctly, k1 set to zero for BM25 automatically
> > omits norms in the calculation. So that's easy to play with without
> > reindexing.
> >
> >
> > Markus Jelsma <markus.jelsma@openindex.io> wrote: Yes, override
> > TFIDFSimilarity and emit 1f in tf(). You can also use BM25 with k1 set to
> > zero in your schema.
> >
> >
> > Walter Underwood <wunder@wunderwood.org> wrote: And here is another
> > peculiarity of short text fields.
> >
> > The movie "New York, New York" should not be twice as relevant for the
> > query "new york". Is there a way to use a binary term frequency rather than
> > a count?
> >
> > wunder
> > --
> > Walter Underwood
> > wunder@wunderwood.org
> >
> >
> >
>
> --
> Walter Underwood
> wunder@wunderwood.org
>
>
>
>
