lucene-dev mailing list archives

From Doron Cohen <DOR...@il.ibm.com>
Subject Re: highlight - scoring fragments with more of the same token
Date Tue, 26 Sep 2006 21:30:03 GMT

markharw00d <markharw00d@yahoo.co.uk> wrote on 26/09/2006 00:11:12:
> If you were to score repeated terms then I suspect it would have to be
> done so that the repetitions didn't score as highly as the first
> occurrence - otherwise f2 could be selected as a better fragment than f3
> for the query q1 in your example.
> Repetitions of a term in a fragment could be scored as a very small
> fraction of the score given to the first occurrence. This would at least
> rank  f2 higher than f1 for query q2.
> Another potentially useful ranking factor may be to boost fragments
> found at the beginning of a document - that's where people tend to write
> summaries or introductions.

Yes, it makes sense to add these heuristics.
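
To make the first heuristic concrete, here is a minimal sketch of scoring
repeats at a small fraction of the first occurrence. It is not the existing
QueryScorer code: the class name RepeatDiscountScorer and the weight 0.01 are
assumptions made up for illustration, and fragments are plain token lists.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Lucene highlighter.
class RepeatDiscountScorer {
    // Assumed weight given to each repeat of an already-seen query term.
    static final double REPEAT_WEIGHT = 0.01;

    // First occurrence of a query term scores 1.0, every repeat scores REPEAT_WEIGHT.
    static double score(List<String> fragment, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        double score = 0.0;
        for (String token : fragment) {
            if (!queryTerms.contains(token)) {
                continue; // not a query term, contributes nothing
            }
            // Set.add returns true only the first time the token is added.
            score += seen.add(token) ? 1.0 : REPEAT_WEIGHT;
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("aa", "11", "bb", "33", "cc");
        List<String> f2 = Arrays.asList("aa", "11", "bb", "11", "cc");
        List<String> f3 = Arrays.asList("aa", "11", "bb", "22", "cc");
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));
        System.out.println("q1: " + score(f1, q1) + " " + score(f2, q1) + " " + score(f3, q1));
        System.out.println("q2: " + score(f1, q2) + " " + score(f2, q2) + " " + score(f3, q2));
    }
}
```

With a weight this small, f3 still beats f2 for q1 (two distinct terms
against one term plus a repeat), while f2 beats f1 for q2.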

I was somewhat surprised to find that highlighter scoring simply counts
how many unique query terms appear in the fragment. I guess I was expecting
a more Similarity-like ranking of fragments - something that would perhaps
have tf related to the frequency of a term in the fragment, and idf related
to the frequency of the term in the entire text. Idf would be meaningless
for a single term query, since it would scale every fragment by the same
factor. Possibly, idf could relate to "iff" ~ the inverse of the number of
fragments containing the term. I am not sure if this is worth the effort,
but it seems more correct...?
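
A rough sketch of what such a tf/"iff" ranking could look like, just to
have something to argue about. Everything here is an assumption: the class
FragmentTfIffScorer is made up, fragments are plain token lists, and the
formulas only borrow the general shape of a tf-idf similarity (sqrt of the
in-fragment frequency, and 1 + ln(N / (n + 1)) with N fragments in total and
n fragments containing the term).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical tf * iff fragment scorer; not an existing Lucene class.
class FragmentTfIffScorer {
    // "iff": inverse fragment frequency, higher for terms found in fewer fragments.
    static double iff(List<List<String>> fragments, String term) {
        int containing = 0;
        for (List<String> fragment : fragments) {
            if (fragment.contains(term)) {
                containing++;
            }
        }
        return 1.0 + Math.log(fragments.size() / (double) (containing + 1));
    }

    // Sum over query terms of sqrt(frequency in this fragment) * iff(term).
    static double score(List<String> fragment, List<List<String>> fragments,
                        Set<String> queryTerms) {
        double score = 0.0;
        for (String term : queryTerms) {
            int freq = Collections.frequency(fragment, term);
            if (freq > 0) {
                score += Math.sqrt(freq) * iff(fragments, term);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("aa", "11", "bb", "33", "cc");
        List<String> f2 = Arrays.asList("aa", "11", "bb", "11", "cc");
        List<String> f3 = Arrays.asList("aa", "11", "bb", "22", "cc");
        List<List<String>> all = Arrays.asList(f1, f2, f3);
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        for (List<String> f : all) {
            System.out.println(f + " -> " + score(f, all, q1));
        }
    }
}
```

On the f1/f2/f3 example from the quoted message this ranks f3 > f2 > f1 for
q1, and f2 above f1 and f3 for q2. Note that for a single term query the
iff factor is the same for every fragment, so only the tf part affects the
ranking there.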

Another thing I saw is that Highlighter seems to break the text arbitrarily
by max-fragment-size, so for the text:
  1 2 x 4 a b x d y B C D
if it happens to be broken into 4-token fragments, then for the query "x y"
the result would be:
  1 2 x 4 - score 1
  a b x d - score 1
  y B C D - score 1
and the first fragment would be selected as 'best', although the fragment
"x d y B" that appears in that text is better, since it contains both query
terms. Again, I am not sure if this is worth the effort - allowing overlap
between candidate fragments - just something to think about.
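
The difference between fixed and overlapping candidates can be sketched
like this. It is only an illustration, not how Highlighter fragments text
today: the class SlidingFragmenter and its methods are made up, and the
score is the plain "unique query terms in the window" count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sliding-window fragment selection; not a Lucene class.
class SlidingFragmenter {
    // Number of distinct query terms inside tokens[start, start + size).
    static int windowScore(List<String> tokens, int start, int size,
                           Set<String> queryTerms) {
        int end = Math.min(start + size, tokens.size());
        Set<String> matched = new HashSet<>(tokens.subList(start, end));
        matched.retainAll(queryTerms);
        return matched.size();
    }

    // Start offset of the first highest-scoring window. step must be > 0:
    // step == size gives disjoint fragments, step == 1 gives full overlap.
    static int bestWindowStart(List<String> tokens, int size, int step,
                               Set<String> queryTerms) {
        int bestStart = 0;
        int bestScore = -1;
        for (int start = 0; start < tokens.size(); start += step) {
            int score = windowScore(tokens, start, size, queryTerms);
            if (score > bestScore) { // strict: ties keep the earlier window
                bestScore = score;
                bestStart = start;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "1", "2", "x", "4", "a", "b", "x", "d", "y", "B", "C", "D");
        Set<String> query = new HashSet<>(Arrays.asList("x", "y"));
        System.out.println("disjoint best start:    " + bestWindowStart(tokens, 4, 4, query));
        System.out.println("overlapping best start: " + bestWindowStart(tokens, 4, 1, query));
    }
}
```

With disjoint 4-token windows all three candidates tie and the first one
wins. With a step of 1 the windows "b x d y" and "x d y B" both contain x
and y, and the earlier of the two is picked. The cost is scoring roughly
size times more candidates.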

>
>
> Doron Cohen wrote:
> > This question was raised in the user's list -
> > http://www.nabble.com/highlighting-tf2322109.html
> >
> > Assume three fragments and two queries:
> >   f1 = aa  11  bb  33  cc
> >   f2 = aa  11  bb  11  cc
> >   f3 = aa  11  bb  22  cc
> >   q1 = 11 22
> >   q2 = 11
> > Now we call highlighter.getBestFragment(q);
> > For q1, f3 is returned, as expected.
> > For q2, f1 is returned, although "11" appears twice in f2 but only once
> > in f1.
> >
> > This is because QueryScorer.getTokenScore(Token) counts only unique
> > fragment tokens.
> >
> > Would it make sense to make this behavior controllable?
> > (It is easily done but I am not sure about the consequences.)
> >
> > Or perhaps there is a way to achieve this behavior (preferring f2 on f1
> > for q2 above) that I missed?
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org

