lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: How to pass additional information into Similarity.scorePayload(...)
Date Fri, 15 Feb 2008 07:32:46 GMT
On Friday 15 February 2008 02:47:14, Cedric Ho wrote:
> Sorry that I didn't make myself clear.
> 
> [10/5/2] means for terms found in the 1st paragraph, give it score*10,
> for terms in the 2nd, give it score*5, etc.
> 
> So I don't know how to do this scoring if the position (paragraph)
> information is in a separate field.

For each word in the input stream make sure that the position
at which it is indexed in an extra field is the same as the paragraph
number. That will involve only allowing a position increment at
a paragraph border during indexing.
Call this extra field the paragraph field if you will.
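
As a rough illustration of the indexing rule above, here is a minimal sketch in plain Java. It does not use Lucene; the class and method names are made up for the example, and it assumes positions are accumulated from -1 so that a first increment of 1 lands on position 0. In a real analyzer the increment would presumably be set on each token of the paragraph field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: models the position bookkeeping only.
class ParagraphPositions {

    /**
     * One position increment per word. The first word of a paragraph carries
     * the pending increment (1 per paragraph border crossed since the last
     * word); every other word gets 0 and stays at its paragraph's position.
     */
    static List<Integer> increments(List<List<String>> paragraphs) {
        List<Integer> incs = new ArrayList<Integer>();
        int pending = 0;
        for (List<String> paragraph : paragraphs) {
            pending++; // crossing a paragraph border
            for (String word : paragraph) {
                incs.add(pending);
                pending = 0;
            }
        }
        return incs;
    }

    /** Accumulates increments into absolute positions (assumed to start at -1). */
    static List<Integer> positions(List<Integer> incs) {
        List<Integer> out = new ArrayList<Integer>();
        int pos = -1;
        for (int inc : incs) {
            pos += inc;
            out.add(pos);
        }
        return out;
    }
}
```

For a document with two words in the first paragraph, an empty second paragraph, and two words in the third, the increments are 1, 0, 2, 0 and the resulting positions are 0, 0, 2, 2, i.e. the zero-based paragraph numbers.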

Then, during search, search for a Term in the paragraph field, and
use the position from that field, i.e. the paragraph number,
to find a weight for the found term.
Have a look at PhraseQuery for how to use term positions during
search. It computes relative positions, but it works on the absolute
positions that it gets from the index.
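
The search-time step can be sketched as a simple table lookup. This is only an illustration in plain Java, not Lucene code: the names are hypothetical, the neutral weight of 1 for paragraphs beyond the table is an assumption, and so is the choice to take the best weight when a term occurs in several paragraphs (the thread does not say how to combine them).

```java
// Hypothetical helper, not part of Lucene.
class ParagraphWeights {
    // The [10/5/2] weights from the thread, indexed by zero-based paragraph number.
    static final float[] WEIGHTS = {10f, 5f, 2f};

    /** Weight for one paragraph number; 1 outside the table (an assumption). */
    static float weightFor(int paragraph) {
        if (paragraph >= 0 && paragraph < WEIGHTS.length) {
            return WEIGHTS[paragraph];
        }
        return 1f;
    }

    /**
     * Factor for a term given the positions read from the paragraph field.
     * Takes the largest weight over all occurrences (an assumption);
     * returns 0 when the term does not occur at all.
     */
    static float boost(int[] paragraphPositions) {
        float best = 0f;
        for (int p : paragraphPositions) {
            best = Math.max(best, weightFor(p));
        }
        return best;
    }
}
```

With this sketch, a term found only in the 2nd and 3rd paragraphs (positions 1 and 2) would have its score multiplied by 5.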

SpanFirstQuery also allows this. It is a bit more involved, but
in the end it works from the same absolute positions in the index.
The version at the JIRA issue (LUCENE-1093) will even allow using the
length of the matching spans as the absolute paragraph number, which, in
turn, allows the use of a Similarity for the paragraph weights [10/5/2].

There is nothing special about indexed term positions; any term can
be indexed at any position in a field. Lucene will take advantage of
the incremental nature of positions by storing only compressed
differences of positions in the index, but during search the original
positions are directly available. You can do the same with payloads,
but why reimplement something that is already available?
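
To make the point about stored differences concrete, here is a minimal sketch of delta coding for a sorted list of positions. It is plain Java with hypothetical names and only shows the differencing step; the actual index format compresses the differences further, which is not modelled here.

```java
// Hypothetical helper, not part of Lucene: delta coding of non-decreasing positions.
class PositionDeltas {

    /** Replaces each position by its difference from the previous one (first is taken from 0). */
    static int[] encode(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    /** Rebuilds the original absolute positions by a running sum. */
    static int[] decode(int[] deltas) {
        int[] positions = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            positions[i] = current;
        }
        return positions;
    }
}
```

Repeated positions, as in a paragraph field where every word of a paragraph shares one position, simply become differences of 0, and decoding returns the original absolute positions unchanged.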

Payloads have better uses than positional info; for one, they are
great for avoiding disjunctions. For example, for verbs, one could
index only the stem and use a payload for the actual inflected
form (singular/plural, past/present, first/second/third person, etc.).
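
One possible way to pack such an inflected form into a single payload byte is sketched below. The bit layout and all names are invented for this example and are not a Lucene convention; the idea is only that a query on the stem can then check the payload instead of OR-ing every inflected form.

```java
// Hypothetical payload layout, not part of Lucene:
// bit 0 = plural, bit 1 = past tense, bits 2-3 = person (1, 2 or 3).
class InflectionPayload {
    static final int PLURAL = 1;
    static final int PAST = 1 << 1;

    /** Packs number, tense and person into one byte. */
    static byte encode(boolean plural, boolean past, int person) {
        int bits = (plural ? PLURAL : 0) | (past ? PAST : 0) | ((person & 3) << 2);
        return (byte) bits;
    }

    static boolean isPlural(byte payload) {
        return (payload & PLURAL) != 0;
    }

    static boolean isPast(byte payload) {
        return (payload & PAST) != 0;
    }

    /** Returns the person (1, 2 or 3) stored in bits 2-3. */
    static int person(byte payload) {
        return (payload >> 2) & 3;
    }
}
```

Under this layout, the stem of a verb would be indexed once per occurrence, with for example `encode(false, true, 3)` as the payload for a singular, past-tense, third-person form.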

Regards,
Paul Elschot


> 
> Cedric
> 
> 
> On Fri, Feb 15, 2008 at 7:15 AM, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> > I have no idea what the [10/5/2] means, so I can't comment on that.
> >  In case I have missed it previously I'm sorry.
> >
> >  My point was that payloads need not be used for different position info.
> >  It's possible to do that, and it may be good for performance in some cases,
> >  but one can revert to using another field for different position info.
> >
> >  Regards,
> >  Paul Elschot
> >
> >
> >  On Thursday 14 February 2008 09:44:40, Cedric Ho wrote:
> >
> >
> > > Hi Paul,
> >  >
> >  > Sorry I am not sure I understand your solution.
> >  >
> >  > Because I would need to apply this scoring logic to all the different
> >  > types of Queries. A search may consist of something like:
> >  >
> >  > +(term1 phrase2 wildcard*)  +spanNear(term3 term4) [10/5/2]
> >  >
> >  > And this [10/5/2] ratio has to be applied to the whole search query
> >  > before it. So I am not sure how using just SpanFirstQuery with a
> >  > separate field would work in this situation.
> >  >
> >  > Anyway, I know my requirement is a bit strange, so it's ok if I can't
> >  > do this in Lucene. I'll settle with using a ThreadLocal to store the
> >  > [10/5/2] weighting and retrieve it in the Similarity.scorePayload(...)
> >  > function.
> >  >
> >  >
> >  > BTW, this problem I am facing now is different from the last one I
> >  > asked here, which you have proposed with the Modified SpanFirstQuery
> >  > solution =)
> >  >
> >  > But I am really grateful for all the help I get here. Keep up the good work!
> >  >
> >  > Cheers,
> >  > Cedric
> >  >
> >  >
> >  > On Thu, Feb 14, 2008 at 2:58 PM, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> >  > > On Thursday 14 February 2008 02:11:24, Cedric Ho wrote:
> >  > >
> >  > >  > I am using Lucene's built-in query classes: TermQuery, PhraseQuery,
> >  > >  > WildcardQuery, BooleanQuery and many of the SpanQueries.
> >  > >  >
> >  > >  > The info I am going to pass in is just some weightings for different
> >  > >  > parts of the indexed contents. For example, if the payload indicates
> >  > >  > that a term is in the 2nd paragraph, then I'll take the weighting for
> >  > >  > the 2nd paragraph and multiply it by the score.
> >  > >  >
> >  > >  > So it seems without writing my own query there's no way to do it?
> >  > >
> >  > >  In case it is only positional information that is stored in the payload
> >  > >  (i.e. some integer number that does not decrease when tokenizing the
> >  > >  document), it is also possible to use an extra field and make sure the
> >  > >  position increment for that field is only positive when the number
> >  > >  (currently your payload) increases.
> >  > >  A SpanFirstQuery on this extra field would almost do, and you will
> >  > >  probably need https://issues.apache.org/jira/browse/LUCENE-1093 .
> >  > >  This will be somewhat slower than using a payload, because the search
> >  > >  will be done in two separate fields, but it will work.
> >  > >
> >  > >  Regards,
> >  > >  Paul Elschot
> >  > >
> >  > >
> >  > >
> >  > >  ---------------------------------------------------------------------
> >  > >  To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >  > >  For additional commands, e-mail: java-user-help@lucene.apache.org
> >  > >
> >  > >
> >  >
> >  >
> >  >
> >  >
> >
> >
> >
> 
> 
> 
> 




