lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cedric Ho" <cedric...@gmail.com>
Subject Re: How to pass additional information into Similarity.scorePayload(...)
Date Sat, 16 Feb 2008 02:54:19 GMT
Thanks ~ Yes it seems this would be quite difficult to achieve with
Lucene. Nevermind, I'll try to figure out a workaround for it.

Thanks for helping =)

Cedric


On Feb 16, 2008 5:30 AM, Paul Elschot <paul.elschot@xs4all.nl> wrote:
> Hi Cedric,
>
> I think I'm beginning to get the point of the [10/5/2],
> and why you called that requirement a bit strange, see below.
>
> To use both normal position info and paragraph position info
> you'll need two separate, one normal, and one for the paragraphs.
>
> To use the normal field to determine the matches, and  the
> paragraph field to determine the weightings of these matches
> the TermPositions of both fields will have to be advanced
> completely in sync. That is possible, but not really nice to do.
> If Lucene had multiple positions for an indexed term, it
> would be straightforward.
> But as long as that is not the case, you'll either have to advance
> the two TermPositions in sync, or use payloads with the
> paragraph numbers.
>
> Or you could relax the paragraph numbering requirement
> into a positional requirement, and use the modified SpanFirstQuery.
> That could be done by using an avarage paragraph length to
> determine the weight at the matching position.
> As this is easy to implement, I'd first implement this and try to sell
> it to the users :)
>
> At that marketing moment you might as well ask the users
> what they think of matches that cross paragraph borders.
> Do you already have a firm requirement for that case?
>
> SpanNotQuery can be used to prevent matches over paragraph
> borders when these are indexed as such, but I would not expect
> that you would need those, given the fuzzyness of the [10/5/2].
>
> Regards,
> Paul Elschot
>
>
> Op Friday 15 February 2008 09:45:58 schreef Cedric Ho:
>
> > Hi Paul,
> >
> > Do you mean the following?
> >
> > e.g. to index this: "first second third <paragraphBorder> forth fifth six"
> >
> > originally it would be indexed as:
> > (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5)
> >
> > now it will be:
> > (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1)
> >
> > Then those Query classes that depends on the positional information
> > (PhraseQuery, SpanQueries) won't work then? unfortunately I'll need
> > those Query classes as well.
> >
> > Cedric
> >
> >
> > >  For each word in the input stream make sure that the position
> > >  at which it is indexed in an extra field is the same as the paragraph
> > >  number. That will involve only allowing a position increment at
> > >  a paragraph border during indexing.
> > >  Call this extra field the paragraph field if you will.
> > >
> > >  Then, during search, search for a Term in paragraph field, and
> > >  use the position from that field, i.e. the paragraph number
> > >  to find a weight for the found term.
> > >  Have a look at PhraseQuery on how to use term positions during
> > >  search. It computes relative positions, but it works on the absolute
> > >  positions that it gets from the index.
> > >
> > >  SpanFirstQuery also allows to do that, it's a bit more involved, but
> > >  in the end it works from the same absolute positions from the index.
> > >  The version at the jira issue will even allow to use the length of the
> > >  matching spans as the absolute paragraph number, which, in turn,
> > >  allows the use of a Similarity for the paragraph weights [10/5/2].
> > >
> > >  There is nothing special about indexed term positions; any term can
> > >  be indexed at any position in a field. Lucene will take advantage of
> > >  the incremental nature of positions by storing only compressed
> > >  differences of positions in the index, but during search the original
> > >  positions are directly available, You can do the same with payloads,
> > >  but why reimplement something that is already available?
> > >
> > >  Payloads have better uses than positional info, for one they are
> > >  great to avoid disjunctions. For example for verbs, one could
> > >  index only the stem and use a payload for the actual inflected
> > >  form (singular/plural, past/present, first/second/third person, etc).
> > >
> > >  Regards,
> > >  Paul Elschot
> > >
> >
>
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message