lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
Date Tue, 01 Feb 2011 20:31:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989374#comment-12989374 ]

Michael McCandless commented on LUCENE-2878:
--------------------------------------------

This patch looks awesome (and, enormous)!  Finally we are making
progress merging Span* into their corresponding non-positional queries
:)

I like how you added payloads to the BulkPostings API, and how
someone is finally testing the bulk positions code.

So now I can run any Query, and ask it to enumerate its positions
(PositionInterval iterator), but not paying any price if I don't want
positions.  And it's finally single source... caller must say
up-front (when pulling the scorer) if it will want positions (and,
separately, also payloads -- great).
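
To make the "say up-front" contract concrete, here is a minimal sketch.
Every name in it (Context, ToyScorer, positions(), payloads()) is
hypothetical and invented for illustration; it is not the API in the
patch.  It only shows the idea that the scorer materializes positions
and payloads when the caller asked for them at construction time, and
hands back null otherwise.

```java
/** Hypothetical stand-ins for illustration; not the classes from the patch. */
final class PositionsOnDemandSketch {

    /** What the caller declares up front, when pulling the scorer. */
    static final class Context {
        final boolean needsPositions;
        final boolean needsPayloads;

        Context(boolean needsPositions, boolean needsPayloads) {
            this.needsPositions = needsPositions;
            this.needsPayloads = needsPayloads;
        }
    }

    /** A toy scorer over a single document's postings for one term. */
    static final class ToyScorer {
        private final int[] positions;   // null unless requested up front
        private final String[] payloads; // null unless requested up front

        ToyScorer(Context ctx, int[] indexedPositions, String[] indexedPayloads) {
            // Only copy (i.e. "decode") what the caller said it wants.
            this.positions = ctx.needsPositions ? indexedPositions.clone() : null;
            this.payloads = ctx.needsPayloads ? indexedPayloads.clone() : null;
        }

        /** Positions for the current document, or null if not requested. */
        int[] positions() {
            return positions;
        }

        /** Payloads for the current document, or null if not requested. */
        String[] payloads() {
            return payloads;
        }
    }
}
```

A caller that only wants scores would pass new Context(false, false)
and never touch the position data at all.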

It's great that you can just emulate spans on the new api with
SpanScorerWrapper/MockSpanQuery, and use PositionFilterQuery to filter
positions from a query, eg to turn a BooleanQuery into whatever
SpanQuery is needed -- very nice!

How does/should scoring work?  EG do the SpanQueries score
according to the details of which position intervals match?

The part I'm wondering about is what API we should use for
communicating positions of the sub scorers in a BooleanQuery to
consumers like position filters (for matching) or eg Highlighter
(which really should be a core functionality that works w/ any query).
Multiplying out ("denormalizing") all combinations (into a flat stream
of PositionIntervals) is going to be too costly in general, I think?

Maybe, instead of the denormalized stream, we could present a
UnionPositionsIntervalIterator, which has multiple subs, where each
sub is its own PositionIntervalIterator?  This way eg a NEAR query
could filter these subs in parallel (like a merge sort) looking for a
match, and (I think) then presenting its own union iterator to whoever
consumes it?  Ie it'd only let through those positions of each sub
that satisfied the NEAR constraint.
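
A rough sketch of that idea, restricted to two subs and an unordered
NEAR with a slop.  All names here (NearFilterSketch, keepNear,
filterNear) are made up for illustration, and plain sorted int arrays
stand in for the per-sub position iterators.  Each sub is walked
against the other with a forward-only cursor, merge-sort style, and
the result keeps the subs separate instead of multiplying out the
pairs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the patch. */
final class NearFilterSketch {

    /**
     * Returns the positions of self that have at least one position of
     * other within slop.  Both inputs must be sorted ascending; the
     * cursor into other only moves forward, so the walk is linear.
     */
    static List<Integer> keepNear(int[] self, int[] other, int slop) {
        List<Integer> kept = new ArrayList<>();
        int j = 0;
        for (int p : self) {
            // Skip positions of other that are too far to the left of p.
            while (j < other.length && other[j] < p - slop) {
                j++;
            }
            // other[j] is now the first candidate at or after p - slop.
            if (j < other.length && other[j] <= p + slop) {
                kept.add(p);
            }
        }
        return kept;
    }

    /**
     * Filters both subs and returns them as a two-element "union":
     * element 0 holds the surviving positions of a, element 1 those of b.
     */
    static List<List<Integer>> filterNear(int[] a, int[] b, int slop) {
        List<List<Integer>> union = new ArrayList<>();
        union.add(keepNear(a, b, slop));
        union.add(keepNear(b, a, slop));
        return union;
    }
}
```

With a = {1, 10, 20}, b = {2, 3, 30} and slop 2, only position 1 of
the first sub and positions 2 and 3 of the second survive.  A consumer
such as a highlighter would then see two short per-sub streams rather
than every (a, b) pairing.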


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>         Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch
>
>
> Currently we have two somewhat separate types of queries: those that can make use
of positions (mainly spans) and payloads (spans), and those that cannot. Yet Span*Query doesn't
really do scoring comparable to what other queries do, and at the end of the day they duplicate
a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such
that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that.
> Besides the Span*Query limitation, other queries lack a quite interesting feature:
they cannot score based on term proximity, since scorers don't expose any positional
information. All those problems have bugged me for a while now, so I started working on that using
the bulkpostings API. I would have done that first cut on trunk, but TermScorer there works
on a BlockReader that does not expose positions, while the one in this branch does. I started adding
a new Positions class which users can pull from a scorer. To prevent unnecessary positions
enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create
the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this
API, and the others simply return null instead.
> To show that the API really works and that our BulkPostings work fine with positions too, I
cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A
nice side effect of this was that the Position BulkReading implementation got some exercise,
and it now :) works with positions throughout, while Payloads for bulk reading are kind of
experimental in the patch and only work with the Standard codec.
> So all spans now work on top of TermScorer ( I truly hate spans since today ) including
the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother to implement the other
codecs yet since I want to get feedback on the API and on this first cut before I go on with
it. I will upload the corresponding patch in a minute.
> I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext)
which I should probably do on trunk first but after that pain today I need a break first :).
> The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still
fails but I didn't look into the MemoryIndex BulkPostings API yet)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

