lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
Date Sun, 28 Oct 2012 21:51:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485723#comment-13485723
] 

Michael McCandless commented on LUCENE-2878:
--------------------------------------------

{quote}
bq. I'm confused on how one uses IntervalIterator along with the Scorer it "belongs" to. Say
I want to visit all Intervals for a given TermQuery ... do I first get the TermScorer and
then call .intervals, up front? And then call TermScorer.nextDoc(), but then how to iterate
over all intervals for that one document? EG, is the caller supposed to call IntervalIterator.scorerAdvanced
for each next'd doc?

so my major goal here was to make this totally detached, optional and lazy ie no additional
code in scorer except of IntervalIterator creation on demand. once you have a scorer you can
call intervals() and get an iterator. This instance can and should be reused while docs are
collected / scored / matched on a given reader. For each doc I need to iterate over intervals
I call scorerAdvanced and update the internal structures this prevents any additional work
if it is not really needed ie. on a complex query/scorer tree. Once the iterator is setup
(scorerAdvanced is called) you can just call next() on it in a loop --> while ((interval
= iter.next) != null) and get all the intervals. makes sense?
{quote}

OK so it sounds like I pull one IntervalIterator up front (and use it
for the whole time), and it's my job to call .scorerAdvanced(docID)
every time I either .nextDoc or .advance the original Scorer?  Ie this
"resets" my IntervalIterator onto the current doc's intervals.

bq. I think the scorer#interval method javadoc make this clear, no? if not I should make it
clear!

I was still confused :)  I'll take a stab at improving it ... also we
should add @experimental...

{quote}
bq. I'm also confused why TermIntervalIterator.collect only collects one interval (I think?).
Shouldn't it collect all intervals for the current doc?

the collect method is special. It's an interface that allows to collect the "current" interval
or all "current" intervals that contributed to a higher level interval. For each next call
you should call collect if you need all the subtrees intervals or the leaves. one usecase
where we do this right now is highlighing. you can highlight based on phrases ie. if you collect
on a BQ or you can do individual terms ie. collect leaves. makes sense?
{quote}

Ahhh .... so it visits all intervals in the query tree leading up to
the current match interval (of the top query) that you've iterated to?
OK.  Maybe we can find a better name ... can't think of one now :)

                
> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, gsoc2012, lucene-gsoc-11, lucene-gsoc-12, mentor
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch,
PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can make use
of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring
comparable to what other queries do and at the end of the day they are duplicating lot of
code all over lucene. Span*Queries are also limited to other Span*Query instances such that
you can not use a TermQuery or a BooleanQuery with SpanNear or anthing like that. 
> Beside of the Span*Query limitation other queries lacking a quiet interesting feature
since they can not score based on term proximity since scores doesn't expose any positional
information. All those problems bugged me for a while now so I stared working on that using
the bulkpostings API. I would have done that first cut on trunk but TermScorer is working
on BlockReader that do not expose positions while the one in this branch does. I started adding
a new Positions class which users can pull from a scorer, to prevent unnecessary positions
enums I added ScorerContext#needsPositions and eventually Scorere#needsPayloads to create
the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this
API and other simply return null instead. 
> To show that the API really works and our BulkPostings work fine too with positions I
cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A
nice sideeffect of this was that the Position BulkReading implementation got some exercise
which now :) work all with positions while Payloads for bulkreading are kind of experimental
in the patch and those only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) including
the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother to implement the other
codecs yet since I want to get feedback on the API and on this first cut before I go one with
it. I will upload the corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext)
which I should probably do on trunk first but after that pain today I need a break first :).
> The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still
fails but I didn't look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message