lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans
Date Sat, 10 Nov 2012 22:25:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494785#comment-13494785
] 

Alan Woodward commented on LUCENE-2878:
---------------------------------------

I've started to use this branch in an (experimental!) system I'm developing for a client.
 The good news is that performance is generally much better than the existing system that
uses SpanQueries - faster query time and smaller memory footprint, and also nicer GC behaviour
(I can't give exact numbers, but suffice to say that where the previous system regularly ran
out of memory, this one hasn't yet)!

There are definitely some rough edges, though, which I'll try and smooth out and add as patches.

1) There isn't a replacement for SpanNotQueries - the BrouwerianIterator comes close, but
doesn't quite cover all the use cases.  In this instance, I need to have the equivalent of
a 'not within' operator - match intervals that do not fall within another given interval.
 I've written a new iterator, which I've called an 'InverseBrouwerianIntervalIterator' for
want of a better name, but it definitely could do with some more eyes on it...

2) The API is not very nice when it comes to subclassing Iterators.  For example, I have 'anchor'
terms at the start and end of documents, which allow users to query for terms within a certain
distance from them.  These shouldn't be highlighted, so I created an AnchorTermQuery which
returned a different type of IntervalIterator that didn't do anything in its collect() method.
 To do this, I had to create an AnchorTermWeight, an AnchorTermScorer and an AnchorTermIntervalIterator,
all of which were more or less copy-pastes of the equivalent Term* classes; it would be nice
to make this easier...

3) MultiTermQueries don't return iterators unless you set their rewrite policies to something
other than CONSTANT_SCORE_REWRITE.

4) I found a bug in the iterators() method of DisjunctionSumScorer - if all subscorers are
PositionFilterScorers, then you can get NPEs if the subscorers have matches that don't pass
the filters.  I'll add a test case shortly.

5) I had to run this without my scoring patch (this case doesn't actually use scoring, so
it doesn't matter that much), because MultiTermQueries can blow up in scoring if they get
rewritten into blank queries; I guess this wasn't a problem with Span* queries, but I haven't
had a chance to work out how to get round it.  Will add another test case for this as well.
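
To make the 'not within' semantics from point 1 concrete, here is a minimal, self-contained sketch. It does not use the positions branch API and is not the InverseBrouwerianIntervalIterator itself; the Interval class and the notWithin() helper are hypothetical names made up for illustration, and intervals are assumed to be closed ranges of term positions within a single document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NotWithinSketch {

    /** Hypothetical closed interval [start, end] of term positions in one document. */
    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** True if this interval lies entirely inside the other one. */
        boolean isWithin(Interval other) {
            return other.start <= start && end <= other.end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + "]";
        }
    }

    /** Keeps only the candidates that are not contained in any excluding interval. */
    static List<Interval> notWithin(List<Interval> candidates, List<Interval> excluding) {
        List<Interval> kept = new ArrayList<>();
        for (Interval candidate : candidates) {
            boolean contained = false;
            for (Interval outer : excluding) {
                if (candidate.isWithin(outer)) {
                    contained = true;
                    break;
                }
            }
            if (!contained) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Interval> candidates = Arrays.asList(
                new Interval(2, 3), new Interval(8, 9), new Interval(14, 16));
        List<Interval> excluding = Arrays.asList(
                new Interval(0, 5), new Interval(15, 20));
        // [2,3] is dropped (inside [0,5]); [14,16] only overlaps [15,20], so it is kept.
        System.out.println(notWithin(candidates, excluding)); // prints [[8,9], [14,16]]
    }
}
```

A real iterator would advance lazily over sorted intervals rather than comparing every pair; the nested loop is only meant to pin down which matches survive.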

All in all, though, these are looking much better than the equivalent SpanQueries.  Position
filters on boolean queries in particular work much better - the semantics of SpanQueries are
completely wrong for this, and involved generating very heavy queries for pretty simple cases.
 Nice work!
                
> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, gsoc2012, lucene-gsoc-11, lucene-gsoc-12, mentor
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch,
LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, LUCENE-2878-vs-trunk.patch,
PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can make use
of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring
comparable to what other queries do, and at the end of the day they are duplicating a lot of
code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that
you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting feature:
they cannot score based on term proximity, since scorers don't expose any positional
information. All those problems bugged me for a while now, so I started working on that using
the bulkpostings API. I would have done that first cut on trunk, but TermScorer is working
on a BlockReader that does not expose positions, while the one in this branch does. I started adding
a new Positions class which users can pull from a scorer; to prevent unnecessary positions
enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create
the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this
API and others simply return null instead. 
> To show that the API really works and our BulkPostings work fine too with positions, I
cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A
nice side effect of this was that the Position BulkReading implementation got some exercise,
which now :) all works with positions, while Payloads for bulkreading are kind of experimental
in the patch and those only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) including
the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother to implement the other
codecs yet since I want to get feedback on the API and on this first cut before I go on with
it. I will upload the corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext)
which I should probably do on trunk first but after that pain today I need a break first :).
> The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still
fails but I didn't look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira