lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
Date Mon, 27 Apr 2009 20:52:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703362#action_12703362
] 

Shai Erera commented on LUCENE-1593:
------------------------------------

bq. I wonder if, instead, we could define an Object getSentinelObject(), returning null by
default, and the pqueue on creation would call that and if it's non-null, fill the queue (by
calling it maxSize times)?

Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ,
which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the
items stored in the queue. But ... I guess it can return a Comparable which always prefers
the other element ... so maybe that's not a good example.
I just have the feeling that a setter method will give us more freedom, rather than having
to extend PQ just for that ... 

bq. Actually.... shouldn't Weight.scorer(...) in general be the place where such "pre-next()
initializatoin" is done?

Ok - so BS's add() is only called from BS2.score(Collector). Therefore BS can be initialized
from BS2 directly. Since both are package-private, we should be safe.
BS2's add() is called from BooleanWeight.scorer() (I'm sorry if I repeat what you wrote above,
but that's just me learning the code), and so can be initialized from there ... hmm I wonder
why this wasn't done so far?

I'll give it a try.

bq. That basically undoes the "don't fallback to docID" optimization right?

Right ... it's too late for me :)

bq. I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not
so bad?

It's just that maintaining that class becomes more and more problematic. It already contains
6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug
is found, all 6 need to be checked and fixed. With that proposal, it means 12 ...

But I wonder from where would we control it ... IndexSearcher no longer has a ctor which allows
to define whether docs should be collected in order or not (the patch removes it). The only
other place where it's defined is in BooleanQuery's static setter (affects all boolean queries).
But BooleanWeight receives the Collector, and does not create it ...
So, if we check in IndexSearcher's search() methods whether this parameter is set or not,
we can control the creation of TSDC and TFC. And if anyone else instantiates them on his own,
he should know whether he executes searches in-order or not. Back-compat-wise, TFC and TSDC
are still in trunk and haven't been released, so we shouldn't have a problem right?

Does that sound like a good approach? I still hate to duplicate the code in TFC, but I don't
think there's any other choice. Maybe create completely separate classes for TFC and TSDC?
although that will make code maintenance even harder.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
>                 Key: LUCENE-1593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1593
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove
unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segements in increasing doc Id order, instead of
numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always
have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove
the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the
last SortField. But, doing so should not be necessary (since we already break ties by docID),
and is in fact less efficient (once the above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add
a addDummyObjects method which will populate the queue without "arranging" it, just store
the objects in the array (this can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message