lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"
Date Mon, 24 Aug 2009 10:03:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746797#action_12746797
] 

Michael McCandless commented on LUCENE-1821:
--------------------------------------------

{quote}
bq. decent comparator (StringOrdValComparator) that operates per segment.

Still, the StringOrdValComparator will have to break down and call String.equals() whenever
it compares docs in different IndexReaders
{quote}

Agreed, it will be slower than a top-level ords cache, but I'm
wondering what the impact turns out to be in practice, in your case.
Also, since Lucene has already done this, maybe you could use its
StringOrdValComparator instead of having to cut over yours to
segment-based operation.

Or, better, work up a patch for a "forced" top-level StringComparator
for apps that don't mind the slow commit time and possible risk of
burning memory, in exchange for faster sorting.

Actually, sorting (during collection) already gives you the docBase, so
shouldn't your app already have the context it needs for this?
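The docBase-during-collection pattern being referred to can be sketched without any Lucene types: in 2.9, Collector.setNextReader(IndexReader, int docBase) hands the collector a per-segment base, and collect(int doc) sees segment-relative ids. A minimal stand-in sketch (OffsetCollector is an illustrative name, not a Lucene class):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the per-segment collection pattern in Lucene 2.9:
// setNextReader(...) hands the collector a docBase, and collect(doc)
// sees segment-relative ids that must be rebased to the top-level id space.
public class OffsetCollector {
    private int docBase;                       // base of the current segment
    private final List<Integer> globalIds = new ArrayList<Integer>();

    // Called once per segment, analogous to Collector.setNextReader(reader, docBase).
    public void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    // Called per hit with a segment-relative doc id.
    public void collect(int doc) {
        globalIds.add(docBase + doc);          // rebase to top-level id
    }

    public List<Integer> getGlobalIds() {
        return globalIds;
    }

    public static void main(String[] args) {
        OffsetCollector c = new OffsetCollector();
        c.setNextReader(0);    // first segment starts at base 0
        c.collect(3);
        c.setNextReader(100);  // next segment's base = sum of prior maxDocs
        c.collect(3);          // same segment-local id, different global id
        System.out.println(c.getGlobalIds());
    }
}
```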

{quote}
The idea of this is to disable per segment searching?
I don't actually want to do that. I want to use the per-segment searching functionality to take
advantage of caches on a per-segment basis where possible, and map docs to the IndexSearcher
context when I can't do per-segment caching.
{quote}

OK

{quote}
bq. Could you compute the top-level ords, but then break it up per-segment?

I think I see what you're getting at here, and I've already thought of this as a potential solution.
The cache will always need to be created at the top-most level, but it will be pre-broken
out into a per-segment cache whose context is the top-level IndexSearcher/MultiReader. The
biggest problem here is the complexity of actually creating such a cache, which I'm sure will
translate to this cache loading slower (hard to say how much slower without implementing it).
I do plan to try this approach, but I expect this will be at least a week or two out from
now.

I've currently updated my code for this to work per-segment by adding the docBase when performing
the lookup into this cache (which is per-IndexSearcher).
I did this using the getIndexReaderBase() function I added to my subclass of IndexSearcher
during Scorer construction time (I can live with this; however, I would like to see getIndexReaderBase()
added to IndexSearcher, and the IndexSearcher passed to Weight.scorer(), so I don't need to
hold onto my IndexSearcher subclass in my Weight implementation)
{quote}

OK, that sounds like at least a workable solution.

{quote}
bq. just return the "virtual" per-segment DocIdSet.

That's what I'm doing now. I use the docid base for the IndexReader, along with its maxDoc,
to have the Scorer represent a virtual slice for just the segment in question.
The only real problem here is that during Scorer initialization I have to call fullDocIdSetIter.advance(docBase)
in the Scorer constructor. If advance(int) for the DocIdSet in question is O(N), this adds
an extra penalty per segment that did not exist before.
{quote}

Hmm... is advance in fact costly for your DocIdSets?
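The cost of that up-front advance depends entirely on the DocIdSet implementation, but the "virtual slice" idea itself is simple to sketch with plain types. In this illustrative sketch (SliceIterator and the sorted-array backing are stand-ins, not Lucene classes), the wrapper advances to docBase once in the constructor, then exposes only ids in [docBase, docBase + maxDoc), rebased to be segment-relative:

```java
// Sketch of the "virtual slice" described above: wrap an iterator over
// top-level doc ids, advance it to docBase once up front, and expose only
// ids inside this segment's range, rebased to segment-relative ids.
public class SliceIterator {
    private final int[] docs;     // sorted top-level doc ids (stand-in DocIdSet)
    private final int docBase;    // first top-level id of this segment
    private final int maxDoc;     // number of docs in this segment
    private int pos;

    public SliceIterator(int[] docs, int docBase, int maxDoc) {
        this.docs = docs;
        this.docBase = docBase;
        this.maxDoc = maxDoc;
        // The up-front advance(docBase) under discussion. Here it is a linear
        // scan, i.e. the O(N) per-segment penalty being worried about; a
        // binary search over the sorted array would make it O(log N).
        while (pos < docs.length && docs[pos] < docBase) {
            pos++;
        }
    }

    // Returns the next segment-relative doc id, or -1 when the slice is exhausted.
    public int nextDoc() {
        if (pos >= docs.length || docs[pos] >= docBase + maxDoc) {
            return -1;
        }
        return docs[pos++] - docBase;  // rebase top-level id to the segment
    }
}
```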

{quote}
bq. This isn't a long-term solution, since the order in which Lucene visits the readers isn't
in general guaranteed,

that's where IndexSearcher.getIndexReaderBase(IndexReader) comes into play. If you call this
in your scorer to get the docBase, it doesn't matter what order the segments are searched
in, since it'll always return the proper base (in the context of that IndexSearcher)
{quote}

I think adding that one method for 2.9 would make sense (marking it
expert, subject to change). Because... assuming your app is OK with
somehow (privately, external to Lucene) having access to the top
IndexSearcher via its custom Weight, this one method would let you
avoid subclassing IndexSearcher.


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 3.1
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a Scorer to know
the "actual" doc id for the documents it matches (only the relative doc offset into the segment)
> If your scorer uses caches that are based on the "entire" index (all segments), there
is now no way to index into them properly from inside a Scorer, because the scorer is not passed
the offset needed to calculate the "real" docid
> Suggest having the Weight.scorer() method also take an integer for the doc offset
> The abstract Weight class should have a constructor that takes this offset, as well as a method
to get the offset
> All Weights that have "sub" weights must pass this offset down to the created "sub" weights
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * During Weight creation, the Weight must hold onto a reference to the passed-in Searcher
(cast to your subclass)
> * During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * The Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation is possible if you cache the result of gatherSubReaders
in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List<IndexReader> readers = new ArrayList<IndexReader>();
>     gatherSubReaders(readers);
>     int maxDoc = 0;
>     for (IndexReader r : readers) {
>       if (r == reader) {
>         return maxDoc;
>       }
>       maxDoc += r.maxDoc();
>     }
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround means you cannot serialize your custom Weight implementation
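The NOTE in the workaround code above (cache the result of gatherSubReaders up front) can be made concrete: precompute each sub-reader's base once, so every lookup is O(1) instead of a walk over the sub-readers. A hedged sketch; CachedReaderBase and ReaderStandIn are illustrative stand-ins (only maxDoc() is needed from IndexReader), not Lucene classes:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Cached variant of getIndexReaderBase(): gather the sub-readers once and
// precompute each reader's docBase, instead of re-walking the list per call.
public class CachedReaderBase {
    // Illustrative stand-in for IndexReader; only maxDoc() is needed here.
    public interface ReaderStandIn {
        int maxDoc();
    }

    // Identity map: readers are matched by reference, as in the snippet above.
    private final Map<ReaderStandIn, Integer> bases =
            new IdentityHashMap<ReaderStandIn, Integer>();

    // subReaders plays the role of the list filled by gatherSubReaders().
    public CachedReaderBase(ReaderStandIn[] subReaders) {
        int maxDoc = 0;
        for (ReaderStandIn r : subReaders) {
            bases.put(r, maxDoc);   // base = sum of the preceding maxDocs
            maxDoc += r.maxDoc();
        }
    }

    // O(1) replacement for the linear walk in the workaround code.
    public int getIndexReaderBase(ReaderStandIn reader) {
        Integer base = bases.get(reader);
        return base == null ? -1 : base;  // -1: reader not in this searcher
    }
}
```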

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

