lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
Date Mon, 08 Dec 2008 14:57:18 GMT

Mark Miller wrote:

> Michael McCandless wrote:
>> Mark Miller wrote:
>>> What do we get from this though? A MultiSearcher (with the   
>>> scoring issues) that can properly do rewrite? Won't we have to  
>>> take MultiSearchers scoring baggage into this as well?
>> If this can work, what we'd get is far better reopen() performance
>> when you sort-by-field, with no change to the returned results
>> (rewrite, scores, sort order are identical).
>> Say you have 1MM doc index, and then you add 100 docs & commit.
>> Today, when you reopen() and then do a search, FieldCache recomputes
>> from scratch (iterating through all Terms in entire index) the global
>> arrays for the fields you're sorting on.  The cost is in proportion  
>> to
>> total index size.
>> With this change, only the new segment's terms will be iterated on,  
>> so
>> the cost is in proportion to what new segments appeared.
>> This is the same benefit we are seeking with LUCENE-831, for all uses
>> of FieldCache (not just sort-by-field), it's just that I think we can
>> achieve this speedup to sort-by-field without LUCENE-831.
> Yup, I'm with you on all that. Except the without LUCENE-831 part -  
> we need some FieldCache meddling right? The current FieldCache  
> approach doesn't allow us to meddle much. Isn't it more like, we  
> want the LUCENE-831 API (or something similar), but we won't need  
> the objectarray or merge stuff?

We wouldn't need any change to FieldCache, because we only ask  
FieldCache for int[] (eg) on the SegmentReader instances.  Because  
reopen() shares SegmentReader instances, only the new segments would  
have a cache miss in FieldCache.  I think?

Once we do LUCENE-831, minus objectarray and merging, this change  
would be basically the same, ie, accessing per-segment int values,  
just with a new API.  Ie, by doing this change first I don't think  
we're going to waste much in then cutting over in the future to  
LUCENE-831's API (vs waiting for LUCENE-831 api).

>> I think there would be no change to the scoring: we would still  
>> create
>> a Weight based on the toplevel IndexReader, but then search each
>> sub-reader separately, using that Weight.
>> Though... that is unusual (to create a Weight with the parent
>> IndexSearcher and then use it in the sub-searchers) -- will something
>> break if we do that?  (This is new territory for me).
> Okay, right. That does change things. Would love to hear more  
> opinions, but that certainly seems reasonable to me. You score each  
> segment using tf/idf stats from all of the segments.

That's my expectation (hope).  So the results are identical but  
performance is much better.

>> If something will break, I think we can still achieve this, but it
>> will be a more invasive change and probably will have to be re- 
>> coupled
>> to the new API we will introduce with LUCENE-831.  Marvin actually
>> referred to how to do this, here:
>> #action_12650854
>> in the paragraph starting with "If our goal is minimal impact...".
>> Basically during collection, the FieldSortedHitQueue would have to
>> keep track of subReaderIndex/subReaderDocID (mapping, through
>> iteration, from the primary docID w/o doing a wasteful new binary
>> search for each) and enroll into different pqueues indexed by
>> subReaderIndex, then do the merge sort in the end.
>> Mike
>>> Michael McCandless wrote:
>>>> On thinking more about this... I think with a few small changes we
>>>> could achieve Sort by field without materializing a full array.  We
>>>> can decouple this change from LUCENE-831.
>>>> I think all that's needed is:
>>>> * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
>>>>   IndexReader.getSubReaders.  Default impl could just return
>>>>   length-1 array of itself.
>>>> * Change IndexSearcher.sort that takes a Sort, to first call
>>>>   IndexReader.getSubReaders, and then do the same logic that
>>>>   MultiSearcher does, with improvements from LUCENE-1471 (run
>>>>   separate search per-reader, then merge-sort the top hits from
>>>>   each).
>>>> The results should be functionally identical to what we have today,
>>>> but, searching after doing a reopen() should be much faster since  
>>>> we'd
>>>> no longer re-build the global FieldCache array.
>>>> Does this make sense?  It's a small change for a big win, I think.
>>>> Does anyone want to take a crack at this patch?
>>>> Mike
>>>> Mark Miller wrote:
>>>>> Michael McCandless wrote:
>>>>>> I'd like to decouple "upgraded to Object" vs "materialize full  
>>>>>> array", ie, so we can access native values w/o materializing  
>>>>>> the full array.  I also think "upgrade to Object" is dangerous  
>>>>>> to even offer since it's so costly.
>>>>> I'm right with you. I didn't think the Object approach was  
>>>>> really an upgrade (beyond losing the merge, which is especially  
>>>>> important for StringIndex - it has no merge option at the  
>>>>> moment) which is why I left both options for now. So I def agree  
>>>>> we need to move to iterator, drop object, etc.
>>>>> Its the doin' that aint so easy. The iterator approach seems  
>>>>> somewhat straightforward (though its complicated by needing to  
>>>>> provide a random access object as well), but I'm still working  
>>>>> through how we control so many iterator types (I dont see how  
>>>>> you can use polymorphism yet ).
>>>>> - Mark
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message