lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
Date Mon, 08 Dec 2008 16:25:05 GMT
I tried a quick poor mans version using a MultiSearcher and wrapping the 
sub readers as searchers. Other than some AUTO sort field detection 
problems, all tests do appear to pass. The new sort stuff for 
MultiSearcher may be a tiny bit off...sort tests fail, though are only 
slightly off, with that patch. Havn't looked further yet - just hacked 
it up real quick. Seems to work, but needs work.

- Mark


Michael McCandless wrote:
>
> Mark Miller wrote:
>
>> Michael McCandless wrote:
>>>
>>> Mark Miller wrote:
>>>
>>>> What do we get from this though? A MultiSearcher (with the  scoring 
>>>> issues) that can properly do rewrite? Won't we have to take 
>>>> MultiSearchers scoring baggage into this as well?
>>>
>>> If this can work, what we'd get is far better reopen() performance
>>> when you sort-by-field, with no change to the returned results
>>> (rewrite, scores, sort order are identical).
>>>
>>> Say you have 1MM doc index, and then you add 100 docs & commit.
>>> Today, when you reopen() and then do a search, FieldCache recomputes
>>> from scratch (iterating through all Terms in entire index) the global
>>> arrays for the fields you're sorting on.  The cost is in proportion to
>>> total index size.
>>>
>>> With this change, only the new segment's terms will be iterated on, so
>>> the cost is in proportion to what new segments appeared.
>>>
>>> This is the same benefit we are seeking with LUCENE-831, for all uses
>>> of FieldCache (not just sort-by-field), it's just that I think we can
>>> achieve this speedup to sort-by-field without LUCENE-831.
>>
>> Yup, I'm with you on all that. Except the without LUCENE-831 part - 
>> we need some FieldCache meddling right? The current FieldCache 
>> approach doesn't allow us to meddle much. Isn't it more like, we want 
>> the LUCENE-831 API (or something similar), but we won't need the 
>> objectarray or merge stuff?
>
> We wouldn't need any change to FieldCache, because we only ask 
> FieldCache for int[] (eg) on the SegmentReader instances.  Because 
> reopen() shares SegmentReader instances, only the new segments would 
> have a cache miss in FieldCache.  I think?
>
> Once we do LUCENE-831, minus objectarray and merging, this change 
> would be basically the same, ie, accessing per-segment int values, 
> just with a new API.  Ie, by doing this change first I don't think 
> we're going to waste much in then cutting over in the future to 
> LUCENE-831's API (vs waiting for LUCENE-831 api).
>
>>> I think there would be no change to the scoring: we would still create
>>> a Weight based on the toplevel IndexReader, but then search each
>>> sub-reader separately, using that Weight.
>>>
>>> Though... that is unusual (to create a Weight with the parent
>>> IndexSearcher and then use it in the sub-searchers) -- will something
>>> break if we do that?  (This is new territory for me).
>>
>> Okay, right. That does change things. Would love to hear more 
>> opinions, but that certainly seems reasonable to me. You score each 
>> segment using tf/idf stats from all of the segments.
>
> That's my expectation (hope).  So the results are identical but 
> performance is much better.
>
>>> If something will break, I think we can still achieve this, but it
>>> will be a more invasive change and probably will have to be re-coupled
>>> to the new API we will introduce with LUCENE-831.  Marvin actually
>>> referred to how to do this, here:
>>>
>>>  https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854

>>>
>>>
>>> in the paragraph starting with "If our goal is minimal impact...".
>>> Basically during collection, the FieldSortedHitQueue would have to
>>> keep track of subReaderIndex/subReaderDocID (mapping, through
>>> iteration, from the primary docID w/o doing a wasteful new binary
>>> search for each) and enroll into different pqueues indexed by
>>> subReaderIndex, then do the merge sort in the end.
>>>
>>> Mike
>>>
>>>>
>>>> Michael McCandless wrote:
>>>>>
>>>>> On thinking more about this... I think with a few small changes we
>>>>> could achieve Sort by field without materializing a full array.  We
>>>>> can decouple this change from LUCENE-831.
>>>>>
>>>>> I think all that's needed is:
>>>>>
>>>>> * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
>>>>>   IndexReader.getSubReaders.  Default impl could just return
>>>>>   length-1 array of itself.
>>>>>
>>>>> * Change IndexSearcher.sort that takes a Sort, to first call
>>>>>   IndexReader.getSubReaders, and then do the same logic that
>>>>>   MultiSearcher does, with improvements from LUCENE-1471 (run
>>>>>   separate search per-reader, then merge-sort the top hits from
>>>>>   each).
>>>>>
>>>>> The results should be functionally identical to what we have today,
>>>>> but, searching after doing a reopen() should be much faster since 
>>>>> we'd
>>>>> no longer re-build the global FieldCache array.
>>>>>
>>>>> Does this make sense?  It's a small change for a big win, I think.
>>>>> Does anyone want to take a crack at this patch?
>>>>>
>>>>> Mike
>>>>>
>>>>> Mark Miller wrote:
>>>>>
>>>>>> Michael McCandless wrote:
>>>>>>>
>>>>>>> I'd like to decouple "upgraded to Object" vs "materialize full

>>>>>>> array", ie, so we can access native values w/o materializing
the 
>>>>>>> full array.  I also think "upgrade to Object" is dangerous to

>>>>>>> even offer since it's so costly.
>>>>>>>
>>>>>>>
>>>>>> I'm right with you. I didn't think the Object approach was really

>>>>>> an upgrade (beyond losing the merge, which is especially 
>>>>>> important for StringIndex - it has no merge option at the moment)

>>>>>> which is why I left both options for now. So I def agree we need

>>>>>> to move to iterator, drop object, etc.
>>>>>>
>>>>>> Its the doin' that aint so easy. The iterator approach seems 
>>>>>> somewhat straightforward (though its complicated by needing to 
>>>>>> provide a random access object as well), but I'm still working 
>>>>>> through how we control so many iterator types (I dont see how you

>>>>>> can use polymorphism yet ).
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> ---------------------------------------------------------------------

>>>>>>
>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message