lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
Date Fri, 23 Jan 2009 18:07:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666601#action_12666601
] 

markrmiller@gmail.com edited comment on LUCENE-1483 at 1/23/09 10:06 AM:
---------------------------------------------------------------

bq. I was looking after the initial warmup, but noticed no difference. Maybe the string field
I used was not distinct enough. What is a good number for a noticeable speed improve (50%
distinct terms?).

He's not saying after the warm up, but that the warm up itself should be faster because of that.

It's because of this:

The old way, if you had 5 segments with unique term counts of 50,000, 6000, 6000,
5, 5, we would try to load all 62,010 terms for every segment: 62,010
x 5 = 310,050.

With the new way, we load 50,000 terms for the first, 6000 for the next, then 6000, then 5
and 5: total of 62,010.

Even though most of the 62,010 won't be found in the 5-term segment, it still takes a long
time to check them all. So the more unique terms and the more segments, the worse the problem
got.
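
To make the arithmetic above concrete, here is a small sketch that just counts term lookups for the two approaches. It is illustrative only and not Lucene code: the class and method names are made up, and it assumes (as the numbers above do) that the segments share no terms, so the combined term count is the sum of the per-segment counts.

```java
// Illustrative arithmetic only; TermLoadCost is a hypothetical name, not part of Lucene.
final class TermLoadCost {

    // Old way: every segment is checked against the combined term list.
    // Assumes the segments share no terms, so the combined count is the sum.
    static long oldWayLookups(int[] uniqueTermsPerSegment) {
        long combined = 0;
        for (int n : uniqueTermsPerSegment) {
            combined += n;
        }
        return combined * uniqueTermsPerSegment.length;
    }

    // New way: each segment only enumerates its own terms.
    static long newWayLookups(int[] uniqueTermsPerSegment) {
        long total = 0;
        for (int n : uniqueTermsPerSegment) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] segments = {50_000, 6_000, 6_000, 5, 5};
        System.out.println("old way: " + oldWayLookups(segments)); // 310050
        System.out.println("new way: " + newWayLookups(segments)); // 62010
    }
}
```

With these numbers the old way does five times the work, and the ratio grows with the number of segments.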

*edit*
little fix on those numbers

      was (Author: markrmiller@gmail.com):
    bq. I was looking after the initial warmup, but noticed no difference. Maybe the string
field I used was not distinct enough. What is a good number for a noticeable speed improve
(50% distinct terms?).

Hes not saying after the warm up, but that the warm up should be faster based on that.

Its because of this:

The old way, if you had 5 segments with unique terms distributions of 50,000, 6000, 6000,
5, 5, then for the old way, we would try to load all 62,010 terms for every segment - 50000
x 5 -310,050.

With the new way, we load 50,000 terms for the first, 6000 for the next, then 6000, then 5
and 5: total of 62,010.

Even though most of the 50,000 wont be found in the 5 term segment, it still takes a long
time to check them all. So the more unique terms and the more segments, the worse the problem
got.

*edit*
little fix on those numbers
  
> Change IndexSearcher multisegment searches to search each individual segment using a
single HitCollector
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch,
LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py,
sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The current
method of searching multiple segments is to use a MultiSegmentReader and treat all of the
segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes
reopen expensive. If only a few segments change, the FieldCache is still loaded for all of
them.
> This patch changes things by searching each individual segment one at a time, but sharing
the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed
on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple
segments can be much faster as well. With the old method, all unique terms from every segment
are enumerated against each segment, and because of the likely logarithmic change in terms per
segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document
statistics from the multireader are used to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub searcher.
Ordinals are not comparable across segments. To account for this, a new field-sort-enabled
HitCollector is introduced that is able to collect and sort across segments (because of its
ability to compare ordinals across segments). This TopFieldCollector class will collect the
values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values
so that they can be compared against the values for the new segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in both sorted
and non-sorted search. We were seeing a noticeable loss on indices with lots of segments (1000?)
and certain queue sizes / queries, but the latest results seem to show that's been mostly taken
care of (you shouldn't be using such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders.
Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders
and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. Part of the
TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader.
All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
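
The issue text above says ordinals are not comparable across segments and have to be translated when the collector moves to the next segment. A minimal sketch of that idea follows. It is not the code from the patch: the class name, the method, and the use of plain sorted String arrays to stand in for each segment's term list are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration; not the TopFieldCollector / FieldComparator code from the patch.
final class OrdinalTranslation {

    // Maps a value carried over from an earlier segment onto the ordinal space of
    // the new segment, whose terms are given in sorted order.
    // If the value exists in the new segment, its index there is returned.
    // Otherwise the index of the greatest smaller term is returned (-1 if none),
    // so ties on the returned ordinal still need a comparison by value.
    static int translate(String value, String[] newSegmentSortedTerms) {
        int idx = Arrays.binarySearch(newSegmentSortedTerms, value);
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        String[] segmentA = {"apple", "cherry", "pear"};
        String[] segmentB = {"banana", "cherry", "fig"};

        // "pear" has ordinal 2 in segment A and "fig" has ordinal 2 in segment B,
        // yet the values differ, so the raw ordinals cannot be compared directly.
        System.out.println(segmentA[2] + " vs " + segmentB[2]);

        // Translating the values kept from segment A into segment B's ordinal space.
        for (String value : segmentA) {
            System.out.println(value + " -> " + translate(value, segmentB));
        }
    }
}
```

In this sketch a value that is missing from the new segment lands on the same ordinal as its nearest smaller neighbour, so a real comparator would have to fall back to comparing the values themselves in that case.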

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

