Mailing-List: contact lucy-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Fri, 1 May 2009 08:48:57 -0700
To: lucy-dev@lucene.apache.org
Subject: SortCollector
Message-ID: <20090501154857.GA28441@rectangular.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey <marvin@rectangular.com>

As I've been working up segment-centric sorted search for Lucy, I've been
spelunking the Lucene files TopScoreDocCollector.java and
TopFieldCollector.java.  I've also been following the JIRA issue LUCENE-1593
"Optimizations to TopScoreDocCollector and TopFieldCollector" (though the
discussion there has gotten very long and drifted off-topic, so I don't
necessarily recommend it).

It's hard to wrap your head around the code in TopScoreDocCollector.java and
TopFieldCollector.java until you realize that there's much less there than
meets the eye -- those two files are an egregious violation against the
principle of DRY, with lots of copy-and-paste code.  TopFieldCollector.java
contains 6 classes which are nearly the same.  

  * OneComparatorNonScoringCollector
  * OneComparatorScoringNoMaxScoreCollector
  * OneComparatorScoringMaxScoreCollector
  * MultiComparatorNonScoringCollector
  * MultiComparatorScoringMaxScoreCollector
  * MultiComparatorScoringNoMaxScoreCollector

This has been done because the collect() method in Collector subclasses is
used in tight loops, and what could have been a single class has been unrolled
for the sake of shaving CPU cycles on each loop iter.

In LUCENE-1593, a further doubling of the class count in TopFieldCollector to
12 is being contemplated, bringing the total number of near-identical classes
to 13 when you include TopScoreDocCollector.

I hope we can achieve the same ends in Lucy with at most 2 classes:
"SortCollector" and "TopDocCollector".

Maybe even just SortCollector.

SortCollector will have have two main components: a priority queue, and an
array of HitComparator objects.  It will behave similarly to the last class in
TopFieldColector, MultiComparatorScoringNoMaxScoreCollector.  Determining
whether a document is "competitive" and should be added to the inner queue
involves iterating over the comparators until one of them returns a non-zero
comparison.  Something like this:

    u32_t i = 0;
    i32_t comparison;
    u8_t *const comparator_types = self->comparator_types;
    const u32_t num_comparators  = self->num_comparators;
    HitComparator **const comparators = self->comparators;

    do {
        switch (comparator_types[i]) {
            case HitComp_BY_SCORE:   
                comparison = SI_compare_by_score(a, b);
                break;
            case HitComp_BY_SCORE_REV:   
                comparison = SI_compare_by_score(b, a);
                break;
            case HitComp_BY_DOC:     
                comparison = SI_compare_by_doc(a, b);
                break;
            case HitComp_BY_DOC_REV:     
                comparison = SI_compare_by_doc(b, a);
                break;
            case HitComp_BY_ORD2:    
                comparison = SI_compare_by_ord2(comparators[i], a, b);
                break;
            case HitComp_BY_ORD2_REV:    
                comparison = SI_compare_by_ord2(comparators[i], b, a);
                break;
            case HitComp_BY_ORD4:
                comparison = SI_compare_by_ord4(comparators[i], a, b);
                break;
            case HitComp_BY_ORD4_REV:
                comparison = SI_compare_by_ord4(comparators[i], b, a);
                break;
            case HitComp_BY_ORD8:    
                comparison = SI_compare_by_ord8(comparators[i], a, b);
                break;
            case HitComp_BY_ORD8_REV:    
                comparison = SI_compare_by_ord8(comparators[i], b, a);
                break;
            case HitComp_BY_ORD16:   
                comparison = SI_compare_by_ord16(comparators[i], a, b);
                break;
            case HitComp_BY_ORD16_REV:   
                comparison = SI_compare_by_ord16(comparators[i], b, a);
                break;
            case HitComp_BY_ORD32:   
                comparison = SI_compare_by_ord32(comparators[i], a, b);
                break;
            case HitComp_BY_ORD32_REV:   
                comparison = SI_compare_by_ord32(comparators[i], b, a);
                break;
            default:         
                comparison = 0;
                THROW("UNEXPECTED comparator type %u8", comparator_types[i]);
        }

    } while (comparison == 0 && ++i < num_comparators);

This algo has several advantages over the current Lucene implementation.

First off, tracking "max score" is a Lucene legacy requirement, so Lucy
doesn't need it.

Second, Lucy pushes the bulk of sort costs back into index time.  All sortable
fields are required to provide an mmap'd array of ords.  Even if we add new
binary FieldTypes like Int32FieldType or add support for custom sort
comparators, we will require the new sort types to provide ord arrays.

Third, the Lucene code makes method calls for each comparison, but for ord
comparisons, we can use static inline functions which resolve to array lookups.
(Coding convention: I use the prefix "SI_" for all static inline functions, and
the prefix "S_" for static functions -- so SI_compare_by_ord2() and friends are
static inline.)

Fourth, we might get a small gain by factoring the "reverse" argument to
SortSpec into the HitComparator "type", which is a u8_t.  So long as we limit
the total number of branches in the switch() statement to less than 256, it
should compile to a jump table, so reversing sort order will be cost-free:

    case HitComp_BY_ORD2:    
        comparison = SI_compare_by_ord2(comparators[i], a, b);
        break;
    case HitComp_BY_ORD2_REV:    
        comparison = SI_compare_by_ord2(comparators[i], b, a);
        break;

Lucene performs a multiply instead:

    final int c = reverseMul[i] * comparators[i].compareBottom(doc);

Multiply ops are cheap, though, so I'd be surprised if that turned out to make
a measurable difference.

I found one additional inefficiency in the Lucene implementation: score() is
called twice for "competitive" docs.  (This isn't true for
TopScoreDocCollector, which is the main class for collecting scored docs --
it's only true for the TopFieldDoc implementations that also track score). It
would be nice to avoid that, but it probably doesn't amount to much. 

For ordinary searches, sort order is determined by score, with doc num
breaking ties.  This behavior can be achieved using SortCollector like so:

    if (!sort_spec) {
        self->comparator_types[0] = HitComp_BY_SCORE;
        self->comparator_types[1] = HitComp_BY_DOC;
    }

The only difference between such a SortCollector and TopDocCollector is that
in TopDocCollector, the comparison is hard-wired and you save 2-4 pointer
dereferences, 2-4 array lookups, and 1-2 switch() ops.  For most Scorers, the
cost of calling Scorer_Next() and Scorer_Score(), plus the overhead of calling
SortCollector_Collect() itself, will swamp those savings.  However, a pure
TermScorer, which operates its own custom collection loop, might in theory
show a measurable difference, theoretically justifying the addition of
TopDocCollector pending the outcome of benchmarking.

However, I question whether that should be a priority.  If we really want to
max out performance for simple queries, we need to avoid eliminate method
calls altogether, get right next to mmap'd, PFOR-encoded posting data, and
collect hits into a purpose-built priority queue operating with mostly static
inline functions.  A "Matcher_TopDocs()" method, perhaps.

Speaking of purpose-built priority queues...  

It might also make sense to move the priority queue code which powers HitQueue
into SortCollector.  We can then hard-code it for dealing with our particular
sort algo, so that PriQ_Less_Than() becomes a static inline function.  

I don't expect much of a speedup, if any, from inlining HitQueue within
SortCollector.  However, doing things that way allows us to avoid duplicating
the above comparison code which iterates over the array of HitComparator
objects.

Instead, we duplicate the priority queue code -- but since the heap sort algo
is decades old, the small amount of code required to implement it is likely to
stay stable.

Marvin Humphrey