incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: SortCollector
Date Wed, 20 May 2009 00:18:12 GMT
I've now finished a provisional implementation of SortCollector, based on
mmap'd sort caches.

Custom sort orders are supported via overriding FieldType's Compare_Values()
method in a subclass.

    public i64_t
    Compare_Values(FieldType *self, Obj *a, Obj *b);
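
For illustration, here's a minimal sketch of such an override with Lucy's object model stubbed out (the struct layouts and function names are hypothetical; only the sign-flip that reverses the sort order is the point):

```c
#include <string.h>
#include <stdint.h>

typedef int64_t i64_t;

/* Stand-ins for Lucy's Obj and FieldType (hypothetical layouts). */
typedef struct { const char *str; } Obj;
typedef struct FieldType FieldType;

/* Default behavior: ascending lexical order. */
static i64_t
default_compare_values(FieldType *self, Obj *a, Obj *b) {
    (void)self;
    return (i64_t)strcmp(a->str, b->str);
}

/* Subclass override: flip the sign to get a descending sort. */
static i64_t
reverse_compare_values(FieldType *self, Obj *a, Obj *b) {
    return -default_compare_values(self, a, b);
}
```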

SortCollector has replaced both TopDocCollector and an older SortCollector
module in the KS code base, and there's now only one HitQueue-based sorting
collector class.

SortWriter doesn't scale well yet, because it just uses a Hash to uniquify
values.  We'll address that with an external sorter -- but that coming change
won't affect SortCollector.

On Fri, May 01, 2009 at 05:45:26PM -0400, Michael McCandless wrote:

> Ideally, in the future, I would love to always favor code simplicity
> (as you generally do with Lucy) in Lucene's core and then rely on
> source code specialization (LUCENE-1594) to write the
> ugly-but-super-fast code.

A compiler/code-generator, basically.

For SortCollector, I think there's still room for algorithmic improvement --
especially on Lucy where there are no back-compat constraints.  It's too soon
to resort to heavy-duty unrolling techniques to eke out those last few cycles.
Sometimes the "replace conditionals with polymorphism" refactoring technique
yields great results, but in this case, the conditionals have not yet been
shown to be a bottleneck.

> Float/Double will have to compute separate ords (at indexing time)?
> (Lucene just uses the float/double values themselves).

We should probably just use values.  That means we'll have to deal with endian
issues on every comparison, though, since we're reading directly off of disk
rather than pre-loading.  Same thing with integers, including datestamp types.

(Incidentally, we have a related inefficiency when loading text sort cache
values: for security reasons, we'll have to check UTF-8 validity on each
load.)
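
As an illustration of that per-load check, a minimal structural UTF-8 validator might look like the following sketch (it rejects truncated sequences and stray continuation bytes; a hardened version would also reject overlong encodings and surrogates):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if buf[0..len) is structurally well-formed UTF-8. */
static bool
utf8_valid(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t extra;
        if      (b < 0x80)           extra = 0;  /* ASCII */
        else if ((b & 0xE0) == 0xC0) extra = 1;  /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) extra = 2;  /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) extra = 3;  /* 4-byte sequence */
        else return false;          /* stray continuation or invalid lead */
        if (i + extra >= len) return false;      /* truncated sequence */
        for (size_t j = 1; j <= extra; j++) {
            if ((buf[i + j] & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```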

My inclination is to encode big-endian consistently.  If the byte-swapping
proves to be costly, my second choice would be to go with ords.  I'd prefer to
avoid making byte order configurable.
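
A sketch of what a per-comparison decode might look like under the consistent big-endian choice (the function name and layout are assumptions, not the actual file format); reading byte-by-byte sidesteps both host byte order and alignment issues with mmap'd data:

```c
#include <stdint.h>

/* Decode a big-endian 32-bit integer straight out of an mmap'd buffer. */
static int32_t
decode_bigend_i32(const uint8_t *buf) {
    uint32_t raw = ((uint32_t)buf[0] << 24)
                 | ((uint32_t)buf[1] << 16)
                 | ((uint32_t)buf[2] <<  8)
                 |  (uint32_t)buf[3];
    return (int32_t)raw;
}
```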

> Do you plan on allowing arbitrary bit-sized ords, not just
> 2/4/8/16/32?  (We've discussed that).

In theory we could.  But it's a lot of bit twiddling for diminishing returns.

I'm certainly not going to write such code myself -- I've got more important
things to do.  But I wouldn't be excited about accepting a patch, either,
because that means the community takes on responsibility for maintenance of
some gnarly code in return for a marginal benefit, while the file format spec
gets more difficult to support.

> Requiring custom comparators to compile down to ords at indexing time
> is possibly too restrictive?

I expect the common use case will be custom-sorted text fields.  The user
wouldn't have to know anything about the ords; we'd generate all that behind
the scenes after sorting using their custom Compare_Values() method.

What other use cases do we anticipate?

> You're working per-segment right?

Yes.  Segment-centric sorted search has been implemented.  

> > Third, the Lucene code makes method calls for each comparison, but for ord
> > comparisons, we can use static inline functions which resolve to array lookups.
> > (Coding convention: I use the prefix "SI_" for all static inline functions, and
> > the prefix "S_" for static functions -- so SI_compare_by_ord2() and friends are
> > static inline.)
> But you're trading off a switch statement instead...

A switch statement which compiles to a jump table is cheaper than a method
call -- no function call overhead, and except for the scoring case, these are
leaf blocks.
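
To make the tradeoff concrete, here's a minimal sketch of that style of dispatch (the action names echo the ones used elsewhere in this thread, but `SI_dispatch` and the ord-array arguments are hypothetical; by the convention in this sketch, a negative return means the new hit wins):

```c
#include <stdint.h>

enum { COMPARE_BY_ORD1, COMPARE_BY_ORD4, AUTO_ACCEPT };

/* Dispatch on a comparator action code.  A switch like this typically
 * compiles to a jump table, and each arm is a leaf: just array loads
 * and a compare, with no function call overhead. */
static inline int
SI_dispatch(int action, const uint8_t *ords1, const int32_t *ords4,
            int32_t tick_a, int32_t tick_b) {
    switch (action) {
        case COMPARE_BY_ORD1:
            return (int)ords1[tick_a] - (int)ords1[tick_b];
        case COMPARE_BY_ORD4:
            return (ords4[tick_a] > ords4[tick_b])
                 - (ords4[tick_a] < ords4[tick_b]);
        case AUTO_ACCEPT:
            return -1;  /* treat every hit as competitive */
    }
    return 0;
}
```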

The goal for SortCollector is not only to avoid code duplication; I'm also
hoping to write something *faster*.

> and since you're not specializing, you're eg making the common case (sort by
> one field) pay the looping overhead of the uncommon case.

A "do-while" loop was chosen over a "for" loop so that we wouldn't incur
looping overhead costs unless the first comparison produced a tie.

Even better, I've figured out how to incorporate the "queue full" check into
the switch: we start out with the first comparator type set to AUTO_ACCEPT,
then change it once the queue is full.

    if (SI_competitive(self, doc_id)) {
        /* ... */

        /* Insert the new MatchDoc. */
        self->bumped = (MatchDoc*)HitQ_Jostle(self->hit_q, (Obj*)match_doc);

        if (!self->bumped) {
            /* The queue isn't full yet. */
            VArray *values = self->need_values
                           ? VA_new(self->num_rules)
                           : NULL;
            float score = self->need_score ? F32_NEGINF : F32_NAN;
            self->bumped = MatchDoc_new(I32_MAX, score, values);
        }
        else if (self->bumped == match_doc) {
            /* The queue is full, and we have established a threshold for this
             * segment as to what sort of document is definitely not
             * acceptable.  Turn off AUTO_ACCEPT and start actually testing
             * whether hits are competitive. */
            self->actions      = self->derived_actions;
            self->bubble_doc   = doc_id;
            self->bubble_score = match_doc->score;
        }
        /* ... */
    }

With regards to the "common case (sort by one field)"... it probably makes
sense to consider two common cases:

  * Sort by one field then doc id.
  * Sort by score then doc id.

Sort-by-field requires polymorphism, which SortCollector's switch() statement
implements at less expense than Lucene's method calls.  Here's
an abridged version of the present implementation for SI_competitive(), 
followed by a single-comparator variant:

    /* Multi-comparator sort-by-field. */
    static INLINE bool_t
    SI_competitive(SortCollector *self, i32_t doc_id)
    {
        u8_t *const actions = self->actions;
        u32_t i = 0;

        do {
            switch (actions[i]) {
                case AUTO_ACCEPT:
                    return true;
                /* ... */
            }
        } while (++i < self->num_comps);
        return false;
    }

    /* Single-comparator sort-by-field. */
    static INLINE bool_t
    SI_competitive(SingleCompSortCollector *self, i32_t doc_id)
    {
        switch (self->action) {
            /* ... */
        }
        return false;
    }

The only savings we get by breaking out a single-comparator subclass are:

  * A single iterator variable initialization (i = 0).
  * A single array deref at the top (actions[i]).
  * Loop termination overhead -- an increment, a dereferenced load, and a
    compare op -- but only in the event of a tie.

I just don't see how that matters so much that we need to double our class
count.

With regards to sort-by-score.... We have an opportunity to achieve meaningful
gains with specialized collection for term scorers once we implement PFOR, and
it's counter-productive to spend time writing, testing, and maintaining a
TopDocCollector class that doesn't go far enough.

> > I found one additional inefficiency in the Lucene implementation: score() is
> > called twice for "competitive" docs.  (This isn't true for
> > TopScoreDocCollector, which is the main class for collecting scored docs --
> > it's only true for the TopFieldDoc implementations that also track score). It
> > would be nice to avoid that, but it probably doesn't amount to much.
> Please give details on where that's happening... we tried not to do
> that.

This was an error.  I missed the ScoreCachingWrapperScorer aspect of Lucene's
implementation.

> > For ordinary searches, sort order is determined by score, with doc num
> > breaking ties.  This behavior can be achieved using SortCollector like so:
> >
> >    if (!sort_spec) {
> >        self->comparator_types[0] = HitComp_BY_SCORE;
> >        self->comparator_types[1] = HitComp_BY_DOC;
> >    }
> The thing is, often that tie breaking isn't needed (since most Scorers
> visit docs in order) in the first and by far most frequently executed
> comparison (compareBottom in Lucene), that tests whether the hit
> is even competitive.

Yes, you're right.

I've changed the implementation to accommodate that point.  HitQueue now tacks
on the COMPARE_BY_DOC_ID action, but SortCollector doesn't.  SortCollector's
SI_competitive() test returns false if all comparators tie.
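
That all-tie policy reduces to a small loop; here's a hedged sketch, assuming each comparator's verdict has already been collapsed to a signed value (negative meaning the new hit beats the incumbent; the function name is hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true only if some comparator strictly favors the new hit.
 * An all-comparators tie returns false: since Scorers visit docs in
 * order, the incumbent has the smaller doc id and keeps its slot,
 * with no explicit doc-id comparison needed in the common path. */
static bool
si_competitive_sketch(const int64_t *verdicts, uint32_t num_comps) {
    for (uint32_t i = 0; i < num_comps; i++) {
        if (verdicts[i] < 0) { return true;  }  /* new hit wins   */
        if (verdicts[i] > 0) { return false; }  /* incumbent wins */
    }
    return false;  /* all tied: not competitive */
}
```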

> > For most Scorers, the
> > cost of calling Scorer_Next() and Scorer_Score(), plus the overhead of calling
> > SortCollector_Collect() itself, will swamp those savings.
> I find this logic dangerous and somewhat "self fulfilling".  To say
> "it's in the noise so we won't bother saving it" eventually leads over
> time to more and more noise, thus perpetuating that reasoning.  Savings
> are savings.  Think of how much electricity you're about to spend...

Why spend time on stuff that doesn't matter?

An opposing philosophy, popular in dynamic language development communities,
prioritizes developer time over CPU time:

  Step 1: Write code that's correct.
  Step 2: If it's not fast enough, profile it to identify trouble spots
          and work on the worst offenders first.  Stop optimizing as soon as
          the code is fast enough.

I don't think that approach is appropriate for systems programming, but it
highlights an important point: developer time is a finite resource.

We seem to agree that working hard to get the implementation as
theoretically efficient as possible on the first go-round is important.
Where we seem to diverge is what happens when there's a tradeoff between
theoretical efficiency and code simplicity.  However, since you aren't
advocating wholesale porting to processor-specific assembler, the difference of
opinion is only one of degree.

In my opinion, aggressive unrolling only makes sense for end-of-life code
bases where no further algorithmic improvements are possible.  Before that,
it's a distraction.

This SortCollector/TopDocCollector dichotomy is a good example.  I don't want
to waste time writing, testing, maintaining, and ultimately deprecating a
TopDocCollector which saves a few meaningless cycles when we could be focusing
on how to speed single-term queries with scored collection that works right up
against mmap'd posting files and eliminates all function-call overhead.

Until we get a juicy patch like that, I'm content to leave a single
SortCollector in place, with its reasonably simple and easy-to-grok
implementation taunting optimization-crazed developers and daring them to do
better.

> This sounds like a baby step towards source code specialization... you're
> conflating matching & collection (maybe only for TermQuery?).

Only because...

  1) Single-term searches are extremely common.
  2) It's a unique case where function call overhead would be dominated by
     HC_Collect() rather than Scorer_Next() and Scorer_Score().  That wouldn't
     be the case with any compound scorer.

I don't actually think it's a step towards incorporating a compiler.  It's a
one-off optimization.

Compilers are sometimes the only way to solve certain problems, and sometimes
they are the best solutions -- I certainly believe that's the case with
Boilerplater and the autogeneration of Lucy bindings.  However, compilers
introduce a layer of indirection between the programmer and the code that can
make the implementation difficult to spelunk or debug, so equivalent literal
code is often preferable.

> You can push it further: take that bottom value comparison and push
> it down into your TermDocs iteration as it reads the docs from PFOR.

I like it!  :)

Marvin Humphrey
