incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: SortCollector
Date Wed, 20 May 2009 03:14:54 GMT
On Sat, May 02, 2009 at 12:51:22PM -0600, Nathan Kurz wrote:

> > When iterating over hits, SortCollector maintains the priority queue which
> > keeps track of the highest sorting/scoring documents.
> I did get that, but I'm still confused about who owns the SortCache.

The SortCache belongs to a SortReader.  SortReader is a SegReader component.

Right now SortReader is a plain old class.  In the future, it will be an
abstract class, and you will be able to specify your own subclass if you wish,
using Architecture.

> Is it considered to be just another index format?  

We will codify the sort cache index format in the Lucy file format spec.

> Is the SortCollector in addition to the HitCollector or a substitute

HitCollector is an abstract class.  SortCollector is a subclass of
HitCollector.

> And does the integration of results from multiple servers (or segments)
> happen above or below this?

Integration happens within SortCollector's HitQueue.

SortCollector's SI_competitive() routine determines whether a doc has a chance
of survival within the HitQueue.  That step uses ords.

Once a doc passes SI_competitive(), field values are retrieved from the cache
and tacked onto a MatchDoc which gets inserted into the HitQueue.  Within the
HitQueue, actual values are compared and thus hits from multiple segments can
compete with each other on equal footing.

Scorers are compiled per-segment.  Before each scorer is processed,
HitCollector's Set_Reader(), Set_Base(), and Set_Scorer() methods must be
called.  It is during SortColl_Set_Reader() that the SortCaches for the
segment are acquired by the SortCollector.

From Searcher_Collect(), in Searcher.c:

    /* Accumulate hits into the HitCollector. */
    for (i = 0, max = VA_Get_Size(seg_readers); i < max; i++) {
        SegReader *seg_reader = (SegReader*)VA_Fetch(seg_readers, i);
        Scorer    *scorer     = Compiler_Make_Scorer(compiler, seg_reader);
        if (scorer) {
            i32_t seg_start = I32Arr_Get(seg_starts, i);
            Matcher *deletions = SegReader_Deletions(seg_reader);
            HC_Set_Reader(collector, seg_reader);
            HC_Set_Base(collector, seg_start);
            HC_Set_Scorer(collector, scorer);
            Scorer_Collect(scorer, collector, deletions);
        }
    }

> For example, assume I've got a corpus of movies reviews and I want to
> search for "fulltext:insightful rating>3 date>01/01/2008" and I want to
> return results ordered by "date".   I have real-time updates to the
> index and thus have multiple segments.   What happens?

That's an ANDQuery with three children: a TermQuery and two RangeQueries.
They compile down to an ANDScorer, which matches a set of doc ids.

As you iterate over the ANDScorer's doc ids, each one gets fed to
SortColl_Collect().  Within the SortCollector, results are ordered using the
'date' field's sort cache: first using ords at the SI_competitive() stage, then
using values within the HitQueue.

It so happens that one of the RangeScorer children of the ANDScorer is also
accessing the 'date' field's sort cache, but that's to find matches -- a
separate concern from ordering.

> > For floats, we should compare values directly, as Mike pointed out in his
> > reply.  I asserted that we should use ords for integer types in my initial
> > post, but that's not the right way to do things.
> This seems right, and I wonder further if you could simplify the logic
> in your SortCollector by casting the all the Ords to Ints and then
> using a single comparison function.  Store them as packed as you wish,
> but upgrade them to full width before doing the comparison.
> Ideally, this would be transparent to the SortCollector and completely
> internal to the SortCache.   

Well, I'm trying to avoid extra method calls within SortCollector.  The idea
is to cache the results of your potentially sluggish comparison op at index
time.

We still need to call the potentially sluggish comparison op within the
HitQueue, but that scales with the number of hits requested rather than the
size of the index, and should be manageable.

> It might even be efficient to cast
> everything to a Float or Double and get rid of the entire switch()
> statement.

I don't see how this would be possible based on the current implementation.
But now there's something to look at, and you might say something different
after having looked at it.  Or not.  :)

> I'm not sure of the exact way to implement this, but I think this idea
> has potential.  Store the ordinals as compactly as you can, but then
> cast them in any order preserving manner so that sorting is equivalent
> to scoring.

I think I've got retrieving and comparing the ords pretty much maxed out
speed-wise, as far as what I'd been planning to do.  Hopefully I'm wrong about
that, though.  :)

> So the SortCache is an optimization internal to a Scorer used to winnow the
> results down to a manageable subset, and we still might need to use more
> expensive means to reconcile multiple result sets?

Pretty much.  The sort cache belongs to the SortReader though, not the Scorer.

The Scorer spits out doc ids.  When sorting by one or more fields, SortCaches
are used to determine the sort order of that doc id within the segment, and
then to retrieve values so that the doc id can be compared with others outside
of its segment.

> I'm still worried about how to go about combining these result sets,
> though, particularly when doing things like asking for results deep
> within the ordered list (ie, results 1000-1100).  But maybe this can
> just be solved with additional constraints in the query?

Well, that would require pouring N result sets with up to 1100 hits each into
a single 1100-element final HitQueue.  A lot of MatchDoc objects would be
created, and a lot of real-value comparisons would take place.  Oh well,
whatcha gonna do?

Marvin Humphrey
