incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: SortCollector
Date Fri, 01 May 2009 23:02:04 GMT
On Fri, May 01, 2009 at 01:18:22PM -0600, Nathan Kurz wrote:
> but I'm not quite getting the bigger picture.  Could you contextualize a
> little?

When iterating over hits, SortCollector maintains the priority queue which
keeps track of the highest sorting/scoring documents.

  while (0 != (doc_num = Scorer_Next(scorer))) {
    HC_Collect(collector, doc_num);

Because Collect() is invoked for every hit, keeping its cost down is a major

> My instinct is that you're putting a little too much knowledge of the
> internals here into SortCollector.  

It's true that we're utilizing a switch based on a limited set of internal
APIs, rather than allowing for arbitrary extensibility through polymorphism.

However, it's possible to write your own HitCollector subclass.

> Is there a difference between 'Score' and 'Float'?  

Yes, there is.  The HitCollector caches a Matcher/Scorer.  SortCollector would
only ask that Matcher/Scorer to calculate a score if needed.  

For floats, we should compare values directly, as Mike pointed out in his
reply.  I asserted that we should use ords for integer types in my initial
post, but that's not the right way to do things.

> Between Ord8 and Int8?

Hmm, they're not really the same thing.  "TextOrd8" instead of "Ord8" might
make things clearer.

> And will there be a way to just fall back on String comparisons,
> bypassing the Sort cache?  

I don't think so.

The sort cache is not just the array of ords; it's also the values.  Fields
that aren't marked as "sortable" don't have easily accessible values -- you
have to go into the doc storage and deserialize the whole document to get at

> I'd like to view the cache as an optimization, like adding an index to
> database column[1].  

Well, this is one of the ways that Lucy differs from traditional databases.
It's optimized for maximally fast searches, in exchange for degraded
index-time performance and convenience.  You can't just ask it to sort on any
field; you need to plan which fields to sort on in advance.

> But it feels like you are making it a requirement.

We also need to be able to compare values directly when reconciling hits from
different segments.  One of the things I didn't mention in the original post
was that SortCollector as it was being described would only work for single

But the post was long enough already.  :)

> As always, I'm worried that maybe I won't be able to abuse your
> interface in ways it was not designed.  :)  In particular, I wonder
> about integrating hits from multiple networked machines and being able
> to iteratively return all the sorted results for big databases.

Perceptive of you to pick up on that.  :) 

In theory, that problem is solved.  The ords arrays are only used within
segments.  When reconciling hits from different segments -- or different
machines (!) -- we use real values.

In practice, the code that performs the "real value" comparisons still has to
be written, and I don't yet have a plan for that that I'm happy with.
(There's old code in KS that does it, but it's always sucked and I've been
looking forward to junking it for a long time.)

> [1]  It does feel like you're redeveloping a lot of SQLite here!

Well, partly that's intentional, in that I'm trying to make sure we exploit
existing database development theory whenver we can.  But they're still
different beasts, even if we add SQL-esque types, because of the bias towards
search-time over index-time.

Marvin Humphrey

View raw message