incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] RangeQuery and multi-value fields
Date Tue, 21 Jun 2011 18:15:20 GMT
On Tue, Jun 21, 2011 at 12:42:43AM -0500, Peter Karman wrote:
> Lucy does not (yet) support multi-value fields natively.

While it is true that Lucy has a different document model from Lucene and does
not support "multi-value fields", Lucene's "multi-value field" model does not
support sorting on such fields, either. 

    * Lucene sorts on the first (i.e. lexically-least) token in a multi-value
      field.
    * Lucy sorts on the complete field value.

In both cases, there is only one value associated with each document which
determines sort order.

Building a performant sort structure that behaves any other way is extremely
challenging.

> I want to override the behavior of the RangeQuery class to support my pseudo
> multi-value fields, which I achieve by concatenating values with the \x03 byte.

I suggest adding the document multiple times, once for each unique value of the
multi-value field you want to sort on.  (That's what I've done when faced with
this problem.)

Theoretically, there's another option: capture all hits and "post-sort" them
outside of Lucy after they are returned.  However, that requires that you
figure out what value within the field matched for each document, which is
going to be hard and slow.  And of course the approach won't scale.

> It looks like there are 2 possible approaches:
> 
>  * override the static methods in core/Lucy/Search/RangeQuery.c for
> find_lower_bound() and find_upper_bound(), or
> 
>  * the core/Lucy/Index/SortCache.c Find() and Value() functions, so that they
> can split the field values on the \x03 delimiter and treat each substring as a
> separate value.
> 
> It seems like that second option is the better one since that should also affect
> sorting, which would be a nice side effect. Maybe. :/

The real problem is the ords array, which is immutable,
single-value-per-document, and built at index-time.  Most sorting is not done
according to the what SortCache_Value() returns; the bulk of the comparisons
are performed within core/Lucy/Search/Collector/SortCollector.c using routines
such as this one:

    static INLINE int32_t
    SI_compare_by_ord4(SortCollector *self, uint32_t tick, int32_t a, int32_t b) {
        void *const ords = self->ord_arrays[tick];
        int32_t a_ord = NumUtil_u4get(ords, a); 
        int32_t b_ord = NumUtil_u4get(ords, b); 
        return a_ord - b_ord;
    }

For background, see the "Sort file format" thread from 2009 where Mike
McCandless and I hashed out the design:

    http://markmail.org/message/cfajhki4px5l34fm

> I'm wondering if (a) SortCache or RangeCompiler could/should be exposed as
> public classes for overriding, and/or (b) if I'm just way off on this line of
> thought.

At some point, it would be nice to expose hooks which would allow sorting of
search results by some arbitrary external resource.  That would allow us to
take the non-scaling post-sort approach described above and execute it during
the initial search, eliminating the need to capture all documents prior to
sorting.

Another theoretical possibility is proposed here:

    http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
    https://issues.apache.org/jira/browse/LUCENE-2454

Both approaches are difficult and would require a lot of work.  I'm
unenthusiastic about the second option because it would entail adding a lot of
complexity to Lucy's internals.

Marvin Humphrey


Mime
View raw message