Date: Tue, 21 Jun 2011 11:15:20 -0700
From: Marvin Humphrey
To: lucy-dev@incubator.apache.org, peter@peknet.com
Message-ID: <20110621181520.GA22200@rectangular.com>
In-Reply-To: <4E002F53.3080507@peknet.com>
Subject: Re: [lucy-dev] RangeQuery and multi-value fields

On Tue, Jun 21, 2011 at 12:42:43AM -0500, Peter Karman wrote:

> Lucy does not (yet) support multi-value fields natively.

While it is true that Lucy has a different document model from Lucene and does
not support "multi-value fields", Lucene's "multi-value field" model does not
support sorting on such fields, either.

  * Lucene sorts on the first (i.e. lexically-least) token in a multi-value
    field.
  * Lucy sorts on the complete field value.

In both cases, there is only one value associated with each document which
determines sort order. Building a performant sort structure that behaves any
other way is extremely challenging.

> I want to override the behavior of the RangeQuery class to support my pseudo
> multi-value fields, which I achieve by concatenating values with the \x03 byte.

I suggest adding the document multiple times, once for each unique value of the
multi-value field you want to sort on. (That's what I've done when faced with
this problem.)

Theoretically, there's another option: capture all hits and "post-sort" them
outside of Lucy after they are returned. However, that requires that you figure
out what value within the field matched for each document, which is going to be
hard and slow. And of course the approach won't scale.

> It looks like there are 2 possible approaches:
>
> * override the static methods in core/Lucy/Search/RangeQuery.c for
>   find_lower_bound() and find_upper_bound(), or
>
> * the core/Lucy/Index/SortCache.c Find() and Value() functions, so that they
>   can split the field values on the \x03 delimiter and treat each substring as a
>   separate value.
>
> It seems like that second option is the better one since that should also affect
> sorting, which would be a nice side effect.

Maybe. :/

The real problem is the ords array, which is immutable,
single-value-per-document, and built at index-time.

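To make the single-ord constraint concrete, here is a minimal sketch. It is not
Lucy's actual SortCache code: build_ords(), compare_by_ord(), and MAX_DOCS are
hypothetical names invented for illustration, and the real on-disk format is
more compact. The sketch assumes each document's ord is simply the rank of its
complete field value among the sorted unique values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; not Lucy's real SortCache layout. */
#define MAX_DOCS 16

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b) {
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* "Index time": give every document exactly one ord, namely the rank of its
 * complete field value among the sorted unique values.  Returns the number
 * of unique values. */
static int32_t build_ords(const char *const *values, int32_t num_docs,
                          int32_t *ords) {
    const char *sorted[MAX_DOCS];
    int32_t num_unique = 0;

    assert(num_docs >= 0 && num_docs <= MAX_DOCS);
    memcpy(sorted, values, (size_t)num_docs * sizeof(char *));
    qsort(sorted, (size_t)num_docs, sizeof(char *), cmp_str);

    /* Collapse duplicates in place so that ranks are dense. */
    for (int32_t i = 0; i < num_docs; i++) {
        if (num_unique == 0
            || strcmp(sorted[num_unique - 1], sorted[i]) != 0) {
            sorted[num_unique++] = sorted[i];
        }
    }

    /* Look up each document's rank (linear scan keeps the sketch short). */
    for (int32_t doc = 0; doc < num_docs; doc++) {
        for (int32_t ord = 0; ord < num_unique; ord++) {
            if (strcmp(values[doc], sorted[ord]) == 0) {
                ords[doc] = ord;
                break;
            }
        }
    }
    return num_unique;
}

/* "Search time": comparing two documents never touches the strings. */
static int32_t compare_by_ord(const int32_t *ords, int32_t a, int32_t b) {
    return ords[a] - ords[b];
}
```

With the three field values "pear", "apple\x03zebra", and "banana", the
documents get ords 2, 0, and 1. The concatenated value is ranked as a whole, so
"zebra" never receives an ord of its own, and splitting on \x03 inside Value()
would not change what the comparison sees.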
Most sorting is not done according to what SortCache_Value() returns; the bulk
of the comparisons are performed within
core/Lucy/Search/Collector/SortCollector.c using routines such as this one:

    static INLINE int32_t
    SI_compare_by_ord4(SortCollector *self, uint32_t tick,
                       int32_t a, int32_t b) {
        void *const ords = self->ord_arrays[tick];
        int32_t a_ord = NumUtil_u4get(ords, a);
        int32_t b_ord = NumUtil_u4get(ords, b);
        return a_ord - b_ord;
    }

For background, see the "Sort file format" thread from 2009 where Mike
McCandless and I hashed out the design:

    http://markmail.org/message/cfajhki4px5l34fm

> I'm wondering if (a) SortCache or RangeCompiler could/should be exposed as
> public classes for overriding, and/or (b) if I'm just way off on this line of
> thought.

At some point, it would be nice to expose hooks which would allow sorting of
search results by some arbitrary external resource. That would allow us to take
the non-scaling post-sort approach described above and execute it during the
initial search, eliminating the need to capture all documents prior to sorting.

Another theoretical possibility is proposed here:

    http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
    https://issues.apache.org/jira/browse/LUCENE-2454

Both approaches are difficult and would require a lot of work. I'm
unenthusiastic about the second option because it would entail adding a lot of
complexity to Lucy's internals.

Marvin Humphrey