Date: Mon, 13 Aug 2007 15:26:58 -0400
From: "Peter Keegan"
To: java-user@lucene.apache.org
Subject: Re: SpanQuery and database join
In-Reply-To: <359a92830708131209p62e9988bw9416d9af0fcf2711@mail.gmail.com>

I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.

Peter

On 8/13/07, Erick Erickson wrote:
>
> Thanks for writing this up. Do you think this is an appropriate subject
> for the Wiki performance page?
>
> Erick
>
> On 8/13/07, Peter Keegan wrote:
> >
> > I've been experimenting with using SpanQuery to perform what is
> > essentially a limited type of database 'join'. Each document in the
> > index contains 1 or more 'rows' of meta data from another 'table'. The
> > meta data are simple tokens representing a column name/value pair
> > (e.g. color$red or location$123). Each row is represented by a span
> > with a maximum token length equal to the maximum number of meta data
> > columns. If a column has multiple values, they are all indexed at the
> > same position (e.g. color$red, color$blue). All rows are added to a
> > single field. The spans are 'separated' from each other by introducing
> > a position gap between them via 'Analyzer.getPositionIncrementGap'.
> > This gap should be greater than the number of columns in each span.
> >
> > At query time, a SpanNearQuery is constructed to represent the meta
> > data to join. The 'slop' value is set to the maximum number of meta
> > data columns (minus 1). Using a simple Antlr parser, boolean span
> > queries with AND, OR, NOT can be constructed fairly easily. The
> > SpanQuery is ANDed with the main query to build the final query.
> >
> > This approach is flexible and pretty efficient because no stored
> > fields or external data are accessed at query time. Span queries are
> > more expensive than other queries, though. We measure performance via
> > throughput (as opposed to the response time for a single query), and
> > the addition of a SpanQuery reduced throughput by 5X for ordered spans
> > and 10X for unordered spans. Still, this may be acceptable for some
> > applications, especially if spans are not used on every query.
> >
> > I thought this might interest some of you.
> >
> > Peter
> >
>
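
The position layout described in the quoted message can be modeled in a few lines of plain Java. This is only a sketch of the idea, not Lucene code and not Peter's implementation: the class name SpanJoinSketch, the choice of MAX_COLUMNS = 3, the sample rows, and the simplified proximity test (largest position minus smallest position must not exceed the slop) are all assumptions made for illustration. SpanNearQuery's own slop arithmetic may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the row/position layout; hypothetical, not Lucene code. */
final class SpanJoinSketch {
    static final int MAX_COLUMNS = 3;            // assumed maximum columns per row
    static final int ROW_GAP = MAX_COLUMNS + 1;  // gap greater than the column count
    static final int SLOP = MAX_COLUMNS - 1;     // slop: maximum columns minus 1

    /**
     * Assigns token positions. Each column takes one position, multiple
     * values of a column share that position, and rows are separated by ROW_GAP.
     * Input shape: rows -> columns -> values.
     */
    static Map<String, List<Integer>> index(List<List<List<String>>> rows) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (List<List<String>> row : rows) {
            for (List<String> columnValues : row) {
                for (String token : columnValues) {
                    postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                }
                position++;
            }
            position += ROW_GAP; // push the next row out of reach of the slop window
        }
        return postings;
    }

    /** True if every token has an occurrence inside one window of width SLOP. */
    static boolean matchesSameRow(Map<String, List<Integer>> postings, String... tokens) {
        for (String anchor : tokens) {
            for (int start : postings.getOrDefault(anchor, Collections.emptyList())) {
                if (allWithin(postings, tokens, start, start + SLOP)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static boolean allWithin(Map<String, List<Integer>> postings,
                                     String[] tokens, int lo, int hi) {
        for (String token : tokens) {
            boolean found = false;
            for (int p : postings.getOrDefault(token, Collections.emptyList())) {
                if (p >= lo && p <= hi) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** One document with two metadata rows (sample data, invented for this sketch). */
    static List<List<List<String>>> sampleDocument() {
        return Arrays.asList(
            Arrays.asList(Arrays.asList("color$red", "color$blue"), Arrays.asList("location$123")),
            Arrays.asList(Arrays.asList("color$green"), Arrays.asList("location$456")));
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index(sampleDocument());
        // Both tokens come from the first row.
        System.out.println(matchesSameRow(postings, "color$red", "location$123"));
        // The tokens come from different rows, so the gap keeps them apart.
        System.out.println(matchesSameRow(postings, "color$red", "location$456"));
    }
}
```

In this model the first query matches because both tokens sit in the same row, while the second does not because the row gap is wider than the slop window. That is the 'join' behaviour the message describes: column values only combine when they belong to the same row.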