From: "Peter Keegan"
To: java-user
Subject: SpanQuery and database join
Date: Mon, 13 Aug 2007 14:33:37 -0400

I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains one or more 'rows' of metadata from another 'table'. The metadata are simple tokens, each representing a column name/value pair (e.g. color$red or location$123). Each row is represented by a span whose maximum length in tokens equals the maximum number of metadata columns. If a column has multiple values, they are all indexed at the same position (e.g. color$red, color$blue).

All rows are added to a single field. The spans are 'separated' from each other by introducing a position gap between them via 'Analyzer.getPositionIncrementGap'. This gap should be greater than the number of columns in each span, so that no match can straddle two rows.

At query time, a SpanNearQuery is constructed to represent the metadata to join. The 'slop' value is set to the maximum number of metadata columns minus 1. Using a simple Antlr parser, boolean span queries with AND, OR and NOT can be constructed fairly easily. The SpanQuery is ANDed with the main query to build the final query.

This approach is flexible and pretty efficient because no stored fields or external data are accessed at query time.
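The position arithmetic behind this layout can be sketched in plain Java. This is a simplified model, not Lucene code. The class SpanJoinSketch, its methods, and the extra values color$green and location$456 are made up for illustration. Rows are laid out with a fixed gap, standing in for what Analyzer.getPositionIncrementGap provides, and matchesWithin is a simple window check standing in for an unordered SpanNearQuery, whose real slop accounting differs in detail. The example assumes two columns per row, so the slop is 1 and the gap is 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the row/position layout; hypothetical, not Lucene code. */
class SpanJoinSketch {
    /** token -> positions at which it occurs in the single metadata field. */
    final Map<String, List<Integer>> postings = new HashMap<>();
    private final int gap;
    private int nextPosition = 0;

    SpanJoinSketch(int gap) {
        this.gap = gap;
    }

    /** Adds one row. Each inner list holds all values of one column; they share a position. */
    void addRow(List<List<String>> columns) {
        for (List<String> values : columns) {
            for (String token : values) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(nextPosition);
            }
            nextPosition++; // one position per column
        }
        nextPosition += gap; // hole between rows, so the next row starts well clear of this one
    }

    /** True if some window of `slop` positions contains at least one occurrence of every token. */
    boolean matchesWithin(int slop, String... tokens) {
        for (String anchor : tokens) {
            for (int start : postings.getOrDefault(anchor, Collections.emptyList())) {
                if (allInWindow(start, start + slop, tokens)) {
                    return true;
                }
            }
        }
        return false;
    }

    private boolean allInWindow(int lo, int hi, String... tokens) {
        for (String token : tokens) {
            boolean found = false;
            for (int p : postings.getOrDefault(token, Collections.emptyList())) {
                if (p >= lo && p <= hi) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** One document with two rows of two columns each (color, location). */
    static SpanJoinSketch demo() {
        SpanJoinSketch doc = new SpanJoinSketch(3); // gap 3 > 2 columns
        doc.addRow(Arrays.asList(
                Arrays.asList("color$red", "color$blue"), // multi-valued column, same position
                Arrays.asList("location$123")));
        doc.addRow(Arrays.asList(
                Arrays.asList("color$green"),
                Arrays.asList("location$456")));
        return doc;
    }

    public static void main(String[] args) {
        SpanJoinSketch doc = demo();
        int slop = 1; // max columns (2) minus 1
        System.out.println(doc.matchesWithin(slop, "color$red", "location$123")); // same row: true
        System.out.println(doc.matchesWithin(slop, "color$red", "location$456")); // different rows: false
    }
}
```

In this model the first row occupies positions 0 and 1 and the second row starts at position 5, so tokens from different rows are always further apart than the slop. That is why color$red with location$456 does not match even though both tokens occur in the same document.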
Span queries are more expensive than other queries, though. We measure performance via throughput (as opposed to the response time for a single query), and the addition of a SpanQuery reduced throughput by 5X for ordered spans and 10X for unordered spans. Still, this may be acceptable for some applications, especially if spans are not used on every query.

I thought this might interest some of you.

Peter