Date: Fri, 25 Feb 2011 21:53:16 +0100
From: Frederik Kraus
To: dev@lucene.apache.org
Message-ID: <3A19624083334BE18BCC2B34C27F0202@gmail.com>
Subject: FilterQuery Performance Optimizations

Hi Guys,

while testing the performance of complex filter queries on a rather large index, I ran into a few points I'd like to share and put up for discussion.

Let's say we have the following two filter queries:

fq=someField:(123 OR 234 OR 235)
fq=someField:(234 OR 123 OR 235)

Currently the filterCache treats these as two distinct queries, even though they are logically the same. Wouldn't it make more sense to internally sort this kind of logical OR query, to reduce the number of distinct queries and at the same time increase the cache hits?

This also applies to the "AND" case (multivalue), even though you can obviously circumvent that issue by splitting:

fq=someField:(234 AND 123 AND 235)

into:

fq=someField:234&fq=someField:123&fq=someField:235

Going even one step further: might it not make sense to split up OR queries into individual filter queries (much like the AND case, but internally) and then create a UNION of the results, instead of the intersection you get with standard fq chaining?

Fred.
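
To make the two ideas above a bit more concrete, here is a rough Python sketch. It is not Solr code and makes no claim about how the filterCache is implemented. The names `canonical_fq` and `TermFilterCache` are made up for illustration, plain Python sets stand in for cached doc sets, and the `lookup` callable is a placeholder for whatever would actually resolve a (field, term) pair to matching document ids. The first function sorts the terms of a flat, single-operator filter so that reordered variants share one cache key. The class caches one entry per term and builds the OR result as a union.

```python
import re

# Matches only the flat form  field:(t1 OP t2 OP ...)  used in the examples.
_FLAT = re.compile(r"^(\w+):\((.+)\)$")
_TERM = re.compile(r"^\w+$")
_OPS = {"OR", "AND"}


def canonical_fq(fq):
    """Return an order-independent form of a flat OR/AND filter string.

    Anything that is not a flat list of simple terms joined by a single
    operator is returned unchanged (hypothetical helper, not a Solr API).
    """
    m = _FLAT.match(fq.strip())
    if m is None:
        return fq
    field, body = m.group(1), m.group(2)
    tokens = body.split()
    terms, ops = tokens[0::2], set(tokens[1::2])
    # Need term/op/term/... with exactly one kind of operator.
    if len(tokens) % 2 == 0 or len(ops) != 1 or not ops <= _OPS:
        return fq
    if any(t in _OPS or not _TERM.match(t) for t in terms):
        return fq
    separator = " " + ops.pop() + " "
    return field + ":(" + separator.join(sorted(set(terms))) + ")"


class TermFilterCache:
    """Caches one doc-id set per (field, term) and unions them for OR."""

    def __init__(self, lookup):
        self._lookup = lookup  # assumed: (field, term) -> iterable of doc ids
        self._cache = {}
        self.misses = 0

    def docs(self, field, term):
        key = (field, term)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = frozenset(self._lookup(field, term))
        return self._cache[key]

    def union(self, field, terms):
        result = set()
        for term in terms:
            result |= self.docs(field, term)
        return result
```

With this, both example queries canonicalize to `someField:(123 OR 234 OR 235)`. With per-term caching, a later OR over any subset of already-seen terms would be answered from existing entries, at the cost of computing the union on every request.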