From: Edoardo Marcora
To: general@lucene.apache.org
Date: Sun, 26 Jul 2009 15:14:34 -0700 (PDT)
Message-ID: <24671050.post@talk.nabble.com>
Subject: Re: Boolean query with 50,000 clauses! Possible? Scalable?
References: <24664839.post@talk.nabble.com> <4A6CA0C5.2070209@ice-sa.com> <24670355.post@talk.nabble.com>

type:foo and type:bar select documents of different "kinds" (for example,
"author" and "book"). field1 and field2 contain IDs, which I would like to
use to join the two kinds.


Ken Krugler wrote:
> 
>>awarnier wrote:
>>> 
>>> Edoardo Marcora wrote:
>>>> I am faced with the requirement for a boolean query composed of 50,000
>>>> clauses (all of them directed at the same field) all OR'ed together.
>>>> 
>>> Out of pure intellectual curiosity: can you provide some idea of the type
>>> of query, and the type of content of the field this is targeted at?
>>> I have this notion that with 50,000 queries directed at one field, there
>>> must be some smarter way of handling this than just OR-ing together the
>>> results.
>>> 
>>> 
>> 
>>What I would like to do is to take the results of one query and use one of
>>its fields (not the docid) as an argument to another query (much like a
>>subquery in SQL). For example:
>> 
>>type:foo AND (_query_:type:bar AND field2:{field1})
>> 
>>This should search for all documents of type foo, then iterate over the
>>result set and, for each item, query for documents where type is bar and
>>field2 is equal to that item's field1 value.
> 
> This looks like a "more like this" (MLT) query, where you restrict the
> set to documents that have matching types...though I don't understand
> the type:foo AND type:bar query, unless 'type' is a multi-value field.
> 
> From what I remember of using MLT support in Lucene a few years back,
> this takes the terms of the target field from the target document,
> tosses out stop words, and then uses some arbitrary limit (e.g. 500)
> for the first N terms used to do the query.
> 
> Unless the distribution of terms in the field is heavily skewed, this
> gives you pretty good results. I suppose you could use the N most
> common terms - but if your stop word list isn't good, you'll get worse
> results.
> 
> In any case, preprocessing the field will speed things up, versus
> doing any analysis/stop word/frequency calculations at query time.
> 
> -- Ken
> -- 
> Ken Krugler
> 
> +1 530-265-2225
> 
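
To make the intended join semantics concrete, here is a minimal sketch in
plain Python over an in-memory list of documents. It is not Lucene or Solr
code: the sample documents and the helpers search and join_foo_to_bar are
hypothetical names made up for illustration, and the field2:{field1}
notation in the example query above is pseudo-syntax, not something the
query parser is known to accept.

```python
# Toy in-memory "index": each document is a dict of field -> value.
# All names and data here are assumptions made up for illustration.
DOCS = [
    {"id": 1, "type": "foo", "field1": "a1"},
    {"id": 2, "type": "foo", "field1": "a2"},
    {"id": 3, "type": "bar", "field2": "a1"},
    {"id": 4, "type": "bar", "field2": "a9"},
    {"id": 5, "type": "bar", "field2": "a2"},
]


def search(docs, **criteria):
    """Return the docs whose fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def join_foo_to_bar(docs):
    """Two-pass join: type:foo docs supply field1 values; type:bar docs
    are kept when their field2 is one of those values."""
    # Pass 1: collect the distinct join keys from the first result set.
    keys = {d["field1"] for d in search(docs, type="foo") if "field1" in d}
    # Pass 2: a single set-membership test per candidate document stands
    # in for one OR clause per key.
    return [d for d in search(docs, type="bar") if d.get("field2") in keys]


if __name__ == "__main__":
    print([d["id"] for d in join_foo_to_bar(DOCS)])  # prints [3, 5]
```

The set built in the first pass is what becomes the 50,000 OR'ed clauses
when the same join is expressed as one boolean query: one clause per
distinct field1 value.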