Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Reply-To: general@lucene.apache.org
From: "Uwe Schindler"
References: <24664839.post@talk.nabble.com> <4A6CA0C5.2070209@ice-sa.com>
 <24670355.post@talk.nabble.com> <24671050.post@talk.nabble.com> <24701672.post@talk.nabble.com>
Subject: RE: Boolean query with 50,000 clauses! Possible? Scalable?
Date: Tue, 28 Jul 2009 17:46:41 +0200
Message-ID: <44FC07357EE24E8DB856430E5829A63C@VEGA>
In-Reply-To: <24701672.post@talk.nabble.com>

And now the other part; sorry, I did not read it carefully the first time:

> I have two sets of documents indexed. The two sets are distinguished by a
> "kind" attribute/field. For example, 'Book' and 'Author'. Books are
> uniquely
> identified by a "ISBN" attribute/field (PK). Authors contains an "Authored
> book" attribute/field containing an ID matching one of the Book.ISBN
> values
> (FK).

In general, you should not think in terms of relational databases when
working with full-text engines. In most cases you have a relational database
somewhere in the background and replicate its information into your index.
You should index as "documents" the items that you would like to search for.
If the user can search for books, index each book as a document instance.
The authors attached to each book are simply additional fields of that book
(do not model them as a relation, just put the authors into the docs). So
remove all normalization and put all terms to search for into the base docs.
Thanks to the inverted index, the index does not really get bigger from this,
because each author name (called a term) is indexed only once and linked
internally to the books.

If you also want a search engine for the authors, create a separate index
for the authors. If the user finds an author and wants to see all of that
author's books, run a query against the book index using the author name.
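
The denormalized layout described above can be sketched roughly as follows.
This is only a toy in-memory inverted index in Python, not Lucene and not
its API; the sample books and the names build_book_index and search are
made up for illustration. It shows each author name being stored once as a
term that points to every book carrying it, and the "all books by this
author" lookup being answered from the book index alone, without a join.

```python
from collections import defaultdict

# Hypothetical sample data; in practice it would be replicated from the
# relational database that holds the normalized book/author tables.
BOOKS = [
    {"isbn": "111", "title": "Search Basics", "authors": ["Ann Lee", "Bob Ray"]},
    {"isbn": "222", "title": "Index Design", "authors": ["Ann Lee"]},
    {"isbn": "333", "title": "Query Parsing", "authors": ["Cy Young"]},
]


def build_book_index(books):
    """Build a toy inverted index: (field, term) -> set of ISBNs.

    Authors are copied into each book document (no normalization), yet
    every distinct author name exists only once as a key in the index.
    """
    index = defaultdict(set)
    for book in books:
        for author in book["authors"]:
            index[("author", author.lower())].add(book["isbn"])
        for word in book["title"].lower().split():
            index[("title", word)].add(book["isbn"])
    return index


def search(index, field, term):
    """Return the sorted ISBNs of all books whose field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


book_index = build_book_index(BOOKS)

# "Show all books by this author" is a single term lookup in the book index.
print(search(book_index, "author", "Ann Lee"))  # ['111', '222']
```

In a real Lucene setup the same shape would apply: one document per book,
with the author names added as extra fields on that document.
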

We use a similar technique to replicate our complete, fully normalized
metadata database into a Lucene index (www.pangaea.de). The user wants to
search for scientific datasets, so the entity in the index is a scientific
dataset. All author names and other info such as affiliation are indexed
repeatedly, once for each dataset they belong to. The only problem is
rebuilding the index correctly: when, e.g., one author changes, you may have
to reindex hundreds of datasets (books in your case) because the author name
has changed. We have special replication software for that, which queues an
update for all related table rows on any update.

Uwe

> Grant Ingersoll-6 wrote:
> >
> > This strikes me as an example of
> > http://people.apache.org/~hossman/#xyproblem
> > Namely, you've declared the solution you would like, but haven't
> > told us the problem.
> >
> > I highly doubt that double loop is going to scale. It wouldn't scale
> > in a database, either, so it makes me think we need to take a step
> > back and ask a bit more about the problem you are trying to solve and
> > not the solution. Can you share more details about it?
> >
> > On Jul 26, 2009, at 6:14 PM, Edoardo Marcora wrote:
> >
> >>
> >> type:foo and type:bar are fields used to represent documents of
> >> different
> >> "kind" (it could be "author" and "book"). field2 and field1 contains
> >> IDs
> >> which I would like to use to join the two "kinds".
> >>
> >>
> >> Ken Krugler wrote:
> >>>
> >>>> awarnier wrote:
> >>>>>
> >>>>> Edoardo Marcora wrote:
> >>>>>> I am faced with the requirement for a boolean query composed of
> >>>>>> 50,000
> >>>>>> clauses (all of them directed at the same field) all OR'ed
> >>>>>> together.
> >>>>>>
> >>>>> By pure intellectual curiosity : can you provide some idea of the
> >>>>> type
> >>>>> of query, and the type of content of the field this is targeted
> >>>>> at ?
> >>>>> I have this notion that with 50,000 queries directed at one field,
> >>>>> there
> >>>>> must be some smarter way of handling this than just OR-ing
> >>>>> together the
> >>>>> results.
> >>>>>
> >>>>>
> >>>>
> >>>> What I would like to do is to take the results of one query and
> >>>> use one of
> >>>> its fields (not the docid) as an argument to another query (much
> >>>> like a
> >>>> subquery in SQL). For example:
> >>>>
> >>>> type:foo AND (_query_:type:bar AND field2:{field1})
> >>>>
> >>>> This should search for all types of foo and then iterate over the
> >>>> result set
> >>>> and perform a query for where type is bar and field2 is equal to
> >>>> the value
> >>>> of field1 from each item of the first result set.
> >>>
> >>> This looks like a more like this (MLT) query, where you restrict the
> >>> set to documents that have matching types...though I don't understand
> >>> the type:foo AND type:bar query, unless 'type' is a multi-value
> >>> field.
> >>>
> >>> From what I remember of using MLT support in Lucene a few years back,
> >>> this takes the terms of the target field from the target document,
> >>> tosses out stop words, and then uses some arbitrary limit (e.g. 500)
> >>> for the first N terms used to do the query.
> >>>
> >>> Unless the distribution of terms in the field is heavily skewed, this
> >>> gives you pretty good results. I supposed you could use the N most
> >>> common terms - but your stop word list isn't good, you'll get worse
> >>> results.
> >>>
> >>> In any case, preprocessing the field will speed things up, versus
> >>> doing any analysis/stop word/frequency calculations at query time.
> >>>
> >>> -- Ken
> >>> --
> >>> Ken Krugler
> >>>
> >>> +1 530-265-2225
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24671050.html
> >> Sent from the Lucene - General mailing list archive at Nabble.com.
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> > using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24701672.html
> Sent from the Lucene - General mailing list archive at Nabble.com.