Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
From: "Uwe Schindler" <uwe@thetaphi.de>
To: <general@lucene.apache.org>
References: <24664839.post@talk.nabble.com> <4A6CA0C5.2070209@ice-sa.com>
 <24670355.post@talk.nabble.com> <p0624080fc69277056b73@[192.168.1.43]>
 <24671050.post@talk.nabble.com>
 <D4FBE517-FAFF-4220-A8AD-7CE72C4449A1@apache.org>
 <24701672.post@talk.nabble.com>
Subject: RE: Boolean query with 50,000 clauses! Possible? Scalable?
Date: Tue, 28 Jul 2009 17:46:41 +0200
Message-ID: <44FC07357EE24E8DB856430E5829A63C@VEGA>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <24701672.post@talk.nabble.com>
Thread-Index: AcoPl31KYLmLVVBATWeeGVmgfOEbuAAAeyUg

And now to the other part; sorry, I did not read it carefully the first time:

> I have two sets of documents indexed. The two sets are distinguished by a
> "kind" attribute/field. For example, 'Book' and 'Author'. Books are
> uniquely
> identified by a "ISBN" attribute/field (PK). Authors contains an "Authored
> book" attribute/field containing an ID matching one of the Book.ISBN
> values
> (FK).

In general, you should not think in terms of relational databases when
looking at full-text engines. In most cases you have a relational database
somewhere in the background and replicate its information into your index.

You should index as "documents" those items that you would like to search
for. If the user can search for books, index each book as one document.
The authors attached to each book are simply additional fields in that
book document (do not model them as a relation, just put the author names
into the docs). So remove all normalization and put all terms to search for
into the base docs. Because of the inverted index, the index does not really
get bigger from this: each author name (called a term) is indexed only once
and is linked internally to the books that contain it.
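
To illustrate the point about the inverted index, here is a rough sketch in
plain Python. It is not Lucene code; the sample books, the field names and
the helper build_inverted_index are all made up for the example. It only
shows that a repeated author name ends up as one term with a list of
matching book documents.

```python
from collections import defaultdict

# Hypothetical sample data: each book is one flat "document".
# Author names are copied into the book instead of being joined via a key.
books = [
    {"isbn": "111", "title": "Search Basics", "authors": ["Ann Lee", "Bob Ray"]},
    {"isbn": "222", "title": "Index Design", "authors": ["Ann Lee"]},
    {"isbn": "333", "title": "Query Tuning", "authors": ["Cy Poe"]},
]

def build_inverted_index(docs):
    """Map (field, term) -> sorted list of document numbers (a posting list)."""
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        postings[("isbn", doc["isbn"])].add(doc_id)
        for author in doc["authors"]:
            postings[("author", author)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(books)

# "Ann Lee" appears in two books but is stored as a single term.
print(index[("author", "Ann Lee")])
```

A search for an author is then a single term lookup, with no join and no
large OR query over IDs.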

If you also want a search engine for the authors, create a separate
index for the authors. If the user finds an author and wants to see all of
that author's books, run a query against the book index using the author
name.
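
The two-step lookup could look roughly like the following sketch. Again this
is plain Python with invented data, and search_authors and books_by_author
are hypothetical helpers standing in for queries against two separate
indexes; a real setup would run two Lucene searches instead.

```python
# Hypothetical sample data for two separate indexes.
author_docs = [
    {"name": "Ann Lee", "affiliation": "Uni A"},
    {"name": "Bob Ray", "affiliation": "Uni B"},
]
book_docs = [
    {"isbn": "111", "title": "Search Basics", "authors": ["Ann Lee", "Bob Ray"]},
    {"isbn": "222", "title": "Index Design", "authors": ["Ann Lee"]},
]

def search_authors(text):
    """Author index: match one word against the name or affiliation."""
    word = text.lower()
    return [a["name"] for a in author_docs
            if word in (a["name"] + " " + a["affiliation"]).lower().split()]

def books_by_author(name):
    """Book index: exact match on the denormalized author field."""
    return [b["title"] for b in book_docs if name in b["authors"]]

hits = search_authors("lee")        # step 1: the user finds an author
titles = books_by_author(hits[0])   # step 2: follow-up query by author name
print(hits, titles)
```

The second query has exactly one clause (the author name), regardless of how
many books that author has.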

We use a similar technique to replicate our complete, fully normalized
metadata database into a Lucene index (www.pangaea.de). The user wants to
search for scientific datasets, so the entity in the index is a scientific
dataset. All author names and other info like affiliation are indexed
again for each dataset they belong to. The only problem is rebuilding
the index correctly when, e.g., one author changes: you may have to reindex
hundreds of datasets (books in your case) because the author name has
changed. We have special replication software for that, which queues an
update for all related table rows on any update.
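
The queueing idea can be sketched as below. This is only an illustration
under assumed table layouts; it is not our replication software, and the
names authors, dataset_authors, rename_author and flatten are invented for
the example.

```python
from collections import deque

# Hypothetical normalized tables.
authors = {1: "Ann Lee", 2: "Bob Ray"}
dataset_authors = {"ds-a": [1, 2], "ds-b": [1], "ds-c": [2]}

def flatten(dataset_id):
    """Build the denormalized index document for one dataset."""
    names = [authors[a] for a in dataset_authors[dataset_id]]
    return {"id": dataset_id, "authors": names}

def rename_author(author_id, new_name, queue):
    """Change an author row and queue every dataset that references it."""
    authors[author_id] = new_name
    for dataset_id in sorted(dataset_authors):
        if author_id in dataset_authors[dataset_id] and dataset_id not in queue:
            queue.append(dataset_id)

# Initial full build of the (toy) index.
index = {ds: flatten(ds) for ds in dataset_authors}

reindex_queue = deque()
rename_author(1, "Ann L. Lee", reindex_queue)
queued = list(reindex_queue)        # datasets affected by the change

# Drain the queue: only the affected datasets are rebuilt.
while reindex_queue:
    ds = reindex_queue.popleft()
    index[ds] = flatten(ds)
```

Only the datasets that reference the changed author are rebuilt; everything
else in the index stays untouched.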

Uwe

> Grant Ingersoll-6 wrote:
> >
> > This strikes me as an example of
> > http://people.apache.org/~hossman/#xyproblem
> >    Namely, you've declared the solution you would like, but haven't
> > told us the problem.
> >
> > I highly doubt that double loop is going to scale.  It wouldn't scale
> > in a database, either, so it makes me think we need to take a step
> > back and ask a bit more about the problem you are trying to solve and
> > not the solution.  Can you share more details about it?
> >
> > On Jul 26, 2009, at 6:14 PM, Edoardo Marcora wrote:
> >
> >>
> >> type:foo and type:bar are fields used to represent documents of
> >> different
> >> "kind" (it could be "author" and "book"). field2 and field1 contains
> >> IDs
> >> which I would like to use to join the two "kinds".
> >>
> >>
> >> Ken Krugler wrote:
> >>>
> >>>> awarnier wrote:
> >>>>>
> >>>>> Edoardo Marcora wrote:
> >>>>>> I am faced with the requirement for a boolean query composed of
> >>>>>> 50,000
> >>>>>> clauses (all of them directed at the same field) all OR'ed
> >>>>>> together.
> >>>>>>
> >>>>> By pure intellectual curiosity : can you provide some idea of the
> >>>>> type
> >>>>> of query, and the type of content of the field this is targeted
> >>>>> at ?
> >>>>> I have this notion that with 50,000 queries directed at one field,
> >>>>> there
> >>>>> must be some smarter way of handling this than just OR-ing
> >>>>> together the
> >>>>> results.
> >>>>>
> >>>>>
> >>>>
> >>>> What I would like to do is to take the results of one query and
> >>>> use one of
> >>>> its fields (not the docid) as an argument to another query (much
> >>>> like a
> >>>> subquery in SQL). For example:
> >>>>
> >>>> type:foo AND (_query_:type:bar AND field2:{field1})
> >>>>
> >>>> This should search for all types of foo and then iterate over the
> >>>> result
> >> set
> >>>> and perform a query for where type is bar and field2 is equal to
> >>>> the value
> >>>> of field1 from each item of the first result set.
> >>>
> >>> This looks like a more like this (MLT) query, where you restrict the
> >>> set to documents that have matching types...though I don't understand
> >>> the type:foo AND type:bar query, unless 'type' is a multi-value
> >>> field.
> >>>
> >>> From what I remember of using MLT support in Lucene a few years back,
> >>> this takes the terms of the target field from the target document,
> >>> tosses out stop words, and then uses some arbitrary limit (e.g. 500)
> >>> for the first N terms used to do the query.
> >>>
> >>> Unless the distribution of terms in the field is heavily skewed, this
> >>> gives you pretty good results. I supposed you could use the N most
> >>> common terms - but your stop word list isn't good, you'll get worse
> >>> results.
> >>>
> >>> In any case, preprocessing the field will speed things up, versus
> >>> doing any analysis/stop word/frequency calculations at query time.
> >>>
> >>> -- Ken
> >>> --
> >>> Ken Krugler
> >>> <http://ken-blog.krugler.org>
> >>> +1 530-265-2225
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24671050.html
> >> Sent from the Lucene - General mailing list archive at Nabble.com.
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> > using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
> >
> 
> --
> View this message in context:
> http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24701672.html
> Sent from the Lucene - General mailing list archive at Nabble.com.