lucene-general mailing list archives

From Edoardo Marcora <edoardo.marc...@gmail.com>
Subject RE: Boolean query with 50,000 clauses! Possible? Scalable?
Date Tue, 28 Jul 2009 16:09:58 GMT


Uwe Schindler wrote:
> 
> In general, you should not think in terms of relational databases when
> looking at full text engines. In most cases you have a relational database
> somewhere in the background and replicate the information into your index.
> 
> You should index those items as "documents" that you would like to search
> for. If the user can search for books, index each book as a document
> instance. The authors attached to each book are simply additional fields
> in this book (and do not relate them, just put the authors into the docs).
> So remove all normalization and put all terms to search for in the base
> docs. Because of the inverted index, the index does not really get bigger
> because of it, because each author name (called a term) is indexed only
> once and related internally to the books.
> 
> If you want to have a search engine for the authors as well, create a
> separate index for the authors. If the user finds one author and wants to
> see all books, start a query to the book index using the author name.
> 

My use case is very similar to yours. In fact, I am also denormalizing one or
more huge scientific databases for ease of search. Also, I would like to keep
two indexes like you've indicated, one for books and one for authors. As you
suggested, author names are embedded as multivalued attributes in the book
document.

However, what you are suggesting is to query the author index (for example,
for authors weighing more than 200 lbs) and then, for each author, query the
book index with the author name. The problem is that the author query could
return tens if not hundreds of thousands of author names. It would be
unreasonable to loop through each one of them and start a query to the book
index at each step. I was wondering whether solr/lucene allows this sort
of intersection to be done at the server level, not at the client level.
What I would like to see in lucene/solr is a way to query an index with a
large array of values for a specific field (in this case something like
"book.author_name IN ([author0.name, author1.name, ..., authorN.name])"), the
author name array being the result of a "subquery/nested query" that could
return a large number of hits.
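
To make the request concrete, here is a minimal sketch of the operation I have
in mind, written against a toy in-memory inverted index in Python rather than
against lucene/solr itself. All names in it (build_inverted_index,
terms_in_query, the sample authors and books) are hypothetical and only
illustrate the idea; this is not an existing lucene/solr API.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term of a multivalued field to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc.get(field, []):
            index[term].add(doc_id)
    return index


def terms_in_query(index, terms):
    """Emulate `field IN (terms)` by unioning the postings of all terms in one call."""
    hits = set()
    for term in terms:
        # .get() avoids creating empty entries for terms that are not indexed
        hits |= index.get(term, set())
    return hits


# Hypothetical toy data standing in for the two indexes.
authors = {
    "a0": {"name": "Alice", "weight_lbs": 210},
    "a1": {"name": "Bob", "weight_lbs": 150},
    "a2": {"name": "Carol", "weight_lbs": 230},
}
books = {
    "b1": {"author_name": ["Alice", "Bob"]},
    "b2": {"author_name": ["Bob"]},
    "b3": {"author_name": ["Carol"]},
    "b4": {"author_name": ["Alice", "Carol"]},
}

book_index = build_inverted_index(books, "author_name")

# "Subquery": authors weighing more than 200 lbs.
heavy_author_names = [a["name"] for a in authors.values() if a["weight_lbs"] > 200]

# "Outer query": all books with at least one of those authors, in a single pass.
matching_books = terms_in_query(book_index, heavy_author_names)
print(sorted(matching_books))
```

The point is that the list of author names never has to travel back to the
client: the inner result feeds the outer lookup directly, and the cost is one
postings lookup per name instead of one full query round trip per name.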

Thanx for your help and consideration,

Dado
-- 
View this message in context: http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24702697.html
Sent from the Lucene - General mailing list archive at Nabble.com.

