lucene-general mailing list archives

From Edoardo Marcora <edoardo.marc...@gmail.com>
Subject Re: Boolean query with 50,000 clauses! Possible? Scalable?
Date Tue, 28 Jul 2009 15:23:33 GMT

I have two sets of documents indexed. The two sets are distinguished by a
"kind" attribute/field, for example 'Book' and 'Author'. Books are uniquely
identified by an "ISBN" attribute/field (PK). Authors contain an "Authored
book" attribute/field holding an ID that matches one of the Book.ISBN values
(FK).

Author documents also have other attributes, for example "Weight". I want a
query that returns every Book document authored by people weighing more than
200 lbs, with the ability to do faceting and the like.
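
To make the requirement concrete, here is a minimal sketch of the two-pass
join, written in plain Python over an in-memory list rather than against
Lucene itself. Everything in it is an assumption for illustration: the
lowercase field names, the sample documents, and the helpers search(),
batches() and books_by_heavy_authors() are hypothetical. The batch size of
1024 mirrors what I understand to be the default BooleanQuery clause limit,
so the second pass is split into several smaller OR queries instead of one
query with 50,000 clauses.

```python
from itertools import islice

# Hypothetical in-memory stand-in for the index: each dict is one document.
DOCS = [
    {"kind": "Book", "ISBN": "111", "title": "A"},
    {"kind": "Book", "ISBN": "222", "title": "B"},
    {"kind": "Book", "ISBN": "333", "title": "C"},
    {"kind": "Author", "name": "X", "weight": 250, "authored_book": "111"},
    {"kind": "Author", "name": "Y", "weight": 150, "authored_book": "222"},
    {"kind": "Author", "name": "Z", "weight": 210, "authored_book": "333"},
]


def search(docs, predicate):
    """Stand-in for running one query: return the matching documents."""
    return [d for d in docs if predicate(d)]


def batches(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def books_by_heavy_authors(docs, min_weight, max_clauses=1024):
    # Pass 1: select the Author documents and collect their foreign keys.
    authors = search(
        docs, lambda d: d["kind"] == "Author" and d["weight"] > min_weight
    )
    isbns = sorted({a["authored_book"] for a in authors})

    # Pass 2: look up Books by ISBN, one bounded batch of OR'ed IDs at a time.
    books = []
    for chunk in batches(isbns, max_clauses):
        wanted = set(chunk)
        books.extend(
            search(docs, lambda d: d["kind"] == "Book" and d["ISBN"] in wanted)
        )
    return books
```

Calling books_by_heavy_authors(DOCS, 200) returns the Book documents whose
ISBN appears on an Author heavier than 200. Faceting would then have to be
computed over the merged results of the second pass, which is part of why I
am asking whether a single query can do this.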



Grant Ingersoll-6 wrote:
> 
> This strikes me as an example of
> http://people.apache.org/~hossman/#xyproblem 
>    Namely, you've declared the solution you would like, but haven't  
> told us the problem.
> 
> I highly doubt that double loop is going to scale.  It wouldn't scale  
> in a database, either, so it makes me think we need to take a step  
> back and ask a bit more about the problem you are trying to solve and  
> not the solution.  Can you share more details about it?
> 
> On Jul 26, 2009, at 6:14 PM, Edoardo Marcora wrote:
> 
>>
>> type:foo and type:bar are fields used to represent documents of
>> different "kind" (it could be "author" and "book"). field2 and field1
>> contain IDs which I would like to use to join the two "kinds".
>>
>>
>> Ken Krugler wrote:
>>>
>>>> awarnier wrote:
>>>>>
>>>>> Edoardo Marcora wrote:
>>>>>> I am faced with the requirement for a boolean query composed of
>>>>>> 50,000 clauses (all of them directed at the same field) all OR'ed
>>>>>> together.
>>>>>>
>>>>> By pure intellectual curiosity : can you provide some idea of the  
>>>>> type
>>>>> of query, and the type of content of the field this is targeted  
>>>>> at ?
>>>>> I have this notion that with 50,000 queries directed at one field,
>>>>> there
>>>>> must be some smarter way of handling this than just OR-ing  
>>>>> together the
>>>>> results.
>>>>>
>>>>>
>>>>
>>>> What I would like to do is to take the results of one query and  
>>>> use one of
>>>> its fields (not the docid) as an argument to another query (much  
>>>> like a
>>>> subquery in SQL). For example:
>>>>
>>>> type:foo AND (_query_:type:bar AND field2:{field1})
>>>>
>>>> This should search for all types of foo and then iterate over the
>>>> result set and perform a query for where type is bar and field2 is
>>>> equal to the value of field1 from each item of the first result set.
>>>
>>> This looks like a more like this (MLT) query, where you restrict the
>>> set to documents that have matching types...though I don't understand
>>> the type:foo AND type:bar query, unless 'type' is a multi-value  
>>> field.
>>>
>>> From what I remember of using MLT support in Lucene a few years back,
>>> this takes the terms of the target field from the target document,
>>> tosses out stop words, and then uses some arbitrary limit (e.g. 500)
>>> for the first N terms used to do the query.
>>>
>>> Unless the distribution of terms in the field is heavily skewed, this
>>> gives you pretty good results. I suppose you could use the N most
>>> common terms - but if your stop word list isn't good, you'll get worse
>>> results.
>>>
>>> In any case, preprocessing the field will speed things up, versus
>>> doing any analysis/stop word/frequency calculations at query time.
>>>
>>> -- Ken
>>> -- 
>>> Ken Krugler
>>> <http://ken-blog.krugler.org>
>>> +1 530-265-2225
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24671050.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24701672.html
Sent from the Lucene - General mailing list archive at Nabble.com.

