lucene-general mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Boolean query with 50,000 clauses! Possible? Scalable?
Date Mon, 27 Jul 2009 11:29:49 GMT
This strikes me as an example of the XY problem
(http://people.apache.org/~hossman/#xyproblem): you've declared the
solution you would like, but haven't told us the problem.

I highly doubt that double loop is going to scale.  It wouldn't scale  
in a database, either, so it makes me think we need to take a step  
back and ask a bit more about the problem you are trying to solve and  
not the solution.  Can you share more details about it?
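
To make the scaling concern concrete, here is a minimal sketch of the double loop described in the quoted message, next to a set-based pass swapped in for comparison. It runs on a toy in-memory list of dicts, not on Lucene; the field names (type, field1, field2) come from the thread, while the sample data and both function names are made up for illustration.

```python
# Toy in-memory "index": each document is a dict of field -> value.
# Field names mirror the thread; the documents themselves are invented.
docs = [
    {"id": 1, "type": "foo", "field1": "a1"},
    {"id": 2, "type": "foo", "field1": "a2"},
    {"id": 3, "type": "bar", "field2": "a1"},
    {"id": 4, "type": "bar", "field2": "a3"},
    {"id": 5, "type": "bar", "field2": "a2"},
]


def nested_loop_join(docs):
    """One inner scan per outer hit: O(|foo| * |bar|) comparisons."""
    matches = set()
    for outer in docs:
        if outer["type"] != "foo":
            continue
        # This inner scan stands in for "run a second query per result".
        for inner in docs:
            if inner["type"] == "bar" and inner["field2"] == outer["field1"]:
                matches.add(inner["id"])
    return sorted(matches)


def two_pass_join(docs):
    """Collect the join keys once, then make a single pass with set lookups."""
    keys = {d["field1"] for d in docs if d["type"] == "foo"}
    return sorted(
        d["id"] for d in docs if d["type"] == "bar" and d["field2"] in keys
    )


print(nested_loop_join(docs))
print(two_pass_join(docs))
```

Both functions return the same ids. The `keys` set in the second version is exactly what turns into the 50,000 OR'ed clauses once it is expressed as a single boolean query, which is why knowing the underlying problem matters before choosing either shape.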

On Jul 26, 2009, at 6:14 PM, Edoardo Marcora wrote:

>
> type:foo and type:bar are field values used to mark documents of
> different "kinds" (it could be "author" and "book"). field2 and field1
> contain IDs which I would like to use to join the two "kinds".
>
>
> Ken Krugler wrote:
>>
>>> awarnier wrote:
>>>>
>>>> Edoardo Marcora wrote:
>>>>> I am faced with the requirement for a boolean query composed of
>>>>> 50,000 clauses (all of them directed at the same field), all OR'ed
>>>>> together.
>>>>>
>>>> Out of pure intellectual curiosity: can you give some idea of the
>>>> type of query, and the type of content of the field it targets?
>>>> I have this notion that with 50,000 queries directed at one field,
>>>> there must be some smarter way of handling this than just OR-ing
>>>> together the results.
>>>>
>>>>
>>>
>>> What I would like to do is to take the results of one query and use
>>> one of its fields (not the docid) as an argument to another query
>>> (much like a subquery in SQL). For example:
>>>
>>> type:foo AND (_query_:type:bar AND field2:{field1})
>>>
>>> This should search for all documents of type foo, then iterate over
>>> the result set and, for each item, perform a query where type is bar
>>> and field2 is equal to that item's field1 value.
>>
>> This looks like a More Like This (MLT) query, where you restrict the
>> set to documents that have matching types...though I don't understand
>> the type:foo AND type:bar query, unless 'type' is a multi-value
>> field.
>>
>> From what I remember of using MLT support in Lucene a few years back,
>> this takes the terms of the target field from the target document,
>> tosses out stop words, and then uses some arbitrary limit (e.g. 500)
>> for the first N terms used to do the query.
>>
>> Unless the distribution of terms in the field is heavily skewed, this
>> gives you pretty good results. I suppose you could use the N most
>> common terms, but if your stop word list isn't good, you'll get worse
>> results.
>>
>> In any case, preprocessing the field will speed things up, versus
>> doing any analysis/stop word/frequency calculations at query time.
>>
>> -- Ken
>> -- 
>> Ken Krugler
>> <http://ken-blog.krugler.org>
>> +1 530-265-2225
>>
>
> -- 
> View this message in context: http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24671050.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>
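
The MLT-style term selection Ken recalls in the quoted text (take the field's terms, toss stop words, keep the first N) can be sketched roughly as below. This is not Lucene's MoreLikeThis implementation: the function name, the whitespace tokenization, the lowercasing, and the de-duplication are all assumptions made for illustration. Only the stop-word filter and the cap of 500 come from the thread.

```python
def select_query_terms(field_text, stop_words, max_terms=500):
    """Return up to max_terms distinct non-stop-word terms, in field order."""
    terms = []
    seen = set()
    # Assumption: naive lowercase + whitespace tokenization, not a real analyzer.
    for token in field_text.lower().split():
        if len(terms) >= max_terms:
            break  # the arbitrary cap on query size mentioned in the thread
        if token in stop_words or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


stop_words = {"the", "and"}
print(select_query_terms("the quick fox and the lazy dog", stop_words, max_terms=3))
```

Running this once per document at index time, rather than per query, is one way to read the suggestion about preprocessing the field.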

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

