lucene-general mailing list archives

From Edoardo Marcora <edoardo.marc...@gmail.com>
Subject Re: Boolean query with 50,000 clauses! Possible? Scalable?
Date Sun, 26 Jul 2009 22:14:34 GMT

type:foo and type:bar select documents of different "kinds" (for example,
"author" and "book"). field1 and field2 contain IDs, which I would like to
use to join the two kinds.
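
To make the intended join concrete, here is a minimal two-pass sketch in
Python over a toy in-memory list of documents. It is not Lucene code: the
search() helper, the sample data, and the chunk_size parameter are
illustrative assumptions. The default of 1024 only mirrors Lucene's default
BooleanQuery clause limit, which is why the keys are OR'ed in batches.

```python
# Toy in-memory "index": each document is a dict of field -> value.
# Field names follow the example query; the data itself is made up.
DOCS = [
    {"id": 1, "type": "foo", "field1": "A"},
    {"id": 2, "type": "foo", "field1": "B"},
    {"id": 3, "type": "bar", "field2": "A"},
    {"id": 4, "type": "bar", "field2": "C"},
    {"id": 5, "type": "bar", "field2": "B"},
]


def search(docs, **criteria):
    """Hypothetical stand-in for a term query: exact match on every field given."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def join(docs, chunk_size=1024):
    """Two-pass join: collect field1 from type:foo, then match type:bar on field2."""
    # Pass 1: run the outer query and keep only the distinct join keys.
    keys = sorted({d["field1"] for d in search(docs, type="foo")})

    # Pass 2: OR the keys together in batches so that no single query
    # carries more than chunk_size clauses.
    candidates = search(docs, type="bar")
    hits = []
    for start in range(0, len(keys), chunk_size):
        batch = set(keys[start:start + chunk_size])
        hits.extend(d for d in candidates if d.get("field2") in batch)
    return hits
```

With 50,000 distinct keys and the default batch size this would issue about
49 inner queries instead of one 50,000-clause query.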


Ken Krugler wrote:
> 
>>awarnier wrote:
>>>
>>>  Edoardo Marcora wrote:
>>>>  I am faced with the requirement for a boolean query composed of 50,000
>>>>  clauses (all of them directed at the same field) all OR'ed together.
>>>>
>>>  Out of pure intellectual curiosity: can you provide some idea of the
>>>  type of query, and the type of content of the field this is targeted at?
>>>  I have this notion that with 50,000 queries directed at one field, there
>>>  must be some smarter way of handling this than just OR-ing together the
>>>  results.
>>>
>>>
>>
>>What I would like to do is to take the results of one query and use one of
>>its fields (not the docid) as an argument to another query (much like a
>>subquery in SQL). For example:
>>
>>type:foo AND (_query_:type:bar AND field2:{field1})
>>
>>This should search for all documents of type foo, then iterate over the
>>result set and, for each item, query for documents where type is bar and
>>field2 equals that item's field1 value.
> 
> This looks like a MoreLikeThis (MLT) query, where you restrict the 
> set to documents that have matching types, though I don't understand 
> the type:foo AND type:bar query unless 'type' is a multi-valued field.
> 
> From what I remember of using MLT support in Lucene a few years back, 
> this takes the terms of the target field from the target document, 
> tosses out stop words, and then uses some arbitrary limit (e.g. 500) 
> for the first N terms used to do the query.
> 
> Unless the distribution of terms in the field is heavily skewed, this 
> gives you pretty good results. I suppose you could use the N most 
> common terms, but if your stop word list isn't good, you'll get worse 
> results.
> 
> In any case, preprocessing the field will speed things up, versus 
> doing any analysis/stop word/frequency calculations at query time.
> 
> -- Ken
> -- 
> Ken Krugler
> <http://ken-blog.krugler.org>
> +1 530-265-2225
> 
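
The term selection Ken recalls (take the field's terms, drop stop words, cap
at the first N) can be sketched as below, alongside the "N most common terms"
variant. This is a rough illustration in Python, not Lucene's MoreLikeThis
implementation: the whitespace tokenizer, the STOP_WORDS set, and both
function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop word list; a real one would be tuned to the corpus.
STOP_WORDS = frozenset({"the", "a", "of", "and"})


def tokens(field_text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in field_text.lower().split() if t not in stop_words]


def select_terms(field_text, max_terms=500, stop_words=STOP_WORDS):
    """First max_terms distinct terms, in order of first appearance."""
    unique = list(dict.fromkeys(tokens(field_text, stop_words)))
    return unique[:max_terms]


def most_common_terms(field_text, n=500, stop_words=STOP_WORDS):
    """Variant: the n most frequent terms instead of the first n."""
    counts = Counter(tokens(field_text, stop_words))
    return [term for term, _ in counts.most_common(n)]
```

Running either function once per document at index time and storing the
result corresponds to the preprocessing Ken suggests, so that no stop word
or frequency work is left for query time.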

-- 
View this message in context: http://www.nabble.com/Boolean-query-with-50%2C000-clauses%21-Possible--Scalable--tp24664839p24671050.html
Sent from the Lucene - General mailing list archive at Nabble.com.

