lucene-java-user mailing list archives

From Claude Devarenne <cla...@library.ucsf.edu>
Subject Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Date Wed, 19 May 2004 14:09:28 GMT
Thanks, I will look at the sorting code.  Sorting results by date is
next on my list.  For now, I only have a small number of documents, but
the set will grow to over 8 million documents for the collection I am
working on.  Another collection we have is 40 million documents or so.
From what you say, it seems to me that sorting will not scale when
I get to a larger number of documents.  I am considering using an SQL
back end to implement sorting: bring back the unique IDs from Lucene
and then sort in SQL.

Claude

On May 18, 2004, at 11:23 PM, Morus Walter wrote:

> Claude Devarenne writes:
>> Hi,
>>
>> I have over 60,000 documents in my index, which is slightly over 1 GB
>> in size.  The documents range from the late seventies up to now.  I
>> have indexed dates as a keyword field using a string because the dates
>> are in YYYYMMDD format.  When I do range queries things are OK as long
>> as I don't exceed the built-in number of boolean clauses, so that's a
>> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
>> complex queries but also want to query over long ranges, e.g.
>> [19790101 TO 19991231].
>>
>> Given these requirements, I am thinking of doing a query without the
>> date range, bringing the unique ids back from the hits and then doing
>> a date query in the SQL database I have that contains the same data.
>> Another alternative is to do the query without the date range in
>> Lucene and then sort the results within the range.  I still have to
>> learn how to use the new sorting code and confess I have not had time
>> to look at it yet.
>>
>> Is there a simpler, easier way to do this?
>>
> I think it would be worth taking a look at the sorting code.
>
> The idea of the sorting code is to have an array of the dates for each
> doc in memory and access this array for sorting.
> Now sorting isn't the only thing one might use this array for.
> Doing a range check is another.
> So you might extend the sorting code by a range selection.
>
> There is no code for this in Lucene and you have to create your own
> searcher, but it gives you a fast way to search and sort by date.
>
> I did this independently of the new sorting code (I just started a
> little too early) and it works quite well.
> The only drawback of this (and of the new sorting code) is that it
> requires an array of field values that must be rebuilt each time the
> index changes.
> That shouldn't be a problem for 60,000 documents.
>
> Morus
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
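
The array idea Morus describes above can be sketched roughly as follows.
This is only an illustration in plain Java, not Lucene code and not
Morus's implementation: the class DateRangeFilter, the datesByDoc array
and the filterAndSort method are all hypothetical names, and the sketch
assumes each document's YYYYMMDD date has been parsed into an int and
stored at the index of its document number.

```java
import java.util.Arrays;

public class DateRangeFilter {
    // Hypothetical in-memory cache: datesByDoc[docId] holds that document's
    // date as an int in YYYYMMDD form (e.g. 19790101). As noted in the
    // thread, an array like this must be rebuilt when the index changes.
    private final int[] datesByDoc;

    public DateRangeFilter(int[] datesByDoc) {
        this.datesByDoc = datesByDoc;
    }

    // Keep only the hits whose date lies in [from, to] (inclusive),
    // then order the surviving document ids by ascending date.
    public int[] filterAndSort(int[] hitDocIds, int from, int to) {
        return Arrays.stream(hitDocIds)
                .filter(id -> datesByDoc[id] >= from && datesByDoc[id] <= to)
                .boxed()
                .sorted((a, b) -> Integer.compare(datesByDoc[a], datesByDoc[b]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        // Five made-up documents, ids 0..4.
        int[] dates = {19790315, 20010704, 19851120, 19991231, 19781201};
        DateRangeFilter filter = new DateRangeFilter(dates);

        // Pretend these ids came back from a query run without a date range.
        int[] hits = {3, 1, 2, 0, 4};
        int[] inRange = filter.filterAndSort(hits, 19790101, 19991231);
        System.out.println(Arrays.toString(inRange)); // prints [0, 2, 3]
    }
}
```

Because the range check is a pair of integer comparisons per hit, no
boolean clauses are generated, however wide the range is.  Under the
assumption of one 4-byte int per document, such an array would take about
32 MB for 8 million documents and about 160 MB for 40 million.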



