lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Smith <>
Subject Re: Large scale sorting
Date Mon, 09 Apr 2007 22:15:57 GMT

On 10/04/2007, at 4:18 AM, Doug Cutting wrote:

> Paul Smith wrote:
>> Disadvantages to this approach:
>> * It's a lot more I/O intensive
> I think this would be prohibitive.  Queries matching more than a  
> few hundred documents will take several seconds to sort, since  
> random disk accesses are required per matching document.  Such an  
> approach is only practical if you can guarantee that queries match  
> fewer than a hundred documents, which is not generally the case,  
> especially with large collections.

I don't disagree with the premise that it involves substantial I/O  
and would increase the time taken to sort, and why this approach  
shouldn't be the default mechanism, but it's not too difficult to  
build a disk I/O subsystem that can allocate many spindles to service  
this and to allow the underlying OS to use it's buffer cache (yes  
this is sounding like a database server now isn't it).

>> I'm working on the basis that it's a LOT harder/more expensive to  
>> simply allocate more heap size to cover the current sorting  
>> infrastructure.   One hits memory limits faster.  Not everyone can  
>> afford 64-bit hardware with many Gb RAM to allocate to a heap.  It  
>> _is_ cheaper/easier to build a disk subsystem to tune this I/O  
>> approach, and one can still use any RAM as buffer cache for the  
>> memory-mapped file anyway.
> In my experience, raw search time starts to climb towards one  
> second per query as collections grow to around 10M documents (in  
> round figures and with lots of assumptions).  Thus, searching on a  
> single CPU is less practical as collections grow substantially  
> larger than 10M documents, and distributed solutions are required.   
> So it would be convenient if sorting is also practical for ~10M  
> document collections on standard hardware.  If 10M strings with 20  
> characters are required in memory for efficient search, this  
> requires 400MB.  This is a lot, but not an unusual amount on todays  
> machines.  However, if you have a large number of fields, then this  
> approach may be problematic and force you to consider a distributed  
> solution earlier than you might otherwise.

400Mb is not a lot in of itself, but when one has many of these types  
of indexes, with many sorting fields with many locales on the same  
host it becomes problematic.  I'm sure there's a point where  
distributing doesn't work over really large collections, because even  
if one partitioned an index across many hosts, one still needs to  
merge sort the results together.

It would be disappointing if Lucene's innate design limited itself to  
10M document collections before needing to consider distributed  
solutions.  10M is not that many.   It would be better if the sorting  
mechanism in Lucene was a little more decoupled such that more  
customised designs could be utilitised for specific scenarios.  Right  
now it's a one-for-all approach without substantial gutting of the code.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message