lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Large scale sorting
Date Mon, 09 Apr 2007 18:18:16 GMT
Paul Smith wrote:
> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive.  Queries matching more than a few 
hundred documents will take several seconds to sort, since a random disk 
access is required per matching document.  Such an approach is only 
practical if you can guarantee that queries match fewer than a hundred 
documents, which is not generally the case, especially with large 
collections.
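The cost above is easy to see with a back-of-the-envelope sketch (this is not Lucene code; the 8 ms average seek time is an assumed figure for 2007-era disks, and the model assumes one random access per matching document):

```java
// Rough model: sorting cost when every matching document's sort key
// must be fetched with a random disk seek.
public class SortSeekEstimate {

    // matchingDocs random accesses, each costing seekMillis on average.
    static double sortMillis(int matchingDocs, double seekMillis) {
        return matchingDocs * seekMillis;
    }

    public static void main(String[] args) {
        // 500 matching docs at an assumed 8 ms per seek -> 4000 ms,
        // i.e. several seconds per query, as described above.
        System.out.println(sortMillis(500, 8.0) + " ms");
    }
}
```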

> I'm working on the basis that it's a LOT harder/more expensive to simply 
> allocate more heap size to cover the current sorting infrastructure. 
> One hits memory limits faster.  Not everyone can afford 64-bit hardware 
> with many GB of RAM to allocate to a heap.  It _is_ cheaper/easier to build 
> a disk subsystem to tune this I/O approach, and one can still use any 
> RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per 
query as collections grow to around 10M documents (in round figures and 
with lots of assumptions).  Thus, searching on a single CPU becomes less 
practical as collections grow substantially larger than 10M documents, 
and distributed solutions are required.  So it would be convenient if 
sorting were also practical for ~10M document collections on standard 
hardware.  If 10M strings of 20 characters each are required in memory for 
efficient search, this requires 400MB.  That is a lot, but not an 
unusual amount on today's machines.  However, if you have a large number 
of fields, then this approach may be problematic and force you to 
consider a distributed solution earlier than you might otherwise.
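The 400MB figure follows from Java's 2-byte chars: a minimal sketch of the arithmetic (it counts raw character storage only and ignores per-object and array overhead, which would push the real footprint higher):

```java
// Estimate the raw memory needed to hold one sort key per document
// entirely in memory, as the in-memory sorting approach requires.
public class SortCacheEstimate {

    // Java stores strings as UTF-16, so each character takes 2 bytes.
    static long bytesForSortKeys(long numDocs, int charsPerKey) {
        return numDocs * charsPerKey * 2L;
    }

    public static void main(String[] args) {
        // 10M docs x 20 chars x 2 bytes = 400,000,000 bytes, i.e. ~400MB.
        long bytes = bytesForSortKeys(10_000_000L, 20);
        System.out.println(bytes + " bytes");
    }
}
```

Note that this is per sorted field, which is why many sort fields multiply the cost and can force distribution sooner.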

