lucene-dev mailing list archives

From Paul Smith <>
Subject Large scale sorting
Date Sat, 07 Apr 2007 02:46:06 GMT
A discussion on the user list brought my mind to the longer-term
scalability issues of Lucene.  Lucene is inherently memory efficient,
except for sorting, where the inverted nature of the index works
against the need to have a value on hand for each document being sorted.

I'm hoping this discussion will stimulate some thought about using
Lucene in environments with truly big indexes that need efficient
multi-field sorting without requiring a large heap to manage it.  I'm
not really being altruistic here, because the way we use Lucene would
really benefit from some improvements in this area.

There have been some patches outlined that use a WeakHashMap
implementation, but I'm not sure that's really the best long-term
strategy. I've been looking over FieldCache (and its Impl) and the
FieldSortedHitQueue classes, and I wonder if some use of disk might
work for sorting on big indexes.

Conceptually, to sort in Lucene, you must have:

1) For a given doc ID, its 'relationship' to other docs for a given
field/locale combination.  If each doc ID is ordered based on this,
then one can simply compare two docs' positions within the structure.
This is how FieldSortedHitQueue does it.

2) The raw value of a document's field, so that the merge sort can
complete when merging results from multiple segments.

Achieving point 1 is relatively efficient in terms of memory: it's
simply one array of size numDocs, where the value at each index is
that doc's numerical position within the sorted list of terms.  Point
2 is where the memory can go off the charts, particularly for
Strings, because to make sorting efficient over the long haul, these
terms are cached in memory for each doc.  This is why everyone
recommends 'warming up' a large index; it's purely loading this stuff
into memory!
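To make point 1 concrete, here's a minimal sketch of comparing two docs by a precomputed ordinal array (all names here are illustrative, not Lucene's actual internals):

```java
// Sketch of point 1: order[docId] holds each doc's position within the
// sorted list of terms, so comparing two docs is just two array lookups.
public class OrdinalCompare {
    // Ordinals are small non-negative ints, so plain subtraction is safe.
    static int compare(int[] order, int docA, int docB) {
        return order[docA] - order[docB];
    }

    public static void main(String[] args) {
        // Sorted term order is "apple" < "banana" < "cherry", so:
        //   doc 0 -> "cherry" -> ordinal 2
        //   doc 1 -> "apple"  -> ordinal 0
        //   doc 2 -> "banana" -> ordinal 1
        //   doc 3 -> "apple"  -> ordinal 0
        int[] order = {2, 0, 1, 0};

        // One int per doc is all the memory this step needs.
        System.out.println(compare(order, 0, 1) > 0); // true: doc 0 sorts after doc 1
        System.out.println(compare(order, 1, 3) == 0); // true: tied on this field
    }
}
```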

If you have large indices and require sorting on multiple fields, you
can very easily saturate a heap.  In our case we get hit by another
multiplier: Locale.  If you need to sort by the same set of fields in
multiple Locales... (we currently support English, Chinese, French
and Japanese sorting, with plans for other languages at some
point...)

I'm just going to throw this idea up as a sort of straw-man,  
hopefully to get some discussion going.

Rather than holding the, say, String[] of terms for a String field in
memory, what about writing them to disk in an offset-indexed
memory-mapped file (NIO)?  The file structure might look something like:

Header - For each doc #, a byte offset start position into the data
block, plus a length.  So, two 4-byte ints per doc.
Data - At that location, each doc's term data is written.
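A rough sketch of writing that layout might look like the following (the class name and API are hypothetical, and offsets are relative to the start of the data block, as described above):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical writer for the proposed on-disk sort cache: a fixed-width
// header of (offset, length) int pairs, one per doc, followed by the raw
// term bytes for every doc.
public class SortCacheWriter {
    public static void write(File f, String[] termPerDoc) throws IOException {
        byte[][] data = new byte[termPerDoc.length][];
        for (int i = 0; i < termPerDoc.length; i++) {
            data[i] = termPerDoc[i].getBytes("UTF-8");
        }
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)));
        try {
            int offset = 0;
            for (int i = 0; i < data.length; i++) {   // header: 2 ints per doc
                out.writeInt(offset);
                out.writeInt(data[i].length);
                offset += data[i].length;
            }
            for (int i = 0; i < data.length; i++) {   // data block
                out.write(data[i]);
            }
        } finally {
            out.close();
        }
    }
}
```

For n docs the header is always exactly 8n bytes, which is what makes the later per-doc seek a simple multiplication.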

For a given index of 10 million documents, with an average field
length of 43 characters (a sample taken from one of our production
DBs, for mail subject lines), the file size would be roughly:

* Header = 76 MiB
* Data = 410 MiB

(This just shows why holding the String[] in memory is not scalable.
Yes, one can get reductions via String interning, but one should
always plan for the worst case, where field values are all unique and
interning saves nothing.)

Performing the lookup of a given doc's term text for sorting would
require:

* Seek to (doc # x 8 bytes) in the header
* Read the offset and length
* Seek to the data location, read 'length' bytes and convert to, say, a String
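Those three steps can be sketched against the header/data layout described above, using an NIO MappedByteBuffer (class and method names are hypothetical):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical reader for the proposed layout: seek to docId * 8 in the
// header, read (offset, length), then pull the term bytes out of the data
// block. The OS page cache, not the Java heap, holds the hot data.
public class SortCacheReader {
    private final MappedByteBuffer buf;
    private final int numDocs;

    public SortCacheReader(File f, int numDocs) throws IOException {
        this.numDocs = numDocs;
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        this.buf = raf.getChannel().map(
                FileChannel.MapMode.READ_ONLY, 0, raf.length());
        raf.close();
    }

    public String term(int docId) throws IOException {
        int offset = buf.getInt(docId * 8);       // header entry: offset
        int length = buf.getInt(docId * 8 + 4);   // header entry: length
        int dataStart = numDocs * 8;              // data block follows the header
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buf.get(dataStart + offset + i);
        }
        return new String(bytes, "UTF-8");
    }
}
```

Note that the absolute-position gets leave no state in the buffer, so a reader like this could be shared across searches.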

Advantages to this approach:
* Almost no memory is retained for sorting.  This means one could
hold many more large indexes open.
* When a large index is closed, there's less GC activity, since less
sorting data was retained.
* One can now sort on any number of fields, for practically any sized
index, in any number of locales.

Disadvantages to this approach:
* It's a lot more I/O intensive
* It would be slower than the current sorting
* It would result in more GC activity while sorting (transient
objects created per lookup)

I'm wondering, then, if the sorting infrastructure could be
refactored with some sort of policy/strategy, whereby one can choose
a point beyond which one is not willing to use memory for sorting,
and is instead willing to pay the disk penalty to get the sort done
over a large index without allocating a massive gob of heap to do it.
This would allow the standard behavior to continue, but once an index
reaches a threshold it would no longer scale hideously.
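One shape that policy/strategy split might take is a pluggable sort-value source chosen by a size threshold. Everything below is hypothetical; none of these names are existing Lucene APIs:

```java
// Sketch of a pluggable sort-value source with a size-based policy:
// small indexes keep the current heap-cached behaviour, big ones fall
// back to on-demand disk lookups.
public class SortPolicyDemo {
    interface SortValueSource {
        String term(int docId);   // raw value used by the merge sort
    }

    // Current behaviour: every term cached on the heap.
    static class HeapSource implements SortValueSource {
        private final String[] terms;
        HeapSource(String[] terms) { this.terms = terms; }
        public String term(int docId) { return terms[docId]; }
    }

    // Proposed fallback: values fetched from the mapped file on demand.
    // (Stubbed here; a real version would read the header/data layout.)
    static class DiskSource implements SortValueSource {
        public String term(int docId) { return lookupOnDisk(docId); }
        private String lookupOnDisk(int docId) { return "doc-" + docId; }
    }

    static SortValueSource choose(int numDocs, int maxDocsInMemory,
                                  String[] terms) {
        return numDocs <= maxDocsInMemory
                ? new HeapSource(terms) : new DiskSource();
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana"};
        System.out.println(choose(terms.length, 1000, terms)
                .getClass().getSimpleName());                 // HeapSource
        System.out.println(choose(2000000, 1000, null)
                .getClass().getSimpleName());                 // DiskSource
    }
}
```

The threshold could equally be a byte budget rather than a doc count; the point is only that the choice sits behind one interface that FieldSortedHitQueue consumes.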

I'm working on the basis that it's a LOT harder/more expensive to
simply allocate more heap to cover the current sorting
infrastructure; one hits memory limits faster.  Not everyone can
afford 64-bit hardware with many GB of RAM to allocate to a heap.  It
_is_ cheaper/easier to build a disk subsystem to tune this I/O
approach, and any RAM can still serve as buffer cache for the
memory-mapped file anyway.

I'm almost certain to have screwed up a calculation somewhere, so  
feel free to point out weaknesses/gaps etc.

To accomplish this would require a substantial change to
FieldSortedHitQueue et al., and I realize that the use of NIO
immediately pins Lucene to Java 1.4, so I'm sure this is
controversial.  But if we wish Lucene to go beyond where it is now, I
think we need to start thinking about this particular problem sooner
rather than later.

Happy Easter to all,

Paul Smith
