lucene-java-user mailing list archives

From Artem Vasiliev <>
Subject Re[2]: 30 milllion+ docs on a single server
Date Mon, 21 Aug 2006 22:39:05 GMT
Hi guys!

I have noticed many questions on the list concerning Lucene's sorting
memory consumption, and I hope my solution can help someone.

I faced a memory/time consumption problem with sorting in Lucene back in
April. With the help of this list's experts I came to a solution which I
like: documents from the sorting set (instead of the given field's
values from the whole index) are lazily cached in a WeakHashMap, so the
cached items are candidates for GC. I haven't created a patch yet (but
I'm going to); the code is ready to be reused, and it has been in use
for a while as part of my open-source project, sharehound
( The code can now be reached via a
CVS browser; it's in 4 classes in subdirectories of

Both the classes (as part of sharehound.jar) and the sources can also
be downloaded with the latest (1.1.3 alpha) sharehound release zip.
The LazyCachingSortFactory class has an example of use in its header
comments, which I duplicate here:

 * Creates a Sort that doesn't use FieldCache and therefore doesn't load all the field values from the index while sorting.
 * Instead it uses CachingIndexReader (via CachingDocFieldComparatorSource) to cache the documents from which field values are fetched.
 * If the caller Searcher is based on CachingIndexReader itself, its document cache will be used here.
 * An example of use:
  hits = getQueryHits(query, getSort(listSorting.getSortFieldName(), listSorting.isSortDescending()));

  public Sort getSort(String sortFieldName, boolean sortDescending) {
    return LazyCachingSortFactory.create(sortFieldName, sortDescending);
  }

  protected Hits getQueryHits(Query query, Sort sort) throws IOException {
    // "indexReader" stands for your underlying IndexReader instance;
    // decorateIfNeeded wraps it so its document cache is shared.
    IndexSearcher indexSearcher = new IndexSearcher(CachingIndexReader.decorateIfNeeded(indexReader));
    return, sort);
  }


This solution does a great job of saving memory, and for relatively
small search result sets it's almost as fast as the default
implementation. Note that this solution currently supports only
single-field sorting.

Best regards,

OG> This is unlikely to work well/fast.  It will depend on the
OG> size of the index (not in terms of the number of docs, but its
OG> physical size), the number of queries/second and desired query
OG> latency.  If you can wait 10 seconds for query results and if only a
OG> few queries are hitting the server at any one time, then you may
OG> be OK.  Having things be up to date with non-relevancy sorting
OG> will be quite tough.  FieldCache will consume some RAM.  Warming
OG> it up will take some number of seconds.  Re-opening an
OG> IndexSearcher after index changes will also cost you a bit of time.

OG> Consider a 64-bit server with more RAM that allowed larger
OG> Java heaps, and try to fit your index into RAM.
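A rough back-of-envelope for the FieldCache RAM cost mentioned above. The per-entry sizes below are assumptions for illustration (reference slot, String object overhead, UTF-16 chars), not measured figures:

```java
// FieldCache keeps one entry per document for the sort field.
// Crude estimate for an index sorted on a String field, using
// assumed (not measured) per-entry sizes.
public class FieldCacheEstimate {
    public static long estimateBytes(long docs, int avgChars) {
        long perEntry = 8              // array slot (reference)
                      + 40             // rough String object overhead
                      + 2L * avgChars; // UTF-16 characters
        return docs * perEntry;
    }

    public static void main(String[] args) {
        // 30M docs, ~20-char sort field: on the order of a few GB.
        long bytes = estimateBytes(30_000_000L, 20);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```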

OG> Otis

OG> ----- Original Message ----
OG> From: Mark Miller <>
OG> To:
OG> Sent: Saturday, August 12, 2006 7:45:15 PM
OG> Subject: Re: 30 milllion+ docs on a single server

OG> The single server is important because I think it will take a lot of
OG> work to scale it to multiple servers. The index must allow for close to
OG> real-time updates and additions. It must also remain searchable at all
OG> times (other than during the brief period of single updates and
OG> additions). If it is easy to scale this to multiple servers please tell
OG> me how.

OG> - Mark
>> Why is a single server so important?  I can scale horizontally much
>> cheaper
>> than I scale vertically.
>> On 8/11/06, Mark Miller <> wrote:
>>> I've made a nice little archive application with lucene. I made it to
>>> handle our largest need: 2.5 million docs or so on a single server. Now
>>> the powers that be say: lets use it for a 30+ million document archive
>>> on a single server! (each doc size maybe 10k small as a 1 or
>>> 2k) Please tell me why we are in trouble...please tell me why we are
>>> not. I have tested up to 2 million docs without much trouble but 30
>>> million...the average search will include a sort on a field as
>>> well...can I search 30+ million docs with a sort? Man am I worried about
>>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>>> Maybe. Even so, Tomcat seems to be able to launch with a max of 1.5
>>> or 1.6 gig of Ram in Windows. What do you think? 30 million+ sounds like
>>> too much of a load to me for a single server. Not that they care what I
>>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>>> )...please...comments?
>>> Cheers,
>>> Miserable Mark
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:


