lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Artem Vasiliev <art...@gmail.com>
Subject Re[2]: 30 milllion+ docs on a single server
Date Mon, 21 Aug 2006 22:39:05 GMT
Hi guys!

I have noticed many questions on the list vonsidering Lucene sorting
memory consumption and hope my solution can help someone.

I faced a memory/time consumption problem on sorting in Lucene back in
April. With a help of this list's experts I came to solution which I
like: documents from the sorting set (instead of given the field's
values from the whole index) are lazy-cached in a WeakHashMap so the
cached items are candidates for GC. I didn't create a patch yet (and
I'm going to make it) but the code is ready to be reused and it's used
for a while as a part of my open-source project, sharehound
(http://sharehound.sourceforge.net). The code can be now reached by a
CVS browser, it's in 4 classes in subdirectories of
http://sharehound.cvs.sourceforge.net/sharehound/jNetCrawler/src/java/org/apache/lucene/.

They (both classes, as a part of sharehound.jar, and sources) can also
be downloaded with the latest (1.1.3 alpha) sharehound release zip
file.

LazyCachingSortFactory class have an example of use in its header
comments, I duplicate it here:

/**
 * Creates a Sort that doesn't use FieldCache and doesn't therefore load all the field values
from the index while sorting.
 * Instead it uses CachingIndexReader (via CachingDocFieldComparatorSource) to cache documents
from which field values
 * are fetched. If the caller Searcher is based on CachingIndexReader itself its document
cache will be used here.
 *
 *
 * An example of use:
  ...
  hits = getQueryHits(query, getSort(listSorting.getSortFieldName(), listSorting.isSortDescending()));
  ...

  public Sort getSort(String sortFieldName, boolean sortDescending) {
    return LazyCachingSortFactory.create(sortFieldName, sortDescending);
  }
  ...
  protected Hits getQueryHits(Query query, Sort sort) throws IOException {
     IndexSearcher indexSearcher = new IndexSearcher(CachingIndexReader.decorateIfNeeded(IndexReader.open(getIndexDir())));
     return indexSearcher.search(query, sort);
  }

*/

This solution does great work saving a memory and in case of
relatively search resultsets it's almost as fast as the default
implementation. Note that this solution is ready only for single-field
sorting currently.

Best regards,
Artem

OG> This is unlikely to work well/fast.  It will depend on the
OG> size of the index (not in terms of the number of docs, but its
OG> physical size), the number of queries/second and desired query
OG> latency.  If you can wait 10 seconds to get a query and if only a
OG> few queries are hitting the server at any one time, then you may
OG> be Ok.  Having things be up to date with non-relevancy sorting
OG> will be quite tough.  FieldCache will consume some RAM.  Warming
OG> it up will take some number of seconds.  Re-opening an
OG> IndexSearcher after index changes will also cost you a bit of time.

OG> Consider a 64-bit server with more RAM that allowed larger
OG> Java heaps, and try to fit your index into RAM.

OG> Otis

OG> ----- Original Message ----
OG> From: Mark Miller <markrmiller@gmail.com>
OG> To: java-user@lucene.apache.org
OG> Sent: Saturday, August 12, 2006 7:45:15 PM
OG> Subject: Re: 30 milllion+ docs on a single server

OG> The single server is important because I think it will take a lot of
OG> work to scale it to multiple servers. The index must allow for close to
OG> real-time updates and additions. It must also remain searchable at all
OG> times (other than than during the brief period of single updates and
OG> additions). If it is easy to scale this to multiple servers please tell
OG> me how.

OG> - Mark
>> Why is a single server so important?  I can scale horizontally much
>> cheaper
>> than I scale vertically.
>>
>>
>>
>> On 8/11/06, Mark Miller <markrmiller@gmail.com> wrote:
>>>
>>> I've made a nice little archive application with lucene. I made it to
>>> handle our largest need: 2.5 million docs or so on a single server. Now
>>> the powers that be say: lets use it for a 30+ million document archive
>>> on a single server! (each doc size maybe 10k max...as small as a 1 or
>>> 2k) Please tell me why we are in trouble...please tell me why we are
>>> not. I have tested up to 2 million docs without much trouble but 30
>>> million...the average search will include a sort on a field as
>>> well...can I search 30+ million docs with a sort? Man am I worried about
>>> that. Maybe the server will have 8 procs and 12 billion gigs of RAM.
>>> Mabye. Even still, Tomcat seems to be able to launch with a max of 1.5
>>> or 1.6 gig of Ram in Windows. What do you think? 30 million+ sounds like
>>> too much of a load to me for a single server. Not that they care what I
>>> think...I only wrote the thing (man I hate my job, offer me a new one :)
>>> )...please...comments?
>>>
>>> Cheers,
>>>
>>> Miserable Mark
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>


OG> ---------------------------------------------------------------------
OG> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
OG> For additional commands, e-mail: java-user-help@lucene.apache.org





OG> ---------------------------------------------------------------------
OG> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
OG> For additional commands, e-mail: java-user-help@lucene.apache.org






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message