lucene-dev mailing list archives

From Michael McCandless
Subject Re: Lucene memory usage
Date Wed, 10 Jun 2009 20:26:31 GMT
On Wed, Jun 10, 2009 at 4:13 PM, Jason Rutherglen wrote:
> Great! If I understand correctly it looks like RAM savings? Will
> there be an improvement in lookup speed? (We're using binary
> search here?).

Yes, sizable RAM reduction for apps that have many unique terms.  And,
init'ing (warming) the reader should be faster.

Lookup speed should be faster (binary search against the terms in a
single field, not all terms).
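
To make the lookup point concrete, here is a minimal sketch of a
per-field binary search.  It is not the LUCENE-1458 code: the class
PerFieldTermsIndex and its methods are hypothetical names made up for
illustration, and terms are plain Strings for simplicity.  The only
point is that each search runs over one field's sorted terms rather
than over every term in the index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Lucene's actual terms index.
final class PerFieldTermsIndex {
    // Each field owns its own sorted array of terms.
    private final Map<String, String[]> termsByField = new HashMap<>();

    // Stores a sorted copy of the given terms under the field name.
    void addField(String field, String[] terms) {
        String[] sorted = terms.clone();
        Arrays.sort(sorted);
        termsByField.put(field, sorted);
    }

    // Binary search within a single field's terms.
    // Returns the term's position in that field's sorted array, or -1 if
    // the field or the term is absent.
    int lookup(String field, String term) {
        String[] terms = termsByField.get(field);
        if (terms == null) {
            return -1;
        }
        int pos = Arrays.binarySearch(terms, term);
        return pos >= 0 ? pos : -1;
    }
}
```

With N total terms spread over several fields, each lookup above
searches only the terms of the requested field, so it does fewer
comparisons than one search over all N.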

> Is there a precedence in database systems for what was mentioned
> about placing the term dict, delDocs, and filters onto disk and
> reading them from there (with the IO cache taking care of
> keeping the data in RAM)? (Would there be a future advantage to
> this approach when SSDs are more prevalent?) It seems like we
> could have some generalized pluggable system where one could try
> out this or the current heap approach, and benchmark.

LUCENE-1458 creates exactly such a pluggable system.  I.e., it lets you
swap in your own codec for terms, freq, prox, etc.

But: I'm leery of having the terms dict live entirely on disk, though we
should certainly explore it.

> Given our continued inability to properly measure Java RAM
> usage, this approach may be a good one for Lucene? Where heap
> based LRU caches are a shot in the dark when it comes to mem
> size, as we never really know how much they're using.

Well, remember that with mmap the OS uses an LRU-style policy to decide
when pages are evicted... so a search that's unlucky can easily hit many
page faults just in consulting the terms dict.  You could be at 200 msec
cost before you even hit a postings list... I prefer to have the terms
index RAM resident (of course the OS can still swap THAT out too...).
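
A rough sketch of the two access styles, for anyone who wants to
benchmark them.  Everything here is an assumption for illustration:
MappedIntValues, writeTemp, open, get and toHeapArray are hypothetical
names, and the file is just a flat run of 4-byte ints, not any real
Lucene format.  get() reads through the mapping, so a cold page costs a
page fault; toHeapArray() copies everything onto the Java heap up front,
which is the "RAM resident" style.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not a Lucene class or file format.
final class MappedIntValues {
    private final ByteBuffer buffer; // memory-mapped view of the file
    private final int count;         // number of ints in the file

    private MappedIntValues(ByteBuffer buffer, int count) {
        this.buffer = buffer;
        this.count = count;
    }

    // Writes the values as big-endian 4-byte ints to a temp file.
    static Path writeTemp(int[] values) {
        try {
            Path path = Files.createTempFile("values", ".bin");
            path.toFile().deleteOnExit();
            ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
            for (int v : values) {
                buf.putInt(v);
            }
            Files.write(path, buf.array());
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the whole file read-only; the mapping stays valid after the
    // channel is closed.
    static MappedIntValues open(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return new MappedIntValues(mapped, (int) (size / Integer.BYTES));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int size() {
        return count;
    }

    // Reads one value by position through the mapping (may page-fault).
    int get(int position) {
        return buffer.getInt(position * Integer.BYTES);
    }

    // Copies every value onto the Java heap.
    int[] toHeapArray() {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = get(i);
        }
        return out;
    }
}
```

Timing get() on a cold mapping against reads from the heap array is one
way to see the page-fault cost described above, though the numbers will
depend heavily on the OS, the disk, and how warm the IO cache is.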

> Once we generalize delDocs, filters, and field caches
> (LUCENE-831?), then perhaps CSF is a good place to test out this
> approach? We could have a generic class that handles the
> underlying IO that simply returns values based on a position or
> iteration.

I agree, a CSF codec that uses mmap seems like a good place to

