lucene-dev mailing list archives

From Jason Rutherglen <>
Subject Re: Lucene memory usage
Date Wed, 10 Jun 2009 23:23:16 GMT
Cool! Sounds like with LUCENE-1458 we can experiment with some
of these things. Does CSF become just another codec?

> I'm leery of having the terms dict live entirely on disk, though
we should certainly explore it.

Yeah, it should theoretically help with reloading; it could use
a skip list (as we have a disk-based version of that implemented)
instead of binary search. It seems like with things like
TrieRange (which potentially adds many fields and terms) it
could be useful to let the IO cache calculate what we need in
RAM and what we don't, otherwise we're constantly at risk of
exceeding heap usage. There'll be other potential RAM issues
(such as page faults), but it seems like users will constantly
be up against the inability to precalculate Java heap usage of
data structures (whereas file based data usage can be measured).
Norms are another example, and with flexible indexing (and
scoring?) there may be additional fields the user may want to
change dynamically, that if completely loaded into heap cause
OOM problems.
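The idea of keeping per-document values on disk and letting the OS page cache decide what stays resident can be sketched roughly as below. This is only an illustrative sketch, not Lucene code; the class and method names are made up. A column of fixed-width longs is read through mmap, so heap usage stays flat and measurable while the OS handles paging:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch (not Lucene's API): a column of fixed-width longs
// kept on disk and read through mmap. The OS page cache decides which
// pages stay resident, instead of the JVM heap holding the whole array.
public class MmapLongColumn {
    private final MappedByteBuffer buf;
    private final int count;

    public MmapLongColumn(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            this.buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            this.count = (int) (ch.size() / Long.BYTES);
        }
    }

    // Random access by document number; a page fault here is handled by
    // the OS, not the JVM, so Java heap usage is unaffected.
    public long get(int docId) {
        return buf.getLong(docId * Long.BYTES);
    }

    public int size() { return count; }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("csf", ".bin");
        ByteBuffer out = ByteBuffer.allocate(3 * Long.BYTES);
        out.putLong(7L).putLong(42L).putLong(-1L);
        Files.write(file, out.array());
        MmapLongColumn col = new MmapLongColumn(file);
        System.out.println(col.get(1)); // prints 42
    }
}
```

The trade-off Mike raises below applies here too: an unlucky access pattern can fault in many pages, so latency is less predictable than with heap-resident data, even though memory usage becomes observable at the OS level.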

I guess I personally think it would be great to not worry about
exceeding heap with Lucene apps (as it's a guessing game), and
then one can simply analyze the OS level IO cache/swap space to
see if the app could slow down due to the machine not having
enough RAM. I think this would remove one of the major
differences between a Java-based search engine and a C++-based
one.

On Wed, Jun 10, 2009 at 1:26 PM, Michael McCandless <> wrote:

> On Wed, Jun 10, 2009 at 4:13 PM, Jason
> Rutherglen<> wrote:
> > Great! If I understand correctly it looks like RAM savings? Will
> > there be an improvement in lookup speed? (We're using binary
> > search here?).
> Yes, sizable RAM reduction for apps that have many unique terms.  And,
> init'ing (warming) the reader should be faster.
> Lookup speed should be faster (binary search against the terms in a
> single field, not all terms).
> > Is there a precedence in database systems for what was mentioned
> > about placing the term dict, delDocs, and filters onto disk and
> > reading them from there (with the IO cache taking care of
> > keeping the data in RAM)? (Would there be a future advantage to
> > this approach when SSDs are more prevalent?) It seems like we
> > could have some generalized pluggable system where one could try
> > out this or the current heap approach, and benchmark.
> LUCENE-1458 creates exactly such a pluggable system.  I.e., it lets
> you swap in your own codec for terms, freq, prox, etc.
> But: I'm leery of having the terms dict live entirely on disk, though we
> should certainly explore it.
> > Given our continued inability to properly measure Java RAM
> > usage, this approach may be a good one for Lucene? Where heap
> > based LRU caches are a shot in the dark when it comes to mem
> > size, as we never really know how much they're using.
> Well remember mmap uses an LRU policy to decide when pages are swapped
> to disk... so a search that's unlucky can easily hit many page faults
> just in consulting the terms dict.  You could be at 200 msec cost
> before you even hit a postings list... I prefer to have the terms
> index RAM resident (of course the OS can still swap THAT out too...).
> > Once we generalize delDocs, filters, and field caches
> > (LUCENE-831?), then perhaps CSF is a good place to test out this
> > approach? We could have a generic class that handles the
> > underlying IO that simply returns values based on a position or
> > iteration.
> I agree, a CSF codec that uses mmap seems like a good place to
> start...
> Mike
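Mike's point about lookup speed — binary search against a single field's terms rather than one global terms list — can be sketched roughly like this (a minimal illustration; the class and method names are hypothetical, not Lucene's API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Lucene's API): each field keeps its own sorted
// term array, so a lookup binary-searches only that field's terms instead
// of one global list of field+term pairs.
public class PerFieldTermsDict {
    private final Map<String, String[]> byField = new HashMap<>();

    // sortedTerms must be in ascending order for binarySearch to work.
    public void addField(String field, String[] sortedTerms) {
        byField.put(field, sortedTerms);
    }

    // Returns the term's ordinal within its field, or -1 if absent.
    public int lookup(String field, String term) {
        String[] terms = byField.get(field);
        if (terms == null) return -1;
        int pos = Arrays.binarySearch(terms, term);
        return pos >= 0 ? pos : -1;
    }
}
```

Searching log2(n) entries per field rather than log2(N) over all terms is a modest win per lookup, but the larger benefit discussed above is the smaller RAM footprint and faster reader warming.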