lucene-dev mailing list archives

From Michael McCandless
Subject Re: Lucene memory usage
Date Thu, 11 Jun 2009 09:43:09 GMT
On Wed, Jun 10, 2009 at 9:24 PM, Jason Rutherglen wrote:
> I read over the LUCENE-1458 comments again. Interesting. I think
> the most compelling argument is that the various files we're
> normally loading into the heap are, after merging, in the IO
> cache. If we can simply reuse the IO cache rather than allocate
> a bunch of redundant arrays in heap, we could be better off? I
> think this is very compelling for field caches, delDocs, and
> bitsets that are tied to segments and loaded after each merge.

The OS doesn't have enough information to "know" which data structures
are important to Lucene (must stay hot) and which are less so.  Its
blind LRU approach is often a poor policy (e.g. for the terms dict, where
a binary search can suddenly need to visit a random, rarely accessed
page).
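
To make the terms dict point concrete, here is a rough sketch that counts
the distinct pages one binary search touches. The layout is hypothetical
(fixed-width 16-byte entries, 4 KB pages, entry i holding key i) and is
not Lucene's actual terms dict format; the class and method names are
made up for illustration.

```java
import java.util.TreeSet;

public class BinarySearchPages {
    static final int PAGE_SIZE = 4096;  // assumed OS page size
    static final int ENTRY_SIZE = 16;   // assumed fixed-width entry, illustrative only

    // Returns the distinct page numbers read by a binary search for `target`
    // over `numEntries` sorted entries, where entry i holds the key i.
    static TreeSet<Long> pagesTouched(long numEntries, long target) {
        TreeSet<Long> pages = new TreeSet<>();
        long lo = 0, hi = numEntries - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            pages.add(mid * ENTRY_SIZE / PAGE_SIZE); // page holding the probed entry
            if (mid < target) lo = mid + 1;
            else if (mid > target) hi = mid - 1;
            else break;
        }
        return pages;
    }

    public static void main(String[] args) {
        long numEntries = 10_000_000L;
        TreeSet<Long> pages = pagesTouched(numEntries, 1_234_567L);
        long totalPages = (numEntries * ENTRY_SIZE + PAGE_SIZE - 1) / PAGE_SIZE;
        System.out.println("distinct pages touched: " + pages.size() + " of " + totalPages);
    }
}
```

The early probes land on pages that are far apart, and most of those
pages are not touched by any other access pattern, so an LRU cache has
little reason to keep them resident.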

For example, after merging, all the segments we just *read* from will
also be hot, having flushed out other important pages from the IO
cache, which is very much not what we want to do.  From C, and per-OS,
you can inform the OS that it should not cache the bytes read from the
file, but from Java we just can't control that.

> I think it's possible to write some basic benchmarks to test a
> byte[] BitVector vs. a MappedByteBuffer BitVector and see what
> happens.

Yes, but this is challenging to test properly.  On systems with plenty
of RAM, the approaches should be similarly fast.  On systems starved
for RAM, both approaches should thrash miserably.  It's the cases in
between that we need to test for.
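
A minimal sketch of such a comparison might look like the following. It
assumes a plain one-bit-per-doc layout; BitVectorSketch, getHeap,
getMapped and mapCopyOf are hypothetical names, not Lucene's BitVector
API. It has no JIT warmup and runs with the file fully cached, so it
only exercises the plenty-of-RAM case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

public class BitVectorSketch {

    // Heap variant: bit i lives in bits[i >> 3].
    static boolean getHeap(byte[] bits, int i) {
        return (bits[i >> 3] & (1 << (i & 7))) != 0;
    }

    // Mapped variant: same layout, but read through the OS IO cache.
    static boolean getMapped(MappedByteBuffer bits, int i) {
        return (bits.get(i >> 3) & (1 << (i & 7))) != 0;
    }

    // Writes the bits to a temp file and maps it read-only.
    // The mapping stays valid after the channel is closed.
    static MappedByteBuffer mapCopyOf(byte[] bits) {
        try {
            Path file = Files.createTempFile("bitvector", ".bin");
            file.toFile().deleteOnExit(); // best effort; may fail on Windows while mapped
            Files.write(file, bits);
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int numBits = 1 << 24; // arbitrary size for the sketch
        byte[] heap = new byte[numBits >> 3];
        new Random(42).nextBytes(heap);
        MappedByteBuffer mapped = mapCopyOf(heap);

        // Random probes, shared by both variants.
        Random rnd = new Random(7);
        int[] probes = new int[1_000_000];
        for (int k = 0; k < probes.length; k++) probes[k] = rnd.nextInt(numBits);

        long t0 = System.nanoTime();
        int heapHits = 0;
        for (int p : probes) if (getHeap(heap, p)) heapHits++;
        long t1 = System.nanoTime();
        int mappedHits = 0;
        for (int p : probes) if (getMapped(mapped, p)) mappedHits++;
        long t2 = System.nanoTime();

        System.out.println("heap:   " + heapHits + " set bits, " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("mapped: " + mappedHits + " set bits, " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

To reach the in-between cases, the run would have to happen under
memory pressure (for example with other processes competing for the IO
cache), which this sketch does not attempt.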

> The other potentially interesting angle here is in regards to
> realtime updates, where we can implement a MMaped page type of
> system so blocks of this stuff can be updated in near realtime,
> directly in the MMaped space (similar to how in heap land with
> LUCENE-1526 we're looking at breaking up the byte[] into a
> byte[][]).

But carrying such updates via RAM, like we do now for deletions,
should generally be more performant (you never have to put the changes
on disk).

> Also if we assume data is MMaped I don't think it matters as much if
> the updates on disk are not in sequence? (Whereas today we try
> to keep all our files sequentially readable optimized). Of
> course I could be completely wrong. :)

Well... locality is still important.  Under the hood, mmap on a page
miss must hit the disk.

