lucene-java-user mailing list archives

From Vitaly Funstein <>
Subject Re: BlockTreeTermsReader consumes crazy amount of memory
Date Wed, 27 Aug 2014 23:41:40 GMT

Here's the screenshot; not sure if it will go through as an attachment
though - if not, I'll post it as a link. Please ignore the altered package
names, since Lucene is shaded in as part of our build process.

Some more context about the use case. Yes, the terms are pretty much
unique; the schema for the data set is actually borrowed from here: - it's the UserVisits
set, with a couple of other fields added by us. The values for the fields
are generated almost randomly, though some string fields are picked at
random from a fixed dictionary.
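For concreteness, document construction looks roughly like this - a sketch against the Lucene 4.6 field API, not our exact code. The field names follow the UserVisits table; `visitCount` and the `randomFrom` helper are hypothetical stand-ins for the extra fields and dictionary lookup:

```java
import java.util.List;
import java.util.Random;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;

class DocGen {
    // Hypothetical helper: pick a random entry from a fixed dictionary.
    static String randomFrom(List<String> dict, Random rnd) {
        return dict.get(rnd.nextInt(dict.size()));
    }

    // One document in the UserVisits-style schema: string fields drawn
    // from a fixed dictionary, numeric fields generated almost randomly,
    // everything both indexed and stored.
    static Document makeDoc(List<String> dict, Random rnd) {
        Document doc = new Document();
        doc.add(new StringField("countryCode", randomFrom(dict, rnd), Store.YES));
        doc.add(new StringField("searchWord", randomFrom(dict, rnd), Store.YES));
        doc.add(new LongField("duration", rnd.nextLong(), Store.YES));
        doc.add(new IntField("visitCount", rnd.nextInt(), Store.YES));
        doc.add(new FloatField("adRevenue", rnd.nextFloat(), Store.YES));
        // Date field indexed as epoch millis in a LongField.
        doc.add(new LongField("visitDate", System.currentTimeMillis(), Store.YES));
        return doc;
    }
}
```

Since the values are near-random, almost every indexed term ends up unique, which is the worst case for the terms dictionary.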

Also, this type of heap footprint might be tolerable if it stayed
relatively constant throughout the system's life cycle (of course, given
the index set stays more or less static). However, what happens here is
that one IndexReader reference is maintained by ReaderManager as an NRT
reader. But we would also like to support the ability to execute searches
against specific index commit points, ideally in parallel. As you might
imagine, as soon as a new DirectoryReader is opened at a given commit, a
whole new set of SegmentReader instances is created and populated,
effectively doubling the already large heap usage... if there were a way to
somehow reuse readers for unchanged segments already pooled by IndexWriter,
that would help tremendously here. But I don't think there's a way to link
up the two sets, at least not in the Lucene version we are using (4.6.1) -
is this correct?
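To make the two code paths concrete, here is a rough sketch against the Lucene 4.6 API (class and method names are real; the wrapper class and variable names are mine):

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ReaderManager;
import org.apache.lucene.store.Directory;

class ReaderPaths {
    // NRT path: the ReaderManager keeps one reader current against the
    // writer, which pools SegmentReaders internally.
    static ReaderManager nrtManager(IndexWriter writer) throws IOException {
        return new ReaderManager(writer, true); // applyAllDeletes = true
    }

    // Point-in-time path: opening a reader directly at a commit builds a
    // brand-new set of SegmentReaders, even for segments the NRT reader
    // already holds open - hence the doubled heap.
    static DirectoryReader openAtCommit(Directory dir, int nthCommit) throws IOException {
        List<IndexCommit> commits = DirectoryReader.listCommits(dir);
        return DirectoryReader.open(commits.get(nthCommit));
    }
}
```

One partial mitigation (an assumption on my part, worth verifying): `DirectoryReader.openIfChanged(oldReader, commit)` can reuse SegmentReaders from an already-open reader for unchanged segments, which may spread the cost across successive commit-point readers - though as far as I can tell it still does not tap into the pool held by IndexWriter.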

On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless <> wrote:

> This is surprising: unless you have an excessive number of unique
> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> But you only have 12 unique fields?
> Can you post screen shots of the heap usage?
> Mike McCandless
> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <>
> wrote:
> > This is a follow up to the earlier thread I started to understand memory
> > usage patterns of SegmentReader instances, but I decided to create a
> > separate post since this issue is much more serious than the heap overhead
> > created by use of stored field compression.
> >
> > Here is the use case, once again. The index totals around 300M documents,
> > with 7 string, 2 long, 1 integer, 1 date and 1 float fields, which are
> > both indexed and stored. It is split into 4 shards, which are basically
> > separate indices... if that matters. After the index is populated (but not
> > optimized, since we don't do that), the overall heap usage taken up by
> > Lucene is over 1 GB, much of which is taken up by instances of
> > BlockTreeTermsReader. For instance, for the largest segment in one such
> > index, the retained heap size of the internal tree map is around 50 MB.
> > This is evident from heap dump analysis; I have screenshots that I can
> > post here, if that helps. As there are many segments of various sizes in
> > the index, as expected, the total heap usage for one shard stands at
> > around 280 MB.
> >
> > Could someone shed some light on whether this is expected, and if so -
> > how could I possibly trim down memory usage here? Is there a way to
> > switch to a different terms index implementation, one that doesn't
> > preload all the terms into RAM, or only does this partially, i.e. as a
> > cache? I'm not sure if I'm framing my questions correctly, as I'm
> > obviously not an expert on Lucene's internals, but this is going to
> > become a critical issue for large scale use cases of our system.
