lucene-java-user mailing list archives

From craiglan...@gmail.com
Subject Re: BlockTreeTermsReader consumes crazy amount of memory
Date Fri, 29 Aug 2014 00:41:41 GMT
-(FST)=
Sent from my BlackBerry® smartphone

-----Original Message-----
From: Michael McCandless <lucene@mikemccandless.com>
Date: Thu, 28 Aug 2014 15:49:30 
To: Lucene Users<java-user@lucene.apache.org>
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Ugh, you're right: this still won't re-use from IW's reader pool.  Can
you open an issue?  Somehow we should make this easier.

In the meantime, I guess you can use openIfChanged from your "back in
time" reader to open another "back in time" reader.  This way you have
two pools... IW's pool for the series of NRT readers, and another pool
shared by the "back in time" readers ... but we should somehow fix
this so it's one pool.
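
A rough sketch of that workaround (reader and commit variable names here
are just illustrative, not from your code):

  // Open the first "back in time" reader directly from a commit point:
  DirectoryReader snapshot = DirectoryReader.open(commit1);

  // Later, derive the next snapshot from it instead of opening from
  // scratch; openIfChanged shares SegmentReaders for the segments the
  // two commits have in common:
  DirectoryReader next = DirectoryReader.openIfChanged(snapshot, commit2);
  if (next == null) {
    next = snapshot; // nothing changed between the two commits
  }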

OK, it looks like it's the FST terms index, and yes, synthetic terms give
you synthetic results :)  However, to reduce the FST RAM here you can
just increase the block sizes used by the terms index (see
BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST), but
possibly slower searches, especially for MultiTermQueries ...
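
For example (untested sketch against 4.6; the 64/128 values are purely
illustrative - the defaults are 25/48):

  public class BigBlockCodec extends Lucene46Codec {
    // Lucene41PostingsFormat takes (minTermBlockSize, maxTermBlockSize),
    // which it passes through to BlockTreeTermsWriter:
    private final PostingsFormat postings = new Lucene41PostingsFormat(64, 128);

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      return postings; // apply the larger block sizes to every field
    }
  }

  // then: indexWriterConfig.setCodec(new BigBlockCodec());

Note the codec only affects newly written segments, so you'd need to
reindex (or let merges rewrite the old segments) to see the difference.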

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein <vfunstein@gmail.com> wrote:
> Thanks, Mike - I think the issue is actually the latter, i.e. SegmentReader
> on its own can certainly use enough heap to cause problems, which of course
> would be made that much worse by failure to pool readers for unchanged
> segments.
>
> But where are you seeing the behavior that would result in reuse of
> SegmentReaders from the pool inside the index writer? If I'm reading the
> code right, here's what it calls:
>
>   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> throws IOException {
>     ensureOpen();
>
>     // If we were obtained by writer.getReader(), re-ask the
>     // writer to get a new reader.
>     if (writer != null) {
>       return doOpenFromWriter(commit);
>     } else {
>       return doOpenNoWriter(commit);
>     }
>   }
>
>   private DirectoryReader doOpenFromWriter(IndexCommit commit) throws
> IOException {
>     if (commit != null) {
>       return doOpenFromCommit(commit);
>     }
> ......
>
> There is no attempt made to inspect the segments inside the commit point
> here, for possible reader pool reuse.
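>
> For context, our "back in time" open is essentially just this (simplified;
> pickCommit is a made-up name for our own selection logic):
>
>   // old commits are kept alive, e.g. via SnapshotDeletionPolicy, then:
>   List<IndexCommit> commits = DirectoryReader.listCommits(directory);
>   IndexCommit pointInTime = pickCommit(commits);
>   DirectoryReader backInTime = DirectoryReader.open(pointInTime);
>
> which, as far as I can tell, always builds a fresh set of SegmentReaders.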
>
> So here's a drill-down into the SegmentReader memory footprint. There
> aren't actually 88 fields here - rather, this number reflects the "shallow"
> heap size of the BlockTreeTermsReader instance, i.e. its size calculated
> without following any of the references from it (at depth 0).
>
> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
>
> I suppose totally randomly generated field values are a bit of a contrived
> use case, since in the real world there will be far less randomness to each,
> but perhaps this gives us an idea of the worst-case scenario... just
> guessing, though.
>
>
>
> On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> Can you drill down some more to see what's using those ~46 MB?  Is it
>> the FSTs in the terms index?
>>
>> But, we need to decouple the "single segment is opened with multiple
>> SegmentReaders" issue from e.g. "single SegmentReader is using too much
>> RAM to hold the terms index".  E.g. from this screenshot it looks like
>> there are 88 fields totaling ~46 MB, so ~0.5 MB per indexed field ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein <vfunstein@gmail.com>
>> wrote:
>> > Here's the link:
>> >
>> > https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>> >
>> > I'm indexing, let's say, 11 unique fields per document. Also, the NRT reader
>> > is opened continually, and "regular" searches use that one. But a special
>> > kind of feature allows searching a particular point in time (they get
>> > cleaned out based on some other logic), which requires opening a non-NRT
>> > reader just to service such search requests - in my understanding no
>> > segment readers for this reader can be shared with the NRT reader's pool...
>> > or am I off here? This seems evident from another heap dump fragment that
>> > shows a full new set of segment readers attached to that "temporary" reader:
>> >
>> > https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>> >
>> >
>> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> > lucene@mikemccandless.com> wrote:
>> >
>> >> Hmm screen shot didn't make it ... can you post link?
>> >>
>> >> If you are using NRT reader then when a new one is opened, it won't
>> >> open new SegmentReaders for all segments, just for newly
>> >> flushed/merged segments since the last reader was opened.  So for your
>> >> N commit points that you have readers open for, they will be sharing
>> >> SegmentReaders for segments they have in common.
>> >>
>> >> How many unique fields are you adding?
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunstein@gmail.com>
>> >> wrote:
>> >> > Mike,
>> >> >
>> >> > Here's the screenshot; not sure if it will go through as an attachment
>> >> > though - if not, I'll post it as a link. Please ignore the altered package
>> >> > names, since Lucene is shaded in as part of our build process.
>> >> >
>> >> > Some more context about the use case. Yes, the terms are pretty much
>> >> > unique; the schema for the data set is actually borrowed from here:
>> >> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
>> >> > set, with a couple of other fields added by us. The values for the fields
>> >> > are generated almost randomly, though some string fields are picked at
>> >> > random from a fixed dictionary.
>> >> >
>> >> > Also, this type of heap footprint might be tolerable if it stayed
>> >> > relatively constant throughout the system's life cycle (of course, given
>> >> > the index set stays more or less static). However, what happens here is
>> >> > that one IndexReader reference is maintained by ReaderManager as an NRT
>> >> > reader. But we would also like to support the ability to execute searches
>> >> > against specific index commit points, ideally in parallel. As you might
>> >> > imagine, as soon as a new DirectoryReader is opened at a given commit, a
>> >> > whole new set of SegmentReader instances is created and populated,
>> >> > effectively doubling the already large heap usage... if there were a way
>> >> > to somehow reuse readers for unchanged segments already pooled by
>> >> > IndexWriter, that would help tremendously here. But I don't think there's
>> >> > a way to link up the two sets, at least not in the Lucene version we are
>> >> > using (4.6.1) - is this correct?
>> >> >
>> >> >
>> >> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>> >> > <lucene@mikemccandless.com> wrote:
>> >> >>
>> >> >> This is surprising: unless you have an excessive number of unique
>> >> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>> >> >>
>> >> >> But you only have 12 unique fields?
>> >> >>
>> >> >> Can you post screen shots of the heap usage?
>> >> >>
>> >> >> Mike McCandless
>> >> >>
>> >> >> http://blog.mikemccandless.com
>> >> >>
>> >> >>
>> >> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <vfunstein@gmail.com>
>> >> >> wrote:
>> >> >> > This is a follow-up to the earlier thread I started to understand
>> >> >> > memory usage patterns of SegmentReader instances, but I decided to
>> >> >> > create a separate post since this issue is much more serious than the
>> >> >> > heap overhead created by use of stored field compression.
>> >> >> >
>> >> >> > Here is the use case, once again. The index totals around 300M
>> >> >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float
>> >> >> > fields, which are both indexed and stored. It is split into 4 shards,
>> >> >> > which are basically separate indices... if that matters. After the
>> >> >> > index is populated (but not optimized, since we don't do that), the
>> >> >> > overall heap usage taken up by Lucene is over 1 GB, much of which is
>> >> >> > taken up by instances of BlockTreeTermsReader. For instance, for the
>> >> >> > largest segment in one such index, the retained heap size of the
>> >> >> > internal tree map is around 50 MB. This is evident from heap dump
>> >> >> > analysis, which I have screenshots of that I can post here, if that
>> >> >> > helps. As there are many segments of various sizes in the index, as
>> >> >> > expected, the total heap usage for one shard stands at around 280 MB.
>> >> >> >
>> >> >> > Could someone shed some light on whether this is expected, and if
>> >> >> > so - how could I possibly trim down memory usage here? Is there a way
>> >> >> > to switch to a different terms index implementation, one that doesn't
>> >> >> > preload all the terms into RAM, or only does this partially, i.e. as
>> >> >> > a cache? I'm not sure if I'm framing my questions correctly, as I'm
>> >> >> > obviously not an expert on Lucene's internals, but this is going to
>> >> >> > become a critical issue for large-scale use cases of our system.
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >>
>> >> >
>> >> >
>> >> >
>> >>
>> >>
>> >>
>>
>>
>>

