lucene-java-user mailing list archives

From craiglan...@gmail.com
Subject Re: BlockTreeTermsReader consumes crazy amount of memory
Date Thu, 28 Aug 2014 22:48:35 GMT
if (commit != null) {
    return doOpenFromCommit(commit);
}
Sent from my BlackBerry® smartphone

-----Original Message-----
From: Michael McCandless <lucene@mikemccandless.com>
Date: Thu, 28 Aug 2014 14:25:11 
To: Lucene Users<java-user@lucene.apache.org>
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

You can actually use DirectoryReader.openIfChanged(latestNRTReader,
IndexCommit): this should pull/share SegmentReaders from the pool
inside IW, when available.  But it will fail to share e.g. a
SegmentReader that is no longer part of the NRT view but is still
shared by e.g. two "back in time" readers.

Really we need to factor out the reader pooling somehow, such that IW
is one user of its NRT pool, but commit-point readers could also more
easily use a shared pool.
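A minimal sketch of the call Mike suggests (Lucene 4.x API; the helper and variable names are illustrative, not from the thread):

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;

// Illustrative helper: open a "back in time" view via the current NRT
// reader, so SegmentReaders pooled inside IndexWriter can be shared
// for segments the two views have in common.
class BackInTimeReaders {
  static DirectoryReader openFromNrt(DirectoryReader latestNrtReader, IndexCommit commit)
      throws IOException {
    DirectoryReader r = DirectoryReader.openIfChanged(latestNrtReader, commit);
    if (r == null) {
      // openIfChanged returns null when there is nothing to reopen;
      // keep using the current reader (incRef so both handles can be closed).
      latestNrtReader.incRef();
      return latestNrtReader;
    }
    return r;
  }
}
```

Per the caveat above, sharing only works for segments still known to the writer's pool; segments held only by other "back in time" readers get fresh SegmentReaders.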

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 28, 2014 at 2:07 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi,
>
> if you open the 2nd instance (the "back in time" reader) using DirectoryReader.open(IndexCommit),
> then it of course has nothing in common with the IndexWriter, so how can they share the
> SegmentReaders?
>
> NRT readers from DirectoryReader.open(IndexWriter) are cached inside IndexWriter, but a
> completely separate DirectoryReader on an older commit point opens all segments on its
> own. Maybe a solution would be to extend DirectoryReader.open(IndexWriter) to also take a
> commit point.
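For contrast, the fully independent path Uwe describes might look like this (sketch; Lucene 4.x API, helper name is mine):

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.Directory;

// Illustrative helper: a reader opened directly from a commit point.
// It shares nothing with the IndexWriter's reader pool - every segment
// gets its own fresh SegmentReader, which is where the duplicated heap
// discussed in this thread comes from.
class CommitReaders {
  static DirectoryReader openAtCommit(Directory dir, int nthCommit) throws IOException {
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    return DirectoryReader.open(commits.get(nthCommit));
  }
}
```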
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Vitaly Funstein [mailto:vfunstein@gmail.com]
>> Sent: Thursday, August 28, 2014 7:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
>>
>> Here's the link:
>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>>
>> I'm indexing, let's say, 11 unique fields per document. Also, the NRT reader is
>> opened continually, and "regular" searches use that one. But a special kind of
>> feature allows searching a particular point in time (those readers get cleaned
>> out based on some other logic), which requires opening a non-NRT reader just to
>> service such search requests - in my understanding, no segment readers for
>> this reader can be shared with the NRT reader's pool... or am I off here?
>> This seems evident from another heap dump fragment that shows a full new
>> set of segment readers attached to that "temporary" reader:
>>
>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>>
>>
>> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>> > Hmm screen shot didn't make it ... can you post link?
>> >
>> > If you are using NRT reader then when a new one is opened, it won't
>> > open new SegmentReaders for all segments, just for newly
>> > flushed/merged segments since the last reader was opened.  So for your
>> > N commit points that you have readers open for, they will be sharing
>> > SegmentReaders for segments they have in common.
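The NRT reopen pattern being described, roughly (sketch; Lucene 4.x API, names are mine):

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

// Illustrative helper: reopening an NRT reader. Only segments flushed or
// merged since the previous reader was opened get new SegmentReaders;
// unchanged segments are shared with the old reader.
class NrtRefresh {
  static DirectoryReader refresh(DirectoryReader old, IndexWriter writer) throws IOException {
    DirectoryReader fresh = DirectoryReader.openIfChanged(old, writer, true);
    if (fresh == null) {
      return old; // nothing changed since the last open
    }
    old.close(); // or decRef(), if other threads may still hold it
    return fresh;
  }
}
```

In practice ReaderManager (which the thread mentions later) wraps exactly this reopen-and-release bookkeeping.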
>> >
>> > How many unique fields are you adding?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunstein@gmail.com>
>> > wrote:
>> > > Mike,
>> > >
>> > > Here's the screenshot; not sure if it will go through as an
>> > > attachment though - if not, I'll post it as a link. Please ignore
>> > > the altered package names, since Lucene is shaded in as part of our
>> > > build process.
>> > >
>> > > Some more context about the use case. Yes, the terms are pretty much
>> > > unique; the schema for the data set is actually borrowed from here:
>> > > the schema for the data set is actually borrowed from here:
>> > > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
>> > > UserVisits set, with a couple of other fields added by us. The
>> > > values for the fields are generated almost randomly, though some
>> > > string fields are picked at random from a fixed dictionary.
>> > >
>> > > Also, this type of heap footprint might be tolerable if it stayed
>> > > relatively constant throughout the system's life cycle (of course,
>> > > given that the index set stays more or less static). However, what
>> > > happens here is that one IndexReader reference is maintained by
>> > > ReaderManager as an NRT reader. But we also would like to support an
>> > > ability to execute searches against specific index commit points,
>> > > ideally in parallel. As you might imagine, as soon as a new
>> > > DirectoryReader is opened at a given commit, a whole new set of
>> > > SegmentReader instances is created and populated, effectively
>> > > doubling the already large heap usage... if there was a way to
>> > > somehow reuse readers for unchanged segments already pooled by
>> > > IndexWriter, that would help tremendously here. But I don't think
>> > > there's a way to link up the two sets, at least not in the Lucene
>> > > version we are using (4.6.1) - is this correct?
>> > >
>> > >
>> > > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>> > > <lucene@mikemccandless.com> wrote:
>> > >>
>> > >> This is surprising: unless you have an excessive number of unique
>> > >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>> > >>
>> > >> But you only have 12 unique fields?
>> > >>
>> > >> Can you post screen shots of the heap usage?
>> > >>
>> > >> Mike McCandless
>> > >>
>> > >> http://blog.mikemccandless.com
>> > >>
>> > >>
>> > >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein
>> > >> <vfunstein@gmail.com>
>> > >> wrote:
>> > >> > This is a follow-up to the earlier thread I started to understand
>> > >> > memory usage patterns of SegmentReader instances, but I decided to
>> > >> > create a separate post since this issue is much more serious than
>> > >> > the heap overhead created by use of stored field compression.
>> > >> >
>> > >> > Here is the use case, once again. The index totals around 300M
>> > >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float
>> > >> > fields, which are both indexed and stored. It is split into 4
>> > >> > shards, which are basically separate indices... if that matters.
>> > >> > After the index is populated (but not optimized, since we don't do
>> > >> > that), the overall heap usage taken up by Lucene is over 1 GB,
>> > >> > much of which is taken up by instances of BlockTreeTermsReader.
>> > >> > For instance, for the largest segment in one such index, the
>> > >> > retained heap size of the internal tree map is around 50 MB. This
>> > >> > is evident from heap dump analysis, which I have screenshots of
>> > >> > that I can post here, if that helps. As there are many segments of
>> > >> > various sizes in the index, as expected, the total heap usage for
>> > >> > one shard stands at around 280 MB.
>> > >> >
>> > >> > Could someone shed some light on whether this is expected, and if
>> > >> > so - how could I possibly trim down memory usage here? Is there a
>> > >> > way to switch to a different terms index implementation, one that
>> > >> > doesn't preload all the terms into RAM, or only does this
>> > >> > partially, i.e. as a cache? I'm not sure if I'm framing my
>> > >> > questions correctly, as I'm obviously not an expert on Lucene's
>> > >> > internals, but this is going to become a critical issue for large
>> > >> > scale use cases of our system.
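One knob that does exist in Lucene 4.x for trimming the terms-index heap is the BlockTree block size: larger blocks mean fewer index entries held in the in-heap FST, at some term-seek cost. A hedged sketch (the 64/128 numbers are illustrative, not a recommendation; only newly written segments pick this up):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Illustrative tuning: raise the BlockTree min/max term-block sizes
// (defaults are 25/48) via a custom codec, so the in-heap terms index
// holds fewer entries per segment.
class TermsIndexTuning {
  static IndexWriterConfig withLargerTermBlocks(Analyzer analyzer) {
    final PostingsFormat fmt = new Lucene41PostingsFormat(64, 128);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    iwc.setCodec(new Lucene46Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        return fmt;
      }
    });
    return iwc;
  }
}
```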
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> > >>
>> > >
>> > >
>> > >
>> >
>> >
>> >
>
>
>

