lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vitaly Funstein <vfunst...@gmail.com>
Subject Re: BlockTreeTermsReader consumes crazy amount of memory
Date Wed, 10 Sep 2014 01:09:18 GMT
Okay, created LUCENE-5931 for this. As it turns out, my original test
actually does do deletes on the index so please disregard my question about
segment merging.


On Tue, Sep 9, 2014 at 3:00 PM, <vfunstein@gmail.com> wrote:

> I'm on 4.6.1. I'll file an issue for sure, but is there a workaround you
> could think of in the meantime? As you probably remember, the reason for
> doing this in the first place was to prevent the catastrophic heap
> exhaustion when SegmentReader instances are opened from scratch for every
> new IndexReader created.
>
> Coming to think of it, I think this issue can be observed even when the
> "old" reader is not NRT but even another commit point.
>
> But just to check my understanding of Lucene in general - was my hunch
> correct in that deletes can be present as a result of segment merging and
> not actual user-driven index updates?
>
> > On Sep 9, 2014, at 10:46 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
> >
> > Hmm, which Lucene version are you using?  We recently beefed up the
> > checking in this code, so you ought to be hitting an exception in
> > newer versions.
> >
> > But that being said, I think the bug is real: if you try to reopen
> > from a newer NRT reader down to an older (commit point) reader then
> > you can hit this.
> >
> > Can you open an issue and maybe post a test case showing it?  Thanks.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> >> On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vfunstein@gmail.com>
> wrote:
> >> I think I see the bug here, but maybe I'm wrong. Here's my theory:
> >>
> >> Suppose no segments at a particular commit point contain any deletes.
> Now,
> >> we also hold open an NRT reader into the index, which may end up with
> some
> >> deletes, after the commit occurred. Then, according to the following
> >> conditional in StandardDirectoryReader, we shall get into the two arg
> ctor
> >> of SegmentReader:
> >>
> >>            if (newReaders[i].getSegmentInfo().getDelGen() ==
> >> infos.info(i).getDelGen())
> >> {
> >>              // only DV updates
> >>              newReaders[i] = new SegmentReader(infos.info(i),
> >> newReaders[i], newReaders[i].getLiveDocs(), newReaders[i].numDocs());
> >>            } else {
> >>              // both DV and liveDocs have changed
> >>              newReaders[i] = new SegmentReader(infos.info(i),
> >> newReaders[i]);
> >>            }
> >>
> >> That constructor looks like this:
> >>
> >>  SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws
> IOException {
> >>    this(si, sr,
> >>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir,
> si,
> >> IOContext.READONCE),
> >>         si.info.getDocCount() - si.getDelCount());
> >>  }
> >>
> >> At this point, the SegmentInfo we're tryng to read live docs on is from
> the
> >> commit point, and if there weren't any deletes, then the following
> results
> >> in a null for the relevant file name, in
> >> Lucene40LiveDocsFormat.readLiveDocs():
> >>
> >>    String filename = IndexFileNames.fileNameFromGeneration(
> info.info.name,
> >> DELETES_EXTENSION, info.getDelGen());
> >>    final BitVector liveDocs = new BitVector(dir, filename, context);
> >>
> >> This is where filename ends up being null, which gets passed all the way
> >> down to the File constructor.
> >>
> >> In a nutshell, I think the bug is that it is assumed that the segments
> from
> >> commit point have deletes, when they may not, yet the original
> >> SegmentReader for the segment that we are trying to reuse does.
> >>
> >> What I am not quite clear yet is how we arrive at this point, because
> the
> >> test that causes the exception doesn't do any deletes/updates... could
> >> deletes occur as a result of a segment merge? This might explain the
> >> sporadic nature of this exception, since merge timings aren't
> deterministic.
> >>
> >>
> >> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vfunstein@gmail.com>
> >> wrote:
> >>
> >>> UPDATE:
> >>>
> >>> After making the changes we discussed to enable sharing of
> SegmentReaders
> >>> between the NRT reader and a commit point reader, specifically calling
> >>> through to DirectoryReader.openIfChanged(DirectoryReader,
> IndexCommit), I
> >>> am seeing this exception, sporadically:
> >>>
> >>> Caused by: java.lang.NullPointerException
> >>>        at java.io.File.<init>(File.java:305)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
> >>>
> >>> Looking at the source quickly, it appears the child argument to the
> File
> >>> ctor is null; is it somehow possible that the segment infos in the
> commit
> >>> point wasn't fully written out somehow, on prior commit? Sounds
> unlikely,
> >>> yet disturbing... but nothing else has changed in my code, i.e. the way
> >>> commits are performed and indexes are reopened.
> >>>
> >>>
> >>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
> >>> lucene@mikemccandless.com> wrote:
> >>>
> >>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunstein@gmail.com
> >
> >>>> wrote:
> >>>>> On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
> >>>>> lucene@mikemccandless.com> wrote:
> >>>>>
> >>>>>>
> >>>>>> The segments_N file can be different, that's fine: after that,
we
> then
> >>>>>> re-use SegmentReaders when they are in common between the two
commit
> >>>>>> points.  Each segments_N file refers to many segments...
> >>>>> Yes, you are totally right - I didn't follow the code far enough
the
> >>>> first
> >>>>> time around. :) This is an excellent idea, actually - I can probably
> >>>>> arrange maintained commit points as an MRU data structure (e.g.
> >>>>> LinkedHashMap with access order), and simply grab the most recently
> >>>> opened
> >>>>> reader to pass in when obtaining a new one from the new commit point
> -
> >>>> to
> >>>>> maximize segment reader reuse.
> >>>>
> >>>> That's great!
> >>>>
> >>>>>> You can set it (min and max) as high as you want; the only hard
> >>>>>> requirement is that max >= 2*(min-1), I believe.
> >>>>>
> >>>>> Looks like this is used inside Lucene41PostingsFormat, which simply
> >>>> passes
> >>>>> in those defaults - so you are effectively saying the minimum (and
> >>>>> therefore, maximum) block size can be raised to reuse the size of
the
> >>>> terms
> >>>>> index inside those TreeMap nodes?
> >>>>
> >>>> Yes, but it then increases cost at search time to locate a given term,
> >>>> because more scanning is then required once we seek to the block that
> >>>> might have the term.
> >>>>
> >>>> This reduces the size of the FST, but if RAM is being used by
> >>>> something else inside BT, it won't help.  But from your screen shot
it
> >>>> looked like it was almost entirely the FST, which is what I would
> >>>> expect.
> >>>>
> >>>>>>> We are already using a customized codec though, so perhaps
adding
> >>>>>>> this to the codec is okay and transparent?
> >>>>>>
> >>>>>> Hmmm :)  Customized in what manner?
> >>>>> We need to have the ability to turn off stored fields compression,
so
> >>>> there
> >>>>> is one codec in case the system is configured that way. The other
one
> >>>>> exists for compression on, but there I tweaked stored fields format
> for
> >>>>> bias toward decompression, as well as a smaller chunk size - based
on
> >>>> some
> >>>>> empirical observations in executed tests. I am guessing I'll just
add
> >>>>> another customization to both that deals with the block sizing for
> >>>> postings
> >>>>> format, and see what difference that makes...
> >>>>
> >>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
> >>>>
> >>>> Mike McCandless
> >>>>
> >>>> http://blog.mikemccandless.com
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message