lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Possible memory leak after segment merge? (related to DocValues?)
Date Fri, 27 Dec 2013 12:25:21 GMT
Likely this is for field norms, which use doc values under the hood.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston
<gpreston@marinsoftware.com> wrote:
> Does anybody with knowledge of solr internals know why I'm seeing
> instances of Lucene42DocValuesProducer when I don't have any fields
> that are using DocValues?  Or am I misunderstanding what this class is
> for?
>
> -Greg
>
>
> On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston
> <gpreston@marinsoftware.com> wrote:
>> Hello,
>>
>> I'm loading up our solr cloud with data (from a solrj client) and
>> running into a weird memory issue.  I can reliably reproduce the
>> problem.
>>
>> - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
>> - 24 solr nodes (one shard each), spread across 3 physical hosts, each
>> host has 256G of memory
>> - index and tlogs on ssd
>> - Xmx=7G, G1GC
>> - Java 1.7.0_25
>> - schema and solrconfig.xml attached
>>
>> I'm using composite routing to route documents with the same clientId
>> to the same shard.  After several hours of indexing, I occasionally
>> see an IndexWriter go OOM.  I think that's a symptom.  When that
>> happens, indexing continues, and that node's tlog starts to grow.
>> When I notice this, I stop indexing, and bounce the problem node.
>> That's where it gets interesting.
>>
>> Upon bouncing, the tlog replays, and then segments merge.  Once the
>> merging is complete, the heap is fairly full, and forced full GC only
>> helps a little.  But if I then bounce the node again, the heap usage
>> goes way down, and stays low until the next segment merge.  I believe
>> segment merges are also what causes the original OOM.
>>
>> More details:
>>
>> Index on disk for this node is ~13G, tlog is ~2.5G.
>> See attached mem1.png.  This is a jconsole view of the heap during the
>> following:
>>
>> (Solr cloud node started at the left edge of this graph)
>>
>> A) One CPU core pegged at 100%.  Thread dump shows:
>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
>> nid=0x7a74 runnable [0x00007f5a41c5f000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
>>         at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
>>         at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
>>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
>>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
>> memory freed.  Thread dump shows:
>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
>> nid=0x7a74 runnable [0x00007f5a41c5f000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
>> freed.  Thread dump shows:
>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
>> nid=0x7a74 runnable [0x00007f5a41c5f000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> D) One CPU core pegged at 100%.  Thread dump shows:
>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
>> nid=0x7a74 runnable [0x00007f5a41c5f000]
>>    java.lang.Thread.State: RUNNABLE
>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>
>> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge Thread.
>>
>> F) Manually triggered full GC.  Some memory freed, but much remains.
>>
>> G) Restarted solr.  Very little memory used.
>>
>>
>> Throughout all of this, there was no indexing or querying happening.
>> In ordered to try to determine what's using up the memory, I took a
>> heap dump at point (F) and analyzed it in Eclipse MAT (see attached
>> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
>> each holding a large byte[].  By attaching a remote debugger and
>> re-running, it looks like there is one of these byte[] for each field
>> in the schema (we have several of the "dim_*" dynamic fields).  And as
>> far as I know, I'm not using DocValues at all.
>>
>>
>> Any hints as to what might be going on here would be greatly
>> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
>> willing to try things.  I don't know enough about the internals of
>> solr's memory usage to proceed much further on my own.
>>
>> Thank you.
>>
>> -Greg

Mime
View raw message