From: Erick Erickson <erickerick...@gmail.com>
Subject: Re: OutOfMemoryError in 6.5.1
Date: Tue, 21 Nov 2017 17:13:23 GMT
Walter:

Yeah, I've seen this on occasion. IIRC, the OOM exception will be
specific to running out of stack space, or at least slightly different
than the "standard" OOM error. That would be the "smoking gun" for too
many threads....
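
If memory serves, the two variants look roughly like this (exact wording varies by JVM):

    java.lang.OutOfMemoryError: Java heap space
    java.lang.OutOfMemoryError: unable to create new native thread

The second one is the flavor you tend to get when thread creation, rather than the heap
itself, is what blew up.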

Erick

On Tue, Nov 21, 2017 at 9:00 AM, Walter Underwood <wunder@wunderwood.org> wrote:
> I do have one theory about the OOM. The server is running out of memory because there
> are too many threads. Instead of queueing up the overload in the load balancer, it is queued
> up as new threads waiting to run. Setting solr.jetty.threads.max to 10,000 guarantees this
> will happen under overload.
>
> New Relic shows this clearly. CPU hits 100% at 15:40, thread count and load average start
> climbing. At 15:43, it reaches 3000 threads and starts throwing OOM. After that, the server
> is in a stable congested state.
>
> I understand why the Jetty thread max was set so high, but I think the cure is worse
> than the disease. We’ll run another load benchmark with thread max at something realistic,
> like 200.
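>
> For the record, that should be a one-line change in solr.in.sh, assuming the stock
> jetty.xml (which reads solr.jetty.threads.max for its maxThreads setting):
>
>     # cap Jetty's request thread pool instead of the 10,000 default
>     SOLR_OPTS="$SOLR_OPTS -Dsolr.jetty.threads.max=200"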
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 21, 2017, at 8:17 AM, Walter Underwood <wunder@wunderwood.org> wrote:
>>
>> All our customizations are in solr.in.sh. We’re using the one we configured for
>> 6.3.0. I’ll check for any differences between that and the 6.5.1 script.
>>
>> I don’t see any arguments at all in the dashboard. I do see them in a ps listing,
>> right at the end.
>>
>> java -server -Xms8g -Xmx8g -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m
>> -XX:MaxGCPauseMillis=200 -XX:+UseLargePages -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError
>> -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
>> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/solr/logs/solr_gc.log
>> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -Dcom.sun.management.jmxremote
>> -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
>> -Dcom.sun.management.jmxremote.port=18983 -Dcom.sun.management.jmxremote.rmi.port=18983 -Djava.rmi.server.hostname=new-solr-c01.test3.cloud.cheggnet.com
>> -DzkClientTimeout=15000 -DzkHost=zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud
>> -Dsolr.log.level=WARN -Dsolr.log.dir=/solr/logs -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks
>> -Dhost=new-solr-c01.test3.cloud.cheggnet.com -Duser.timezone=UTC -Djetty.home=/apps/solr6/server
>> -Dsolr.solr.home=/apps/solr6/server/solr -Dsolr.install.dir=/apps/solr6 -Dgraphite.prefix=solr-cloud.new-solr-c01
>> -Dgraphite.host=influx.test.cheggnet.com -javaagent:/apps/solr6/newrelic/newrelic.jar -Dnewrelic.environment=test3
>> -Dsolr.log.muteconsole -Xss256k -Dsolr.log.muteconsole -XX:OnOutOfMemoryError=/apps/solr6/bin/oom_solr.sh
>> 8983 /solr/logs -jar start.jar --module=http
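>>
>> In solr.in.sh terms those flags boil down to roughly this (paraphrasing; GC_TUNE carries
>> the long G1 flag list above):
>>
>>     SOLR_HEAP="8g"
>>     GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m \
>>              -XX:MaxGCPauseMillis=200 -XX:+UseLargePages -XX:+AggressiveOpts"
>>     SOLR_OPTS="$SOLR_OPTS -Xss256k -Dsolr.log.muteconsole"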
>>
>> I’m still confused why we are hitting OOM in 6.5.1 but weren’t in 6.3.0. Our
>> load benchmarks use prod logs. We added suggesters, but those use analyzing infix, so they
>> are backed by on-disk search indexes, not held in memory.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Nov 21, 2017, at 5:46 AM, Shawn Heisey <apache@elyograg.org> wrote:
>>>
>>> On 11/20/2017 6:17 PM, Walter Underwood wrote:
>>>> When I ran load benchmarks with 6.3.0, an overloaded cluster would get super
>>>> slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting OOMs. That is really
>>>> bad, because it means we need to reboot every node in the cluster.
>>>> Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13).
>>>> Using the G1 collector with the Shawn Heisey settings in an 8G heap.
>>> <snip>
>>>> This is not good behavior in prod. The process goes to the bad place, then
>>>> we need to wait until someone is paged and kills it manually. Luckily, it usually drops out
>>>> of the live nodes for each collection and doesn’t take user traffic.
>>>
>>> There was a bug, fixed long before 6.3.0, where the OOM killer script wasn't
>>> working because the arguments enabling it were in the wrong place.  It was fixed in 5.5.1
>>> and 6.0.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-8145
>>>
>>> If the scripts that you are using to get Solr started originated with a much
>>> older version of Solr than you are currently running, maybe you've got the arguments in the
>>> wrong order.
>>>
>>> Do you see the command-line arguments for the OOM killer (only available on *NIX
>>> systems, not Windows) on the admin UI dashboard?  If they are properly placed, you will see
>>> them on the dashboard, but if they aren't properly placed, then you won't see them.  This
>>> is what the argument looks like for one of my Solr installs:
>>>
>>> -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
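>>>
>>> The placement matters because java only treats options that appear before "-jar start.jar"
>>> as JVM options; anything after that point is handed to Jetty as a program argument and the
>>> JVM never sees it. So the hook has to sit on the left side of it, roughly:
>>>
>>>   java <other JVM options> -XX:OnOutOfMemoryError=... -jar start.jar --module=http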
>>>
>>> Something which you probably already know:  If you're hitting OOM, you need a
larger heap, or you need to adjust the config so it uses less memory.  There are no other
ways to "fix" OOM problems.
>>>
>>> Thanks,
>>> Shawn
>>
>
