ignite-user mailing list archives

From eugene miretsky <eugene.miret...@gmail.com>
Subject Re: Node keeps crashing under load
Date Fri, 07 Sep 2018 17:01:09 GMT
Hi all,

Can somebody please provide some pointers on what could be the issue or how
to debug it? We have a fairly large Ignite use case, but cannot go ahead
with a POC because of these crashes.

Cheers,
Eugene



On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <eugene.miretsky@gmail.com>
wrote:

> Also, I don't want to spam the mailing list with more threads, but I get the
> same stability issue when writing to Ignite from Spark. The log file from the
> crashed node (not the same node as before, probably random) is attached.
>
> I have also attached a GC log from another node (I have GC logging
> enabled on only one node).
>
>
> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Thanks Denis,
>>
>> The execution plan and all logs from right after the crash are attached.
>>
>> Cheers,
>> Eugene
>>  nohup.out
>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>
>>
>>
>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dmagda@apache.org> wrote:
>>
>>> Eugene,
>>>
>>> Please share full logs from all the nodes and execution plan for the
>>> query. That's what the community usually needs to help with
>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>
>>> --
>>> Denis
>>>
>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a medium cluster set up for testing: 3 x r4.8xlarge EC2 nodes.
>>>> It has persistence enabled and zero backups.
>>>> - Full configs are attached.
>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>
>>>> The table has 145M rows and takes up about 180G of memory.
>>>> I am testing two things:
>>>> 1) Writing SQL tables from Spark
>>>> 2) Performing large SQL queries (from the web console), for example: SELECT
>>>> COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12'
>>>> GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>
>>>> Most of the time when I run the query, it fails after one of the nodes
>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>> also have similar stability issues when writing from Spark: at some point,
>>>> one of the nodes crashes. All I can see in the logs is
>>>>
>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>> system error detected. Will be handled accordingly to configured handler
>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>> [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>> Ignite node is in invalid state due to a critical failure.
>>>>
>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>
>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>
>>>> My questions are:
>>>> 1) What is causing the issue?
>>>> 2) How can I debug it better?
>>>>
>>>> The rate of crashes and our inability to debug them are becoming
>>>> quite a concern.
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>>
>>>>
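
For anyone triaging similar logs, here is a minimal sketch (not part of the original thread) of pulling the failure type and timestamp out of Ignite log lines like the ones quoted above, so they can be lined up against the GC log. The function name find_failures and both regular expressions are assumptions written against the log format shown in this message only; they assume each log entry sits on a single line in the real file, unlike the wrapped copy in the email.

```python
import re

# Assumed pattern for the "FailureContext [type=..., err=...]" fragment
# seen in the quoted log lines; not taken from any Ignite documentation.
FAILURE_RE = re.compile(r"FailureContext\s*\[type=(\w+),\s*err=([^\]]*)\]")

# Assumed pattern for the leading "[HH:MM:SS,mmm]" timestamp.
TIMESTAMP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}),\d{3}\]")


def find_failures(lines):
    """Return a list of (timestamp, failure_type, err) tuples.

    One tuple is produced for every line that contains a FailureContext.
    The timestamp is taken from the most recent line that started with one,
    or None if no timestamp has been seen yet.
    """
    results = []
    last_ts = None
    for line in lines:
        ts = TIMESTAMP_RE.match(line)
        if ts:
            last_ts = ts.group(1)
        m = FAILURE_RE.search(line)
        if m:
            results.append((last_ts, m.group(1), m.group(2)))
    return results


if __name__ == "__main__":
    import sys

    # Usage: python find_failures.py path/to/ignite.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for ts, ftype, err in find_failures(f):
            print(ts, ftype, err)
```

The timestamps this prints are a starting point for checking whether a long GC pause or a network problem shows up just before the node was stopped; the script itself does not diagnose the cause.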
