ignite-user mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Node keeps crashing under load
Date Mon, 10 Sep 2018 16:04:21 GMT
Hello!

I can see a lot of errors like this one:

[04:05:29,268][INFO][tcp-comm-worker-#1%Server%][ZookeeperDiscoveryImpl]
Created new communication error process future
[errNode=598e3ead-99b8-4c49-b7df-04d578dcbf5f, err=class
org.apache.ignite.IgniteCheckedException: Failed to connect to node (is
node still alive?). Make sure that each ComputeTask and cache Transaction
has a timeout set in order to prevent parties from waiting forever in case
of network issues [nodeId=598e3ead-99b8-4c49-b7df-04d578dcbf5f,
addrs=[ip-172-17-0-1.ap-south-1.compute.internal/172.17.0.1:47100,
ip-172-21-85-213.ap-south-1.compute.internal/172.21.85.213:47100,
/0:0:0:0:0:0:0:1%lo:47100, /127.0.0.1:47100]]]

I think the problem is that you have two nodes that both advertise the
address 172.17.0.1, but it refers to a different network on each of them
(totally unrelated private nets).

Try to specify your external address (such as 172.21.85.213) with
TcpCommunicationSpi.setLocalAddress() on each node.
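
As a rough sketch of how the address could be chosen, the helper below walks a node's address list and skips loopback addresses and any address known to be duplicated across hosts. The class name LocalAddressPicker, the method pickExternal and the "ambiguous" set are hypothetical names for illustration and are not part of Ignite; the sample addresses are taken from the log above. It uses only the JDK:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Ignite.
public class LocalAddressPicker {
    // Returns the first candidate that is neither loopback/wildcard nor
    // listed in 'ambiguous' (addresses that exist on more than one host).
    public static Optional<String> pickExternal(List<String> candidates, Set<String> ambiguous) {
        for (String candidate : candidates) {
            try {
                // For IP literals this only parses the text; no DNS lookup is made.
                InetAddress addr = InetAddress.getByName(candidate);
                if (addr.isLoopbackAddress() || addr.isAnyLocalAddress())
                    continue;
                if (ambiguous.contains(addr.getHostAddress()))
                    continue;
                return Optional.of(addr.getHostAddress());
            } catch (UnknownHostException e) {
                // Unparseable or unresolvable entry: skip it.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Addresses reported for the failing node in the log above.
        List<String> addrs = Arrays.asList(
            "172.17.0.1", "172.21.85.213", "0:0:0:0:0:0:0:1", "127.0.0.1");
        // Assumption: 172.17.0.1 is the address shared by both hosts.
        Set<String> ambiguous = Collections.singleton("172.17.0.1");
        System.out.println(pickExternal(addrs, ambiguous).orElse("none"));
    }
}
```

The resulting string is what you would pass to TcpCommunicationSpi.setLocalAddress() before starting the node; setting the same property in the Spring XML config should work as well.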

Regards,
-- 
Ilya Kasnacheev


Fri, Sep 7, 2018 at 20:01, eugene miretsky <eugene.miretsky@gmail.com>:

> Hi all,
>
> Can somebody please provide some pointers on what could be the issue or
> how to debug it? We have a fairly large Ignite use case, but cannot go
> ahead with a POC because of these crashes.
>
> Cheers,
> Eugene
>
>
>
> On Fri, Aug 31, 2018 at 11:52 AM eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Also, don't want to spam the mailing list with more threads, but I get
>> the same stability issue when writing to Ignite from Spark. Logfile from
>> the crashed node (not same node as before, probably random) is attached.
>>
>> I have also attached a GC log from another node (I have GC logging
>> enabled only on one node).
>>
>>
>> On Fri, Aug 31, 2018 at 11:23 AM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Denis,
>>>
>>> Execution plan + all logs right after the crash are attached.
>>>
>>> Cheers,
>>> Eugene
>>>  nohup.out
>>> <https://drive.google.com/file/d/10TvQOYgOAJpedrTw3IQ5ABwlW5QpK1Bc/view?usp=drive_web>
>>>
>>>
>>>
>>> On Fri, Aug 31, 2018 at 1:53 AM Denis Magda <dmagda@apache.org> wrote:
>>>
>>>> Eugene,
>>>>
>>>> Please share full logs from all the nodes and execution plan for the
>>>> query. That's what the community usually needs to help with
>>>> troubleshooting. Also, attach GC logs. Use these settings to gather them:
>>>> https://apacheignite.readme.io/docs/jvm-and-system-tuning#section-detailed-garbage-collection-stats
>>>>
>>>> --
>>>> Denis
>>>>
>>>> On Thu, Aug 30, 2018 at 3:19 PM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a medium cluster set up for testing - 3 x r4.8xlarge EC2
>>>>> nodes. It has persistence enabled, and zero backups.
>>>>> - Full configs are attached.
>>>>> - JVM settings are: JVM_OPTS="-Xms16g -Xmx64g -server
>>>>> -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m  -XX:+AlwaysPreTouch
>>>>> -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC"
>>>>>
>>>>> The table has 145M rows and takes up about 180G of memory.
>>>>> I am testing 2 things:
>>>>> 1) Writing SQL tables from Spark
>>>>> 2) Performing large SQL queries (from the web console), for example:
>>>>> SELECT COUNT(*) FROM (SELECT customer_id FROM MyTable WHERE dt > '2018-05-12'
>>>>> GROUP BY customer_id HAVING SUM(column1) > 2 AND MAX(column2) < 1)
>>>>>
>>>>> Most of the time when I run the query, it fails after one of the nodes
>>>>> crashes (it has finished a few times, and then crashed the next time). I
>>>>> also have similar stability issues when writing from Spark: at some point,
>>>>> one of the nodes crashes. All I can see in the logs is:
>>>>>
>>>>> [21:51:58,548][SEVERE][disco-event-worker-#101%Server%][] Critical
>>>>> system error detected. Will be handled accordingly to configured handler
>>>>> [hnd=class o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext
>>>>> [type=SEGMENTATION, err=null]]
>>>>>
>>>>> [21:51:58,549][SEVERE][disco-event-worker-#101%Server%][FailureProcessor]
>>>>> Ignite node is in invalid state due to a critical failure.
>>>>>
>>>>> [21:51:58,549][SEVERE][node-stopper][] Stopping local node on Ignite
>>>>> failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
>>>>>
>>>>> [21:52:03] Ignite node stopped OK [name=Server, uptime=00:07:06.780]
>>>>>
>>>>> My questions are:
>>>>> 1) What is causing the issue?
>>>>> 2) How can I debug it better?
>>>>>
>>>>> The rate of crashes and our lack of ability to debug them is becoming
>>>>> quite a concern.
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>>>
