ignite-issues mailing list archives

From "Ivan Veselovsky (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (IGNITE-4720) Sporadically fails for Hadoop
Date Fri, 17 Feb 2017 18:37:42 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872091#comment-15872091 ]

Ivan Veselovsky edited comment on IGNITE-4720 at 2/17/17 6:37 PM:
------------------------------------------------------------------

Node logs show that some nodes failed and were excluded from the topology during the test.
This happens at nearly the same time the test failures were observed:
{code} 
....
[16:19:00,784][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/172.25.2.17:47103, failureDetectionTimeout=10000]
[16:19:00,788][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] Failed to connect to a remote node (make sure that destination node is alive and operating system firewall is disabled on local and remote hosts) [addrs=[/127.0.0.1:47103, /172.25.2.17:47103]]
[16:19:00,789][WARN ][tcp-comm-worker-#1%null%][TcpCommunicationSpi] TcpCommunicationSpi failed to establish connection to node, node will be dropped from cluster [rmtNode=TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[127.0.0.1, 172.25.2.17], sockAddrs=[/172.25.2.17:47503, /127.0.0.1:47503], discPort=47503, order=4, intOrder=4, lastExchangeTime=1487337395028, loc=false, ver=1.8.3#20170217-sha1:92493562, isClient=false], err=class o.a.i.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each ComputeTask and cache Transaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[/127.0.0.1:47103, /172.25.2.17:47103]], connectErrs=[class o.a.i.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47103, class o.a.i.IgniteCheckedException: Failed to connect to address: /172.25.2.17:47103]]
[16:19:00,795][WARN ][disco-event-worker-#29%null%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=ca2c554d-6d48-4a61-abe5-d1d188cc3f53, addrs=[127.0.0.1, 172.25.2.17], sockAddrs=[/172.25.2.17:47503, /127.0.0.1:47503], discPort=47503, order=4, intOrder=4, lastExchangeTime=1487337395028, loc=false, ver=1.8.3#20170217-sha1:92493562, isClient=false]
[16:19:00,796][INFO ][disco-event-worker-#29%null%][GridDiscoveryManager] Topology snapshot [ver=5, servers=3, clients=0, CPUs=4, heap=4.4GB]
....
{code}

From the logs and configs it appears that igfs:// is used as the default file system, so
we may need to run the same tests with e.g. the file:// file system to exclude IGFS (see the sketch below).
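
A minimal sketch of such an override, assuming the tests build their Hadoop Configuration programmatically (the class name below is hypothetical; fs.defaultFS is the standard Hadoop 2.x key):
{code}
// Sketch only: point the job at the local file system instead of igfs://
// so that IGFS is taken out of the picture.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class LocalFsTestConfig {
    public static Configuration localFsConf() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///"); // was igfs://... in the failing runs
        FileSystem fs = FileSystem.get(conf); // resolves to the local file system
        System.out.println("Default FS: " + fs.getUri());
        return conf;
    }
}
{code}
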
Similar timeout-related failures were observed in Ignite clusters under high load when
running Map-Reduce jobs; special configuration tuning (increased timeouts, etc.) was used
to overcome the problem there (see the tuning sketch below).
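
For reference, a hedged sketch of that kind of tuning done on IgniteConfiguration; the concrete values are illustrative, not the ones used on those clusters:
{code}
// Illustrative values only; tune to the actual load profile.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class TimeoutTuning {
    public static Ignite startTunedNode() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Raise the failure detection timeout (10000 ms in the logs above) so
        // short GC pauses or load spikes do not get nodes dropped from topology.
        cfg.setFailureDetectionTimeout(60_000);

        // Optionally give the communication SPI a larger connect timeout as well.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setConnectTimeout(30_000);
        cfg.setCommunicationSpi(commSpi);

        return Ignition.start(cfg);
    }
}
{code}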

It looks like in the QA tests several map-reduce jobs run on the same cluster consecutively,
so a previous job may affect a subsequent one.


> Sporadically fails for Hadoop
> -----------------------------
>
>                 Key: IGNITE-4720
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4720
>             Project: Ignite
>          Issue Type: Bug
>          Components: hadoop
>    Affects Versions: 1.8
>            Reporter: Irina Zaporozhtseva
>            Assignee: Ivan Veselovsky
>             Fix For: 1.9
>
>
> hadoop example aggregatewordcount under apache ignite hadoop edition grid with 4 nodes
for hadoop-2_6_4 and hadoop-2_7_2:
> aggregatewordcount returns 999712 instead of 1000000



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
