hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10649) LLAP: AM gets stuck completely if one node is dead
Date Thu, 07 May 2015 22:03:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533466#comment-14533466
] 

Gopal V commented on HIVE-10649:
--------------------------------

This is the expected behaviour for stale nodes in the Registry - task gets scheduled onto
nodes and the nodes which fail with COMM_ERROR gets blacklisted after 1 failure.

The hadoop-ipc retries 50 times internally before exiting with an error - I run with the max
retries set to 3 to get the failure modes to work in under a minute or so. That's a way to
get to blacklisting in 3 seconds instead.

> LLAP: AM gets stuck completely if one node is dead
> --------------------------------------------------
>
>                 Key: HIVE-10649
>                 URL: https://issues.apache.org/jira/browse/HIVE-10649
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> See HIVE-10648.
> When AM cannot connect to a node, that appears to cause it to stall; example log, there
are no other interleaving logs even though this is happening in the middle of Map 1 on TPCH
q1, i.e. there are plenty of tasks scheduled.
> From "Assigning" messages I can also see tasks are scheduled to all the nodes before
and after the pause, not just to the problematic node. 
> LLAP daemons have corresponding gaps where between two fragments nothing is ran for a
long time on any daemon.
> {noformat}
> 2015-05-07 12:13:46,679 INFO [Dispatcher thread: Central] impl.TaskImpl: task_1429683757595_0784_1_00_000276
Task Transitioned from SCHEDULED to RUNNING due to event T_ATTEMPT_LAUNCHED
> 2015-05-07 12:13:46,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 10 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:46,955 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting
to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:47,811 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 11 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:48,812 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 12 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:49,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 13 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:50,813 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 14 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:51,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 15 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:52,814 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 16 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:53,815 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 17 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:54,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 18 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:55,816 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 19 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:56,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 20 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:56,971 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting
to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:13:57,817 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 21 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:58,818 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 22 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:13:59,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 23 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:00,819 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 24 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:01,820 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 25 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:02,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 26 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:03,821 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 27 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:04,822 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 28 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:05,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 29 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:06,823 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 30 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:06,984 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting
to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:07,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 31 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:08,824 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 32 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:09,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 33 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:10,825 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 34 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:11,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 35 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:12,826 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 36 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:13,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 37 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:14,827 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 38 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:15,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 39 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:16,828 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 40 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:16,996 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegistryImpl: Starting
to refresh ServiceInstanceSet 1611673583
> 2015-05-07 12:14:17,829 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 41 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:18,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 42 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:19,830 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 43 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:20,831 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 44 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:21,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 45 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:22,832 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 46 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:23,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 47 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:24,833 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 48 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:25,834 INFO [TaskCommunicator # 3] ipc.Client: Retrying connect to server:
cn059-10.l42scl.hortonworks.com/172.19.128.59:15001. Already tried 49 time(s); retry policy
is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-05-07 12:14:25,836 INFO [TaskCommunicator # 3] tezplugins.LlapTaskCommunicator:
Unable to run task: attempt_1429683757595_0784_1_00_000017_0 on containerId: container_222212222_0784_01_000018,
Communication Error
> 2015-05-07 12:14:25,841 INFO [Dispatcher thread: Central] history.HistoryEventHandler:
[HISTORY][DAG:dag_1429683757595_0784_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1429683757595_0784_1_00_000017_0,
startTime=1431026014322, finishTime=1431026065838, timeTaken=51516, status=KILLED, errorEnum=COMMUNICATION_ERROR,
diagnostics=Communication Error, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter,
DATA_LOCAL_TASKS=1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message