Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Tue, 12 May 2015 21:30:01 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)" <jira@apache.org>
To: dev@hive.apache.org
Message-ID: <JIRA.12829342.1431466153000.93956.1431466201792@Atlassian.JIRA>
In-Reply-To: <JIRA.12829342.1431466153000@Atlassian.JIRA>
References: <JIRA.12829342.1431466153000@Atlassian.JIRA>
 <JIRA.12829342.1431466153350@arcas>
Subject: [jira] [Created] (HIVE-10693) LLAP: DAG got stuck after reducer
 fetch failed
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Sergey Shelukhin created HIVE-10693:
---------------------------------------

             Summary: LLAP: DAG got stuck after reducer fetch failed
                 Key: HIVE-10693
                 URL: https://issues.apache.org/jira/browse/HIVE-10693
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sergey Shelukhin
            Assignee: Siddharth Seth


Internal app ID application_1429683757595_0912, LLAP application_1429683757=
595_0911. If someone without access wants to investigate I'll get the logs.
I've ran into this only once. Feel free to close as not repro, I'll reopen =
if I see again :) I want to make sure some debug info is preserved just in =
case.
Running Q1 - Map 1 w/1000 tasks (in this particular case), followed by Redu=
cer 2 and Reducer 3, 1 task each, IIRC 3 is uber.
Fetch failed with I'd assume some random disturbance in the force:
{noformat}
2015-05-12 13:37:31,056 [fetcher [Map_1] #17()] WARN org.apache.tez.runtime=
.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped: Failed to ver=
ify reply after connecting to cn047-10.l42scl.hortonworks.com:15551 with 1 =
inputs pending
java.net.SocketTimeoutException: Read timed out
       at java.net.SocketInputStream.$$YJP$$socketRead0(Native Method)
       at java.net.SocketInputStream.socketRead0(SocketInputStream.java)
       at java.net.SocketInputStream.read(SocketInputStream.java:150)
       at java.net.SocketInputStream.read(SocketInputStream.java:121)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
       at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:787)
       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpU=
RLConnection.java:1534)
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpUR=
LConnection.java:1439)
       at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getI=
nputStream(HttpConnection.java:256)
       at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc=
herOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:339)
       at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc=
herOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:257)
       at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc=
herOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:167)
       at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc=
herOrderedGrouped.run(FetcherOrderedGrouped.java:182)
{noformat}

AM registered this as Map 1 task failure
{noformat}
2015-05-12 13:37:31,156 INFO [Dispatcher thread: Central] impl.TaskAttemptI=
mpl: attempt_1429683757595_0912_1_00_000998_0 blamed for read error from at=
tempt_1429683757595_0912_1_01_000000_0 at inputIndex 998
...
2015-05-12 13:37:31,174 INFO [Dispatcher thread: Central] impl.TaskImpl: Sc=
heduling new attempt for task: task_1429683757595_0912_1_00_000998, current=
FailedAttempts: 1, maxFailedAttempts: 4
{noformat}

Eventually Map 1 completed
{noformat}
2015-05-12 13:38:25,247 INFO [Dispatcher thread: Central] history.HistoryEv=
entHandler: [HISTORY][DAG:dag_1429683757595_0912_1][Event:VERTEX_FINISHED]:=
 vertexName=3DMap 1, vertexId=3Dvertex_1429683757595_0912_1_00, initRequest=
edTime=3D1431462752913, initedTime=3D1431462754818, startRequestedTime=3D14=
31462754819, startedTime=3D1431462754819, finishTime=3D1431463105101, timeT=
aken=3D350282, status=3DSUCCEEDED, diagnostics=3D, counters=3DCounters: 29,=
 org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=3D59, RACK_LOC=
AL_TASKS=3D941, File System Counters, FILE_BYTES_READ=3D2160704, FILE_BYTES=
_WRITTEN=3D20377550, FILE_READ_OPS=3D0, FILE_LARGE_READ_OPS=3D0, FILE_WRITE=
_OPS=3D0, HDFS_BYTES_READ=3D9798097828287, HDFS_BYTES_WRITTEN=3D0, HDFS_REA=
D_OPS=3D406131, HDFS_LARGE_READ_OPS=3D0, HDFS_WRITE_OPS=3D0, org.apache.tez=
.common.counters.TaskCounter, SPILLED_RECORDS=3D4000, GC_TIME_MILLIS=3D7330=
9, CPU_MILLISECONDS=3D0, PHYSICAL_MEMORY_BYTES=3D-1000, VIRTUAL_MEMORY_BYTE=
S=3D-1000, COMMITTED_HEAP_BYTES=3D25769803776000, INPUT_RECORDS_PROCESSED=
=3D5861038, OUTPUT_RECORDS=3D4000, OUTPUT_BYTES=3D376000, OUTPUT_BYTES_WITH=
_OVERHEAD=3D0, OUTPUT_BYTES_PHYSICAL=3D0, ADDITIONAL_SPILLS_BYTES_WRITTEN=
=3D0, ADDITIONAL_SPILLS_BYTES_READ=3D0, ADDITIONAL_SPILL_COUNT=3D0, HIVE, D=
ESERIALIZE_ERRORS=3D0, RECORDS_IN_Map_1=3D5999989709, RECORDS_OUT_INTERMEDI=
ATE_Map_1=3D4000, vertexStats=3DfirstTaskStartTime=3D1431462757804, firstTa=
sksToStart=3D[ task_1429683757595_0912_1_00_000000 ], lastTaskFinishTime=3D=
1431463105085, lastTasksToFinish=3D[ task_1429683757595_0912_1_00_000999 ],=
 minTaskDuration=3D1743, maxTaskDuration=3D236653, avgTaskDuration=3D6377.3=
34000000002, numSuccessfulTasks=3D1000, shortestDurationTasks=3D[ task_1429=
683757595_0912_1_00_000472 ], longestDurationTasks=3D[ task_1429683757595_0=
912_1_00_000999 ], vertexTaskStats=3D{numFailedTaskAttempts=3D1, numKilledT=
askAttempts=3D0, numCompletedTasks=3D1000, numSucceededTasks=3D1000, numKil=
ledTasks=3D0, numFailedTasks=3D0}
{noformat}

Then AM logged this (not sure if it's relevant)
{noformat}
2015-05-12 13:38:25,249 INFO [Dispatcher thread: Central] container.AMConta=
inerImpl: AMContainer container_222212222_
0912_01_001000 transitioned from RUNNING to IDLE via event C_TA_SUCCEEDED20=
15-05-12 13:38:25,250 INFO [Dispatcher thread: Central] history.HistoryEven=
tHandler: [HISTORY][DAG:dag_142968375759
5_0912_1][Event:CONTAINER_STOPPED]: containerId=3Dcontainer_222212222_0912_=
01_001000, stoppedTime=3D1431463105250, exitStatus=3D0
2015-05-12 13:38:25,250 INFO [Dispatcher thread: Central] container.AMConta=
inerImpl: AMContainer container_222212222_0912_01_001000 transitioned from =
IDLE to STOP_REQUESTED via event C_STOP_REQUEST
2015-05-12 13:38:25,253 INFO [Dispatcher thread: Central] impl.VertexImpl: =
Source task attempt completed for vertex: vertex_1429683757595_0912_1_01 [R=
educer 2] attempt: attempt_1429683757595_0912_1_00_000999_0 with state: SUC=
CEEDED vertexState: RUNNING
2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] impl.DAGImpl: Ver=
tex vertex_1429683757595_0912_1_00 [Map 1] completed., numCompletedVertices=
=3D1, numSuccessfulVertices=3D1, numFailedVertices=3D0, numKilledVertices=
=3D0, numVertices=3D3
2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] impl.DAGImpl: Che=
cking vertices for DAG completion, numCompletedVertices=3D1, numSuccessfulV=
ertices=3D1, numFailedVertices=3D0, numKilledVertices=3D0, numVertices=3D3,=
 commitInProgress=3D0, terminationCause=3Dnull
2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] tezplugins.LlapCo=
ntainerLauncher: DEBUG: Ignoring STOP_REQUEST for event: EventType: CONTAIN=
ER_STOP_REQUEST
2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] container.AMConta=
inerImpl: AMContainer container_222212222_0912_01_001000 transitioned from =
STOP_REQUESTED to STOPPING via event C_NM_STOP_SENT
2015-05-12 13:38:35,086 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegist=
ryImpl: Starting to refresh ServiceInstanceSet 1959961953
13:53 and after that the job was stuck for 4 minutes until I killed it, wit=
h no other log messages aside from refreshes
{noformat}

And nothing happened for 4+ minutes until I killed the job.
Reducer 2 had a running attempt:
{noformat}
2015-05-12 13:34:28,595 INFO [Dispatcher thread: Central] history.HistoryEv=
entHandler: [HISTORY][DAG:dag_1429683757595_0912_1][Event:TASK_ATTEMPT_STAR=
TED]: vertexName=3DReducer 2, taskAttemptId=3Dattempt_1429683757595_0912_1_=
01_000000_0, startTime=3D1431462868595, containerId=3Dcontainer_222212222_0=
912_01_001001, nodeId=3Dcn041-10.l42scl.hortonworks.com:15001, inProgressLo=
gs=3Dhostname:0/node/containerlogs/container_222212222_0912_01_001001/sersh=
e, completedLogs=3Dhttp://cn042-10.l42scl.hortonworks.com:19888/jobhistory/=
logs/cn041-10.l42scl.hortonworks.com:15001/container_222212222_0912_01_0010=
01/v_Reducer 2_attempt_1429683757595_0912_1_01_000000_0/sershe
{noformat}

It looks like it actually completed on LLAP
{noformat}
2015-05-12 13:38:25,608 [TezTaskRunner_attempt_1429683757595_0912_1_01_0000=
00_0(container_222212222_0912_01_001001_sershe_20150512133230_d8191d4e-069c=
-4f39-a902-3ceee5efba03:1_Reducer 2_0_0)] INFO org.apache.hadoop.hive.ql.lo=
g.PerfLogger: </PERFLOG method=3DTezRunProcessor start=3D1431462868796 end=
=3D1431463105608 duration=3D236812 from=3Dorg.apache.hadoop.hive.ql.exec.te=
z.TezProcessor>
{noformat}

But AM never received the completion?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)