Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4796718FFC for ; Tue, 12 May 2015 21:30:03 +0000 (UTC) Received: (qmail 82292 invoked by uid 500); 12 May 2015 21:30:02 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 82189 invoked by uid 500); 12 May 2015 21:30:01 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 82002 invoked by uid 99); 12 May 2015 21:30:01 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 12 May 2015 21:30:01 +0000 Date: Tue, 12 May 2015 21:30:01 +0000 (UTC) From: "Sergey Shelukhin (JIRA)" To: dev@hive.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (HIVE-10693) LLAP: DAG got stuck after reducer fetch failed MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Sergey Shelukhin created HIVE-10693: --------------------------------------- Summary: LLAP: DAG got stuck after reducer fetch failed Key: HIVE-10693 URL: https://issues.apache.org/jira/browse/HIVE-10693 Project: Hive Issue Type: Sub-task Reporter: Sergey Shelukhin Assignee: Siddharth Seth Internal app ID application_1429683757595_0912, LLAP application_1429683757= 595_0911. If someone without access wants to investigate I'll get the logs. I've ran into this only once. Feel free to close as not repro, I'll reopen = if I see again :) I want to make sure some debug info is preserved just in = case. Running Q1 - Map 1 w/1000 tasks (in this particular case), followed by Redu= cer 2 and Reducer 3, 1 task each, IIRC 3 is uber. Fetch failed with I'd assume some random disturbance in the force: {noformat} 2015-05-12 13:37:31,056 [fetcher [Map_1] #17()] WARN org.apache.tez.runtime= .library.common.shuffle.orderedgrouped.FetcherOrderedGrouped: Failed to ver= ify reply after connecting to cn047-10.l42scl.hortonworks.com:15551 with 1 = inputs pending java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.$$YJP$$socketRead0(Native Method) at java.net.SocketInputStream.socketRead0(SocketInputStream.java) at java.net.SocketInputStream.read(SocketInputStream.java:150) at java.net.SocketInputStream.read(SocketInputStream.java:121) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647) at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:787) at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpU= RLConnection.java:1534) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpUR= LConnection.java:1439) at org.apache.tez.runtime.library.common.shuffle.HttpConnection.getI= nputStream(HttpConnection.java:256) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc= herOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:339) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc= herOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:257) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc= herOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:167) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetc= herOrderedGrouped.run(FetcherOrderedGrouped.java:182) {noformat} AM registered this as Map 1 task failure {noformat} 2015-05-12 13:37:31,156 INFO [Dispatcher thread: Central] impl.TaskAttemptI= mpl: attempt_1429683757595_0912_1_00_000998_0 blamed for read error from at= tempt_1429683757595_0912_1_01_000000_0 at inputIndex 998 ... 2015-05-12 13:37:31,174 INFO [Dispatcher thread: Central] impl.TaskImpl: Sc= heduling new attempt for task: task_1429683757595_0912_1_00_000998, current= FailedAttempts: 1, maxFailedAttempts: 4 {noformat} Eventually Map 1 completed {noformat} 2015-05-12 13:38:25,247 INFO [Dispatcher thread: Central] history.HistoryEv= entHandler: [HISTORY][DAG:dag_1429683757595_0912_1][Event:VERTEX_FINISHED]:= vertexName=3DMap 1, vertexId=3Dvertex_1429683757595_0912_1_00, initRequest= edTime=3D1431462752913, initedTime=3D1431462754818, startRequestedTime=3D14= 31462754819, startedTime=3D1431462754819, finishTime=3D1431463105101, timeT= aken=3D350282, status=3DSUCCEEDED, diagnostics=3D, counters=3DCounters: 29,= org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=3D59, RACK_LOC= AL_TASKS=3D941, File System Counters, FILE_BYTES_READ=3D2160704, FILE_BYTES= _WRITTEN=3D20377550, FILE_READ_OPS=3D0, FILE_LARGE_READ_OPS=3D0, FILE_WRITE= _OPS=3D0, HDFS_BYTES_READ=3D9798097828287, HDFS_BYTES_WRITTEN=3D0, HDFS_REA= D_OPS=3D406131, HDFS_LARGE_READ_OPS=3D0, HDFS_WRITE_OPS=3D0, org.apache.tez= .common.counters.TaskCounter, SPILLED_RECORDS=3D4000, GC_TIME_MILLIS=3D7330= 9, CPU_MILLISECONDS=3D0, PHYSICAL_MEMORY_BYTES=3D-1000, VIRTUAL_MEMORY_BYTE= S=3D-1000, COMMITTED_HEAP_BYTES=3D25769803776000, INPUT_RECORDS_PROCESSED= =3D5861038, OUTPUT_RECORDS=3D4000, OUTPUT_BYTES=3D376000, OUTPUT_BYTES_WITH= _OVERHEAD=3D0, OUTPUT_BYTES_PHYSICAL=3D0, ADDITIONAL_SPILLS_BYTES_WRITTEN= =3D0, ADDITIONAL_SPILLS_BYTES_READ=3D0, ADDITIONAL_SPILL_COUNT=3D0, HIVE, D= ESERIALIZE_ERRORS=3D0, RECORDS_IN_Map_1=3D5999989709, RECORDS_OUT_INTERMEDI= ATE_Map_1=3D4000, vertexStats=3DfirstTaskStartTime=3D1431462757804, firstTa= sksToStart=3D[ task_1429683757595_0912_1_00_000000 ], lastTaskFinishTime=3D= 1431463105085, lastTasksToFinish=3D[ task_1429683757595_0912_1_00_000999 ],= minTaskDuration=3D1743, maxTaskDuration=3D236653, avgTaskDuration=3D6377.3= 34000000002, numSuccessfulTasks=3D1000, shortestDurationTasks=3D[ task_1429= 683757595_0912_1_00_000472 ], longestDurationTasks=3D[ task_1429683757595_0= 912_1_00_000999 ], vertexTaskStats=3D{numFailedTaskAttempts=3D1, numKilledT= askAttempts=3D0, numCompletedTasks=3D1000, numSucceededTasks=3D1000, numKil= ledTasks=3D0, numFailedTasks=3D0} {noformat} Then AM logged this (not sure if it's relevant) {noformat} 2015-05-12 13:38:25,249 INFO [Dispatcher thread: Central] container.AMConta= inerImpl: AMContainer container_222212222_ 0912_01_001000 transitioned from RUNNING to IDLE via event C_TA_SUCCEEDED20= 15-05-12 13:38:25,250 INFO [Dispatcher thread: Central] history.HistoryEven= tHandler: [HISTORY][DAG:dag_142968375759 5_0912_1][Event:CONTAINER_STOPPED]: containerId=3Dcontainer_222212222_0912_= 01_001000, stoppedTime=3D1431463105250, exitStatus=3D0 2015-05-12 13:38:25,250 INFO [Dispatcher thread: Central] container.AMConta= inerImpl: AMContainer container_222212222_0912_01_001000 transitioned from = IDLE to STOP_REQUESTED via event C_STOP_REQUEST 2015-05-12 13:38:25,253 INFO [Dispatcher thread: Central] impl.VertexImpl: = Source task attempt completed for vertex: vertex_1429683757595_0912_1_01 [R= educer 2] attempt: attempt_1429683757595_0912_1_00_000999_0 with state: SUC= CEEDED vertexState: RUNNING 2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] impl.DAGImpl: Ver= tex vertex_1429683757595_0912_1_00 [Map 1] completed., numCompletedVertices= =3D1, numSuccessfulVertices=3D1, numFailedVertices=3D0, numKilledVertices= =3D0, numVertices=3D3 2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] impl.DAGImpl: Che= cking vertices for DAG completion, numCompletedVertices=3D1, numSuccessfulV= ertices=3D1, numFailedVertices=3D0, numKilledVertices=3D0, numVertices=3D3,= commitInProgress=3D0, terminationCause=3Dnull 2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] tezplugins.LlapCo= ntainerLauncher: DEBUG: Ignoring STOP_REQUEST for event: EventType: CONTAIN= ER_STOP_REQUEST 2015-05-12 13:38:25,254 INFO [Dispatcher thread: Central] container.AMConta= inerImpl: AMContainer container_222212222_0912_01_001000 transitioned from = STOP_REQUESTED to STOPPING via event C_NM_STOP_SENT 2015-05-12 13:38:35,086 INFO [LlapSchedulerNodeEnabler] impl.LlapYarnRegist= ryImpl: Starting to refresh ServiceInstanceSet 1959961953 13:53 and after that the job was stuck for 4 minutes until I killed it, wit= h no other log messages aside from refreshes {noformat} And nothing happened for 4+ minutes until I killed the job. Reducer 2 had a running attempt: {noformat} 2015-05-12 13:34:28,595 INFO [Dispatcher thread: Central] history.HistoryEv= entHandler: [HISTORY][DAG:dag_1429683757595_0912_1][Event:TASK_ATTEMPT_STAR= TED]: vertexName=3DReducer 2, taskAttemptId=3Dattempt_1429683757595_0912_1_= 01_000000_0, startTime=3D1431462868595, containerId=3Dcontainer_222212222_0= 912_01_001001, nodeId=3Dcn041-10.l42scl.hortonworks.com:15001, inProgressLo= gs=3Dhostname:0/node/containerlogs/container_222212222_0912_01_001001/sersh= e, completedLogs=3Dhttp://cn042-10.l42scl.hortonworks.com:19888/jobhistory/= logs/cn041-10.l42scl.hortonworks.com:15001/container_222212222_0912_01_0010= 01/v_Reducer 2_attempt_1429683757595_0912_1_01_000000_0/sershe {noformat} It looks like it actually completed on LLAP {noformat} 2015-05-12 13:38:25,608 [TezTaskRunner_attempt_1429683757595_0912_1_01_0000= 00_0(container_222212222_0912_01_001001_sershe_20150512133230_d8191d4e-069c= -4f39-a902-3ceee5efba03:1_Reducer 2_0_0)] INFO org.apache.hadoop.hive.ql.lo= g.PerfLogger: {noformat} But AM never received the completion? -- This message was sent by Atlassian JIRA (v6.3.4#6332)