hive-issues mailing list archives

From "Jason Dere (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17091) "Timed out getting readerEvents" error from external LLAP client
Date Fri, 14 Jul 2017 01:25:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086699#comment-16086699 ]

Jason Dere commented on HIVE-17091:
-----------------------------------

Looks like a couple of different issues at play here:
1) On the LLAP daemon, the executor has finished but is somehow still stuck waiting for the
ChannelOutputStream to finish all writes (even though all of the data was already received
by the client). This might be related to the pendingWrites/writeMonitor logic used by the
ChannelOutputStream to manage the number of outstanding writes for an external fragment
request. I've tried replacing this mechanism with a Semaphore, and so far I haven't seen
this issue recur; a rough sketch of that idea follows the stack trace below.
{noformat}
Thread 1802 (TezTR-683826_93_0_0_29_0):
  State: WAITING
  Blocked count: 456
  Waited count: 458
  Waiting on java.lang.Object@7e3b8b1
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:502)
    org.apache.hadoop.hive.llap.ChannelOutputStream.waitForWritesToFinish(ChannelOutputStream.java:153)
    org.apache.hadoop.hive.llap.ChannelOutputStream.close(ChannelOutputStream.java:136)
    java.io.FilterOutputStream.close(FilterOutputStream.java:159)
    org.apache.hadoop.hive.llap.io.ChunkedOutputStream.close(ChunkedOutputStream.java:81)
    java.io.FilterOutputStream.close(FilterOutputStream.java:159)
    org.apache.hadoop.hive.llap.LlapRecordWriter.close(LlapRecordWriter.java:47)
    org.apache.hadoop.hive.ql.io.HivePassThroughRecordWriter.close(HivePassThroughRecordWriter.java:46)
    org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
    org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1039)
    org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
    org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
    org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:464)
    org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:206)
    org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172)
    org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
{noformat}
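
Roughly the shape of the Semaphore-based replacement I'm testing (a minimal sketch; MAX_PENDING_WRITES, the class name, and the completion-callback wiring are illustrative, not the actual ChannelOutputStream code):
{code:java}
import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: bound the number of in-flight async writes with a Semaphore
// instead of a pendingWrites counter guarded by a monitor object.
public class BoundedAsyncWriter {
  private static final int MAX_PENDING_WRITES = 8; // illustrative limit

  private final Semaphore pending = new Semaphore(MAX_PENDING_WRITES);

  // Blocks once MAX_PENDING_WRITES writes are in flight, so the producer
  // cannot run ahead of the channel.
  public void write(byte[] buf) throws IOException {
    try {
      pending.acquire();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted waiting for a write slot", ie);
    }
    sendAsync(buf, pending::release); // permit returned when the write completes
  }

  // close() waits until every permit is back, i.e. all writes have finished.
  // A timed tryAcquire avoids the unbounded Object.wait() seen in the
  // stack trace above.
  public void close() throws IOException {
    try {
      if (!pending.tryAcquire(MAX_PENDING_WRITES, 60, TimeUnit.SECONDS)) {
        throw new IOException("Timed out waiting for pending writes to finish");
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while closing", ie);
    }
  }

  // Stand-in for the async channel write; the real code would attach
  // onComplete as a write-completion listener.
  private void sendAsync(byte[] buf, Runnable onComplete) {
    onComplete.run();
  }
}
{code}
The nice property is that all of the bookkeeping collapses into permit counts: there is no separate pendingWrites counter and writeMonitor to keep consistent, so a lost notify can't leave close() waiting forever.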

2) The LLAP client received the end of the data stream and is expecting a heartbeat with a
task complete notification:
{noformat}
2017-06-29 05:47:44,060 DEBUG org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Heartbeat from attempt_7085310350540683826_0089_0_00_000033_0 events: 1
2017-06-29 05:47:44,060 DEBUG org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Task update event for attempt_7085310350540683826_0089_0_00_000033_0
2017-06-29 05:47:44,065 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Chunk size 131072
2017-06-29 05:47:44,081 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Chunk size 131072
2017-06-29 05:47:44,097 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Chunk size 131072
2017-06-29 05:47:44,119 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Chunk size 30244
2017-06-29 05:47:44,123 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Chunk size 0
2017-06-29 05:47:44,123 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Hit end of data
2017-06-29 05:47:44,123 INFO org.apache.hadoop.hive.llap.LlapBaseRecordReader: 1498729664123 Waiting for reader event for LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0)
{noformat}
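
As an aside, the framing implied by the log is a length prefix followed by that many bytes of payload, with a zero-length chunk as the end-of-data marker. Something like the following (the exact wire format of ChunkedInputStream, including the int length prefix, is an assumption here):
{code:java}
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative reader for a length-prefixed chunk stream where a
// zero-length chunk means "Hit end of data", as in the log above.
public class SimpleChunkedReader {
  public static long drain(InputStream in) throws IOException {
    DataInputStream din = new DataInputStream(in);
    long total = 0;
    while (true) {
      int chunkSize = din.readInt(); // e.g. 131072, 131072, 30244, then 0
      if (chunkSize == 0) {
        break; // end-of-data marker; task-complete still arrives separately
      }
      byte[] chunk = new byte[chunkSize];
      din.readFully(chunk); // hand off to the record decoder in real code
      total += chunkSize;
    }
    return total;
  }
}
{code}
The point being: end-of-data on the stream and the task-complete notification travel on different channels, which is why the client can see "Hit end of data" and still block waiting for the reader event.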

Due to issue 1, the task-complete event never arrives at the client, though the client continues
to receive heartbeats from LLAP. Eventually (after 30 seconds), the external client times
out waiting for the task-complete event. I'm guessing the fix on the client side is that
we shouldn't time out while waiting for the task-complete event as long as we are still
receiving heartbeats. I'll try setting it to an indefinite wait (a sketch of the
heartbeat-aware wait follows the log below).
{noformat}
2017-06-29 05:48:14,106 DEBUG org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Received heartbeat from container, request={  containerId=container_7085310350540683826_0089_00_000033, requestId=487, startIndex=0, preRoutedStartIndex=0, maxEventsToGet=500, taskAttemptId=attempt_7085310350540683826_0089_0_00_000033_0, eventCount=1 }
2017-06-29 05:48:14,106 DEBUG org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Heartbeat from attempt_7085310350540683826_0089_0_00_000033_0 events: 1
2017-06-29 05:48:14,106 DEBUG org.apache.hadoop.hive.llap.ext.LlapTaskUmbilicalExternalClient: Task update event for attempt_7085310350540683826_0089_0_00_000033_0
2017-06-29 05:48:14,123 INFO org.apache.hadoop.hive.llap.LlapBaseRecordReader: 1498729694123 Timed out waiting for reader event for LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0), timeout=30000 ms
2017-06-29 05:48:14,124 DEBUG org.apache.hadoop.hive.llap.io.ChunkedInputStream: LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0): Closing chunked input.
java.io.IOException: Timed out getting readerEvents for client LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0)
2017-06-29 05:48:14,128 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 33.0 in stage 0.0 (TID 33, localhost, executor driver): java.io.IOException: Timed out getting readerEvents for client LlapTaskUmbilicalExternalClient(attempt_7085310350540683826_0089_0_00_000033_0)
{noformat}
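
For reference, roughly what I mean by a heartbeat-aware wait (class and method names here are illustrative, not the actual LlapBaseRecordReader code):
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: instead of one fixed 30s wait for the reader event, poll in
// short slices and only count time during which no heartbeat was seen.
public class ReaderEventWaiter<E> {
  private final LinkedBlockingQueue<E> readerEvents = new LinkedBlockingQueue<>();
  private volatile long lastHeartbeatMillis = System.currentTimeMillis();

  // Invoked from the umbilical heartbeat handler.
  public void onHeartbeat() {
    lastHeartbeatMillis = System.currentTimeMillis();
  }

  // Invoked when a reader event (e.g. task complete) arrives.
  public void postEvent(E event) {
    readerEvents.add(event);
  }

  // Returns null only after timeoutMillis elapse with neither a reader
  // event nor a heartbeat; live heartbeats keep the wait open indefinitely.
  public E waitForEvent(long timeoutMillis) throws InterruptedException {
    while (true) {
      E event = readerEvents.poll(1, TimeUnit.SECONDS);
      if (event != null) {
        return event;
      }
      if (System.currentTimeMillis() - lastHeartbeatMillis > timeoutMillis) {
        return null; // caller turns this into "Timed out getting readerEvents"
      }
    }
  }
}
{code}
That way a healthy-but-slow daemon never trips the timeout, while a daemon that actually goes away (heartbeats stop) still fails the read within the configured 30s.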

> "Timed out getting readerEvents" error from external LLAP client
> ----------------------------------------------------------------
>
>                 Key: HIVE-17091
>                 URL: https://issues.apache.org/jira/browse/HIVE-17091
>             Project: Hive
>          Issue Type: Sub-task
>          Components: llap
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>
> {noformat}
> Caused by: java.io.IOException: Timed out getting readerEvents
>   at org.apache.hadoop.hive.llap.LlapBaseRecordReader.getReaderEvent(LlapBaseRecordReader.java:261)
>   at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:148)
>   at org.apache.hadoop.hive.llap.LlapBaseRecordReader.next(LlapBaseRecordReader.java:48)
>   at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:121)
>   at org.apache.hadoop.hive.llap.LlapRowRecordReader.next(LlapRowRecordReader.java:68)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
