spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "xukun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
Date Tue, 14 Feb 2017 11:50:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865644#comment-15865644
] 

xukun commented on SPARK-18113:
-------------------------------

[~aash]

Thanks for your reply. I think [https://github.com/palantir/spark/pull/94] can not resolve
this scenario.
{code}
driver:          starting task 678.0 ---- Lost executor ---- retry 678.1 ---- Authorizing
attemptNumber=0,partition=678 --- Denying attemptNumber=1,partition=678 -- deadloop

executor:     Running task 678.0 ------------------------------------------- --outputCommitCoordinator.canCommit
---- finished task 678.0 ---- RECEIVED SIGNAL 15: 
{code}



> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-18113
>                 URL: https://issues.apache.org/jira/browse/SPARK-18113
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.0.1
>         Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>            Reporter: xuqing
>            Assignee: jin xing
>             Fix For: 2.2.0
>
>
> Executor sends *AskPermissionToCommitOutput* to driver failed, and retry another sending.
Driver receives 2 AskPermissionToCommitOutput messages and handles them. But executor ignores
the first response(true) and receives the second response(false). The TaskAttemptNumber for
this partition in authorizedCommittersByStage is locked forever. Driver enters into infinite
loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com,
partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com):
TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2,
partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com,
partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com):
TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2,
partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733,
cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)]
in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This
timeout is controlled by spark.rpc.askTimeout
>         at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
>         at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
>         at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>         at org.apache.spark.scheduler.Task.run(Task.scala:86)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>         ... 13 more
> ...
> 16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
> ...
> 16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: attempt_201610250536_0002_m_000024_119:
Not committed because the driver did not authorize commit
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message