spark-issues mailing list archives

From "Miles Crawford (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14209) Application failure during preemption.
Date Fri, 01 Apr 2016 21:47:25 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222389#comment-15222389 ]

Miles Crawford commented on SPARK-14209:
----------------------------------------

Can you be a bit more specific about Spark's default log configuration?

All we're doing in terms of logging is placing a logback.xml file into our classpath that
sets the console logger to level INFO...

> Application failure during preemption.
> --------------------------------------
>
>                 Key: SPARK-14209
>                 URL: https://issues.apache.org/jira/browse/SPARK-14209
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.6.1
>         Environment: Spark on YARN
>            Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle service.  When a new job arrives, existing jobs are successfully preempted down to fit.
> A spate of these messages arrives:
> 	ExecutorLostFailure (executor 48 exited unrelated to the running tasks) Reason: Container container_1458935819920_0019_01_000143 on host: ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
>     Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
>         Caused by: java.net.ConnectException: Connection refused: ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


