spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-20230) FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched
Date Wed, 05 Apr 2017 19:17:41 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Apache Spark reassigned SPARK-20230:
------------------------------------

    Assignee: Apache Spark

> FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20230
>                 URL: https://issues.apache.org/jira/browse/SPARK-20230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Burak Yavuz
>            Assignee: Apache Spark
>
> If you lose instances that have shuffle outputs, you will start observing messages like:
> {code}
> 17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 3849, 172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 172.128.200.157, 4048, None), shuffleId=16, mapId=2, reduceId=3, message=
> {code}
> Generally, these messages are followed by:
> {code}
> 17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
> 17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
> 17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
> 17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4 (epoch 20)
> 17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now unavailable on executor 4 (73/89, false)
> {code}
> which is great: Spark resubmits tasks for the data that has been lost. However, if you have cascading instance failures, then you may come across:
> {code}
> 17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from ResultTask(64, 46) as it's from ResultStage 64 attempt 0 and there is a more recent attempt for that stage (attempt ID 1) running
> {code}
> which does not invalidate the file outputs. In later retries of the stage, Spark will attempt to access files on machines that no longer exist, and after 4 tries Spark will give up. If it had not ignored the fetch failure, and had invalidated the cache, most of the lost files could have been recomputed during one of the previous retries.
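The failure mode described above can be sketched in miniature. The following is a hypothetical simulation (not Spark source code; all names are illustrative) of the decision the issue describes: when a fetch failure arrives from a stale stage attempt and is ignored, the map-output cache still points at the dead host, so every later retry fetches from a machine that is gone.

```python
class MapOutputCache:
    """Toy stand-in for MapOutputTracker state: map_id -> host holding the shuffle file."""

    def __init__(self, outputs):
        self.outputs = dict(outputs)

    def invalidate_host(self, host):
        # Drop every cached shuffle output located on the failed host.
        self.outputs = {m: h for m, h in self.outputs.items() if h != host}


def handle_fetch_failure(cache, failed_host, attempt, latest_attempt,
                         invalidate_stale=False):
    # Behavior described in the issue: a fetch failure reported by an
    # older stage attempt is ignored outright, leaving the cache stale.
    if attempt < latest_attempt and not invalidate_stale:
        return  # failure ignored; dead host stays in the cache
    cache.invalidate_host(failed_host)


cache = MapOutputCache({0: "172.128.200.157", 1: "172.128.196.240"})

# Stale-attempt failure is ignored: the dead host remains cached.
handle_fetch_failure(cache, "172.128.200.157", attempt=0, latest_attempt=1)
assert "172.128.200.157" in cache.outputs.values()

# Proposed behavior: invalidate even when the report comes from a
# stale attempt, so the next retry recomputes instead of refetching.
handle_fetch_failure(cache, "172.128.200.157", attempt=0, latest_attempt=1,
                     invalidate_stale=True)
assert "172.128.200.157" not in cache.outputs.values()
```

Under this reading, the fix is not about which attempt reported the failure but about always treating the reported host's outputs as lost, so the recomputation happens during the retry that is already running.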



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

