Date: Tue, 13 Sep 2016 20:09:20 +0000 (UTC)
From: "Apache Spark (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Assigned] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures

     [ https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17484:
------------------------------------

    Assignee: Josh Rosen  (was: Apache Spark)

> Race condition when cancelling a job during a cache write can lead to block fetch failures
> ------------------------------------------------------------------------------------------
>
>              Key: SPARK-17484
>              URL: https://issues.apache.org/jira/browse/SPARK-17484
>          Project: Spark
>       Issue Type: Improvement
>       Components: Block Manager
> Affects Versions: 2.0.0
>         Reporter: Josh Rosen
>         Assignee: Josh Rosen
>
> On a production cluster, I observed the following behavior: a block manager cached a block, the store then failed because the task was killed / cancelled, and a subsequent task nevertheless attempted to read the cached block from the machine where the write had failed, eventually leading to a complete job failure.
> Here's the executor log snippet from the machine performing the failed cache:
> {code}
> 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory (estimated size 976.8 MB, free 9.8 GB)
> 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed
> 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID 127)
> {code}
> Here's the exception from the reader in the failed job:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: Failed to fetch block after 1 fetch failures. Most recent failure cause:
> 	at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
> 	at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> 	at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> 	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> {code}
> I believe that there's a race condition in how we handle cleanup after failed cache stores. Here's an excerpt from {{BlockManager.doPut()}}:
> {code}
> var blockWasSuccessfullyStored: Boolean = false
> val result: Option[T] = try {
>   val res = putBody(putBlockInfo)
>   blockWasSuccessfullyStored = res.isEmpty
>   res
> } finally {
>   if (blockWasSuccessfullyStored) {
>     if (keepReadLock) {
>       blockInfoManager.downgradeLock(blockId)
>     } else {
>       blockInfoManager.unlock(blockId)
>     }
>   } else {
>     blockInfoManager.removeBlock(blockId)
>     logWarning(s"Putting block $blockId failed")
>   }
> }
> {code}
> The only way I can see this "successfully stored, then immediately failed" pattern appearing in our logs is if the local memory store write succeeds and an exception (perhaps an InterruptedException) then sends us into the {{finally}} block's error-cleanup path. The problem is that the {{finally}} block only cleans up the block's metadata rather than performing the full cleanup path, which would also notify the master that the block is no longer available at this host.
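> To make the sequence concrete, here is a minimal, self-contained Scala sketch of the ordering described above. This is not Spark code: the two mutable sets and the helper names are toy stand-ins for the master's view of block locations and for the local {{BlockInfoManager}} metadata. The store succeeds and the "master" learns about the block, the kill then surfaces as an exception before {{doPut()}} can record success, and the error-cleanup path drops only the local metadata:
> {code}
> import scala.collection.mutable
>
> object CacheWriteRaceSketch {
>   // Toy stand-ins for the two views of block state.
>   val masterLocations = mutable.Set.empty[String] // what the "master" thinks this host stores
>   val localBlockInfo  = mutable.Set.empty[String] // local block metadata
>
>   // Stand-in for the put body: the write succeeds and the master is told about
>   // the block, but the task kill surfaces before the result is returned.
>   def putBodySketch(blockId: String): Option[Unit] = {
>     localBlockInfo += blockId
>     masterLocations += blockId
>     throw new InterruptedException("task killed during cache write")
>   }
>
>   // Mirrors the structure of the doPut() excerpt above.
>   def doPutSketch(blockId: String): Unit = {
>     var blockWasSuccessfullyStored = false
>     try {
>       val res = putBodySketch(blockId)
>       blockWasSuccessfullyStored = res.isEmpty
>     } finally {
>       if (!blockWasSuccessfullyStored) {
>         localBlockInfo -= blockId // metadata-only cleanup; the master is never told
>       }
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     try doPutSketch("rdd_25_1") catch { case _: InterruptedException => () }
>     println(s"local metadata:     $localBlockInfo")  // empty
>     println(s"master's locations: $masterLocations") // still contains rdd_25_1 (stale)
>   }
> }
> {code}
> The stale entry left in the toy master set is exactly the state that lets a later task route a block fetch to this host even though the block is gone.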
> The fact that the Spark task was not resilient to the failed remote block fetch is a separate issue, which I'll report and fix separately. The scope of this JIRA, however, is the fact that Spark still attempted to read the block from a machine that no longer had it.
> In order to fix this problem, I think that the {{finally}} block should perform more thorough cleanup and should send a "block removed" status update to the master following any error during the write. This is necessary because the body of {{doPut()}} may already have notified the master of the block's availability. In addition, we can extend the block-serving code path to automatically update the master with "block deleted" statuses whenever the block manager receives an invalid request for a block that it doesn't have.
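> As a rough illustration of that last idea, here is another toy Scala sketch; it is not the actual block-serving or {{BlockManagerMaster}} code, and {{serveBlock}}, {{localBlocks}}, and {{masterLocations}} are hypothetical stand-ins. On a request for a block it doesn't have, the serving side both fails the fetch and removes the stale entry from the toy master set, so readers stop being routed to this host:
> {code}
> import scala.collection.mutable
>
> object ServeMissingBlockSketch {
>   // Toy stand-ins for local block storage and the master's view of this host.
>   val localBlocks     = mutable.Map.empty[String, Array[Byte]]
>   val masterLocations = mutable.Set.empty[String]
>
>   // Handle a remote fetch request for blockId.
>   def serveBlock(blockId: String): Option[Array[Byte]] =
>     localBlocks.get(blockId) match {
>       case found @ Some(_) => found
>       case None =>
>         // Invalid request: we don't have the block. Besides failing the fetch,
>         // proactively correct the "master" so it stops advertising this host.
>         masterLocations -= blockId
>         None
>     }
>
>   def main(args: Array[String]): Unit = {
>     masterLocations += "rdd_25_1"                 // stale entry left behind by the failed put
>     println(serveBlock("rdd_25_1"))               // None: this fetch still fails...
>     println(masterLocations.contains("rdd_25_1")) // false: ...but the stale entry is now gone
>   }
> }
> {code}
> The intent in both cases is the same: once a write has failed, readers should not keep being directed to this host for a block it no longer holds.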