From: JoshRosen
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #14396: [SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file
Date: Fri, 29 Jul 2016 04:55:36 +0000 (UTC)

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/14396

    [SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file

## What changes were proposed in this pull request?

The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which first appeared in Spark 1.6 (where it was disabled by default) and became the default (and only) file server in Spark 2.0.0.

Prior to 2.0, calling `SparkContext.addFile()` repeatedly with files that have the same name and identical contents would succeed. This behavior was never explicitly documented, but Spark has behaved this way since very early 1.x versions. In 2.0 (or in 1.6 with the Netty file server enabled), the second `addFile()` call fails with a requirement error because `NettyStreamManager` guards against duplicate file registration.

This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call also fails with an exception, but that exception is logged and ignored; I believe the problematic exception-catching path was mistakenly copied from old code that was only relevant to very old versions of Spark and to YARN mode.

I believe this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same file at the same path again will succeed.

At file download time, Spark tasks fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded. It is therefore important to prevent files with the same name but different contents from being served, because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name.
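As a rough illustration of the weakened check (the class and member names below are hypothetical, not the actual `NettyStreamManager` code), re-registering the identical file becomes a no-op while a different file under the same name is still rejected:

```scala
import java.io.File
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch, not the real Spark file-server internals.
class StreamFileRegistry {
  // Exposed file name -> file registered under that name.
  private val files = new ConcurrentHashMap[String, File]()

  def addFile(file: File): Unit = {
    val existing = files.putIfAbsent(file.getName, file)
    // Weakened check: only reject a *different* path registered under the
    // same name; re-adding the identical path succeeds silently.
    require(existing == null || existing == file,
      s"File ${file.getName} was already registered with a different path.")
  }
}
```

Because `putIfAbsent` writes only on the first call, the same idiom also matches the "record state only on first addition" behavior described below.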
Because Spark does not defensively copy local files passed to `addFile()`, it is vulnerable to those files' contents changing after registration, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable: if they _are_ mutated, the result is either explicit task failures or implicit incorrectness (new executors silently receiving newer copies of a file while old executors continue to use an older version). To guard against this, I have decided to update the file-addition timestamps only on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above.

As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be read concurrently by a task-launching thread and written by a driver thread when the user's driver code is multi-threaded.

## How was this patch tested?

I added regression tests in `SparkContextSuite`; an illustrative sketch of the scenario they cover appears after the commit list below.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-16787

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14396.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14396

----

commit 99d9855c109233bbeb4a3501041e4ecf4825c278
Author: Josh Rosen
Date: 2016-07-29T01:36:24Z

    Add failing regression test.

commit 3ecdb88f52fa21dc3b8c89b27a82a33e6d2d937d
Author: Josh Rosen
Date: 2016-07-29T01:36:34Z

    Remove catch case which masked error for addJar.

commit c412f991c145cf02affd7c2d38de016a9b7f548d
Author: Josh Rosen
Date: 2016-07-29T01:56:49Z

    Fix bug.

commit b98d1492ec6a6cea4e5a8cca8e1e23fee2c120e9
Author: Josh Rosen
Date: 2016-07-29T02:54:34Z

    Add back require but weaken it to only accept identical paths.

commit 0d7dd0d12ace12bf54c3d6b62b802ffa4de800a3
Author: Josh Rosen
Date: 2016-07-29T04:32:03Z

    Improve thread-safety; do not update timestamp for file under assumption of immutability.

----
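The scenario the regression tests exercise can be sketched as a standalone check; the shape below is illustrative only (the real tests live in `SparkContextSuite` and use its test harness):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative standalone check, not the actual SparkContextSuite test.
object AddFileTwiceRegression {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("SPARK-16787"))
    try {
      val file = Files.createTempFile("spark-16787", ".txt")
      Files.write(file, "contents".getBytes("UTF-8"))
      sc.addFile(file.toString)
      // Before this patch the second call threw an IllegalArgumentException
      // from the file server's require(); after it, this call is a no-op.
      sc.addFile(file.toString)
    } finally {
      sc.stop()
    }
  }
}
```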