From: JoshRosen
To: reviews@spark.apache.org
Subject: [GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Date: Tue, 16 Dec 2014 03:13:11 +0000 (UTC)

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2848#issuecomment-67106447

Disclaimer: for iterative debugging, I use `sbt` to build Spark, not Maven. Spark 1.2.0 has docs on building with SBT; if possible, I'd switch to that workflow. The issue here is probably that you're running a full `mvn clean` and starting over from scratch after each change. I'd like to help move this PR along, so what follows is an interactive log of my attempt to see whether I can quickly iterate on this using Maven.
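One aside before the logs: if the pain really is the full `mvn clean` on every change, a possible shortcut (not something this PR requires, just standard Maven flags) is to skip the clean and scope the rebuild to the module that changed. The choice of `core` below is an assumption about where your change lives:

```bash
# Hypothetical incremental rebuild: skip `clean`, build only the core
# module plus the modules it depends on (-pl selects the module, -am
# "also makes" its upstream dependencies), and skip tests:
time mvn -T C1 -pl core -am package -DskipTests
```

With zinc running, this should recompile only what actually changed rather than rebuilding everything from scratch.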
Let's say that I'm starting from a completely cold build (but with Maven dependencies already downloaded):

```bash
# Since I have zinc installed, I'll use it:
zinc -start
git checkout /a/branch/with/your/pr/code
# Here, the -T C1 says "build in parallel with one thread per core":
time mvn -T C1 clean package -DskipTests
```

This didn't take _super_ long, but it was a few minutes:

```
real    4m19.537s
user    3m14.634s
sys     0m16.882s
```

Let's run just the test suite that we're interested in ([instructions from here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools)):

```
time mvn -T C1 test -DwildcardSuites=org.apache.spark.FileServerSuite
```

This took a little while because it had to build a bunch of test sources, but it was only a few seconds before the tests started running (and failing):

```
real    0m33.968s
user    0m36.544s
sys     0m3.032s

FileServerSuite:
- Distributing files locally
- Distributing files locally security On *** FAILED ***
  java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory)
  at java.io.FileInputStream.open(Native Method)
  at java.io.FileInputStream.<init>(FileInputStream.java:146)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
  at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
  at com.google.common.io.Files.copy(Files.java:436)
  at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
  at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
  at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96)
  ...
- Distributing files locally using URL as input *** FAILED ***
  java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory)
  at java.io.FileInputStream.open(Native Method)
  at java.io.FileInputStream.<init>(FileInputStream.java:146)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
  at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
  at com.google.common.io.Files.copy(Files.java:436)
  at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
  at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
  at org.apache.spark.FileServerSuite$$anonfun$5.apply$mcV$sp(FileServerSuite.scala:112)
  ...
- Dynamically adding JARS locally
- Distributing files on a standalone cluster *** FAILED ***
  java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory)
  at java.io.FileInputStream.open(Native Method)
  at java.io.FileInputStream.<init>(FileInputStream.java:146)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
  at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
  at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
  at com.google.common.io.Files.copy(Files.java:436)
  at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
  at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
  at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
  at org.apache.spark.FileServerSuite$$anonfun$8.apply$mcV$sp(FileServerSuite.scala:137)
  ...
- Dynamically adding JARS on a standalone cluster
- Dynamically adding JARS on a standalone cluster using local: URL
```

Let's try adding a print statement to `SparkContext.addFile`, then re-running the tests. We could do this by re-packaging:

```
mvn -T C1 package -DskipTests
```

```
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:24 min (Wall Clock)
[INFO] Finished at: 2014-12-15T18:47:00-08:00
[INFO] Final Memory: 50M/1535M
```

And to re-run the test:

```
time mvn -T C1 test -DwildcardSuites=org.apache.spark.FileServerSuite
```

```
real    0m40.149s
user    0m35.894s
sys     0m3.163s
```

So this edit-compile-test cycle took a couple of minutes. Let's do the same thing in SBT. First, let's start from a completely clean slate:

```
time sbt/sbt clean package assembly
```

The timing here may not be directly comparable because I have a multi-core machine (the user time exceeds the wall-clock time due to parallel compilation):

```
real    3m53.643s
user    8m22.337s
sys     1m15.794s
```

Next, let's run just the suite we're interested in. Here's a naive way to do this, which builds every test, so this first run will take longer than subsequent runs:

```
time sbt/sbt "test-only FileServerSuite"
```

```
[success] Total time: 88 s, completed Dec 15, 2014 6:56:35 PM

real    1m39.013s
user    7m37.323s
sys     0m16.206s
```

Whoops, I made a mistake here! My `test-only` pattern didn't include a wildcard, so `FileServerSuite` didn't match the fully-qualified name of any test suite. Let me re-run with the right command:

```
time sbt/sbt "test-only *FileServerSuite"
```

This was pretty fast:

```
real    0m29.075s
user    0m50.744s
sys     0m3.512s
```

I could also have run this from the interactive shell to get automatic rebuilding on source changes. There's an even faster way of running just FileServerSuite, though: I can tell SBT to build / run only the `core` module. This time, let's do this interactively, but from a clean slate:

```scala
[info] Set current project to spark-parent (in build file:/Users/joshrosen/Documents/spark/)
> clean
[success] Total time: 20 s, completed Dec 15, 2014 7:01:20 PM
> project core
[info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
> package
[...]
[info] Compiling 42 Java sources to /Users/joshrosen/Documents/spark/network/common/target/scala-2.10/classes...
[info] Compiling 20 Java sources to /Users/joshrosen/Documents/spark/network/shuffle/target/scala-2.10/classes...
[info] Compiling 397 Scala sources and 33 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/classes...
[...]
[info] Packaging /Users/joshrosen/Documents/spark/core/target/scala-2.10/spark-core_2.10-1.3.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 64 s, completed Dec 15, 2014 7:02:36 PM
> test-only *FileServerSuite
[...]
[info] Compiling 124 Scala sources and 4 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/test-classes...
[...]
[---- tests run ----]
[---- tests go into infinite loop ----]
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
[... infinite repetitions ...]
```

Hmm, so it looks like the tests that rely on `local-cluster` mode need access to a Spark assembly JAR in order to run, and this error condition is mishandled somewhere (hence the infinite loop). This is pretty annoying, so I'll build an assembly once and then use `export SPARK_PREPEND_CLASSES=true` so that I don't have to keep re-building it across test runs:

```
export SPARK_PREPEND_CLASSES=true
sbt/sbt assembly/assembly
sbt/sbt
```

Now, from the SBT shell:

```scala
> project core
[info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
> ~test-only *FileServerSuite
[... tests run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error]   org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 18 s, completed Dec 15, 2014 7:10:57 PM
1. Waiting for source changes... (press enter to interrupt)
[... add a println to Utils.addFile ...]
[... tests start up almost instantly and run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error]   org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 19 s, completed Dec 15, 2014 7:11:46 PM
```

So, to summarize: I agree that there are a bunch of pain points in the current build process. Day-to-day, though, they haven't affected me much: I usually run `sbt/sbt clean package assembly` and `export SPARK_PREPEND_CLASSES=true` once at the beginning of the day, then keep working in my SBT shell, where incremental recompilation means I can make a change in my IDE and see the failing test update (almost) instantly.
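For anyone who wants to reuse that loop, here's the whole workflow condensed into one sketch. These are the same commands shown above; the `core` project and the `*FileServerSuite` pattern are just the ones from this thread, so substitute your own module and suite:

```bash
# Once at the start of the day: full build plus the assembly JAR that the
# local-cluster tests need, then have Spark prepend freshly compiled classes
# so the assembly doesn't need rebuilding after every change.
sbt/sbt clean package assembly
export SPARK_PREPEND_CLASSES=true

# Then stay inside the interactive shell; `~test-only` re-runs the suite
# whenever a source file changes:
sbt/sbt
# > project core
# > ~test-only *FileServerSuite
```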