From JoshRosen <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Date Tue, 16 Dec 2014 03:13:11 GMT
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2848#issuecomment-67106447
  
    Disclaimer: for iterative debugging, I use `sbt` to build Spark, not Maven.  Spark 1.2.0
has docs on building with SBT.  If possible, I'd switch to using that workflow.
    
    The issue here is probably that you're running a full `mvn clean` and starting over from
scratch after each change.  I'd like to help move this PR along, so below is an interactive
log of my attempt to see whether I can iterate on this quickly using Maven.
    
    Let's say that I'm starting from a completely cold build (but with Maven dependencies
already downloaded):
    
    ```bash
    # Since I have zinc installed, I'll use it:
    zinc -start
    git checkout /a/branch/with/your/pr/code
    # Here, the -T C1 says "build in parallel with one thread per core":
    time mvn -T C1 clean package -DskipTests
    ```
    
    This didn't take _super_ long, but it was a few minutes:
    
    ```
    real	4m19.537s
    user	3m14.634s
    sys	0m16.882s
    ```
    
    Let's run just the test suite that we're interested in ([instructions from here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools)):
    
    ```
    time mvn -T C1 test -DwildcardSuites=org.apache.spark.FileServerSuite
    ```
    
    This took a little while because it had to build a bunch of test sources, but it was only
a few seconds before the tests started running (and failing):
    
    ```
    real	0m33.968s
    user	0m36.544s
    sys	0m3.032s
    
    FileServerSuite:
    - Distributing files locally
    - Distributing files locally security On *** FAILED ***
      java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
      at java.io.FileInputStream.open(Native Method)
      at java.io.FileInputStream.<init>(FileInputStream.java:146)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
      at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
      at com.google.common.io.Files.copy(Files.java:436)
      at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
      at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
      at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
      at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96)
      ...
    - Distributing files locally using URL as input *** FAILED ***
      java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
      at java.io.FileInputStream.open(Native Method)
      at java.io.FileInputStream.<init>(FileInputStream.java:146)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
      at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
      at com.google.common.io.Files.copy(Files.java:436)
      at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
      at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
      at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
      at org.apache.spark.FileServerSuite$$anonfun$5.apply$mcV$sp(FileServerSuite.scala:112)
      ...
    - Dynamically adding JARS locally
    - Distributing files on a standalone cluster *** FAILED ***
      java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
      at java.io.FileInputStream.open(Native Method)
      at java.io.FileInputStream.<init>(FileInputStream.java:146)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
      at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
      at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
      at com.google.common.io.Files.copy(Files.java:436)
      at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
      at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
      at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
      at org.apache.spark.FileServerSuite$$anonfun$8.apply$mcV$sp(FileServerSuite.scala:137)
      ...
    - Dynamically adding JARS on a standalone cluster
    - Dynamically adding JARS on a standalone cluster using local: URL
    ```
    
    Let's try adding a print statement to `SparkContext.addFile`, then re-running the tests.
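    For illustration, the kind of temporary change I mean is just a one-line print at the top
of that method.  Here's a sketch (the exact signature and location in `SparkContext.scala` may
differ in your checkout; the rest of the method body is elided):
    
    ```scala
    // core/src/main/scala/org/apache/spark/SparkContext.scala (sketch)
    def addFile(path: String): Unit = {
      println(s"addFile() called with path = $path")  // temporary debug print
      // ... existing implementation, unchanged ...
    }
    ```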
    We could do this by re-packaging:
    
    ```
    mvn -T C1 package -DskipTests
    ```
    
    ```
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 01:24 min (Wall Clock)
    [INFO] Finished at: 2014-12-15T18:47:00-08:00
    [INFO] Final Memory: 50M/1535M
    ```
    
    And to re-run the test:
    
    ```
    time mvn -T C1 test -DwildcardSuites=org.apache.spark.FileServerSuite
    ```
    
    ```
    real	0m40.149s
    user	0m35.894s
    sys	0m3.163s
    ```
    
    So this took a couple of minutes.
    
    Let's do the same thing in SBT.  First, let's start off from a completely clean slate:
    
    ```
    time sbt/sbt clean package assembly
    ```
    
    The user time here is inflated relative to the wall-clock time because the build runs in
parallel on my multi-core machine:
    
    ```
    real	3m53.643s
    user	8m22.337s
    sys	1m15.794s
    ```
    
    Next, let's run just the suite we're interested in.  Here's a naive way to do this, which
involves building every test, so this first run will take longer than subsequent runs:
    
    ```
    time sbt/sbt "test-only FileServerSuite"
    ```
    
    ```
    [success] Total time: 88 s, completed Dec 15, 2014 6:56:35 PM
    
    real	1m39.013s
    user	7m37.323s
    sys	0m16.206s
    ```
    
    Whoops, I made a mistake here!  My `test-only` pattern didn't include a wildcard, so `FileServerSuite`
didn't match the fully-qualified name of a test suite.  Let me go ahead and re-run with the
right command:
    
    ```
    time sbt/sbt "test-only *FileServerSuite"
    ```
    
    This was pretty fast:
    
    ```
    real	0m29.075s
    user	0m50.744s
    sys	0m3.512s
    ```
    
    I could also have run this from the interactive shell to get automatic rebuilding on source
changes.
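    Concretely, that would look something like the following sketch; the leading `~` is SBT's
continuous-execution prefix, which re-runs the command whenever it detects a source change:
    
    ```
    sbt/sbt
    > ~test-only *FileServerSuite
    ```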
    
    There's an even faster way of running just FileServerSuite, though: I can tell SBT to
only build / run the `core` module.  This time, let's do this interactively, but from a clean
slate:
    
    ```scala
    [info] Set current project to spark-parent (in build file:/Users/joshrosen/Documents/spark/)
    > clean
    [success] Total time: 20 s, completed Dec 15, 2014 7:01:20 PM
    > project core
    [info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
    > package
    [...]
    [info] Compiling 42 Java sources to /Users/joshrosen/Documents/spark/network/common/target/scala-2.10/classes...
    [info] Compiling 20 Java sources to /Users/joshrosen/Documents/spark/network/shuffle/target/scala-2.10/classes...
    [info] Compiling 397 Scala sources and 33 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/classes...
    [...]
    [info] Packaging /Users/joshrosen/Documents/spark/core/target/scala-2.10/spark-core_2.10-1.3.0-SNAPSHOT.jar
...
    [info] Done packaging.
    [success] Total time: 64 s, completed Dec 15, 2014 7:02:36 PM
    > test-only *FileServerSuite
    [...]
    [info] Compiling 124 Scala sources and 4 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/test-classes...
    [...]
    [---- tests run ----]
    [---- tests go into infinite loop ----]
    ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
    ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
    ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
    ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
    [... infinite repetitions ...]
    ```
    
    Hmm, so it looks like the tests that rely on `local-cluster` mode need access to a Spark
assembly JAR in order to run, and that this error condition is mishandled somewhere (hence
the infinite loop).  This is pretty annoying, so I guess I'll build an assembly once and then
use `export SPARK_PREPEND_CLASSES=true` so that I don't have to keep re-building it across
test runs:
    
    ```
    export SPARK_PREPEND_CLASSES=true
    sbt/sbt assembly/assembly
    sbt/sbt
    ```
    
    Now, from the SBT shell:
    
    ```scala
    > project core
    [info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
    > ~test-only *FileServerSuite
    [... tests run ...]
    [info] *** 3 TESTS FAILED ***
    [error] Failed: Total 7, Failed 3, Errors 0, Passed 4
    [error] Failed tests:
    [error] 	org.apache.spark.FileServerSuite
    [error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
    [error] Total time: 18 s, completed Dec 15, 2014 7:10:57 PM
    1. Waiting for source changes... (press enter to interrupt)
    [... add a println to Utils.addFile ...]
    [... tests start up almost instantly and run ...]
    [info] *** 3 TESTS FAILED ***
    [error] Failed: Total 7, Failed 3, Errors 0, Passed 4
    [error] Failed tests:
    [error] 	org.apache.spark.FileServerSuite
    [error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
    [error] Total time: 19 s, completed Dec 15, 2014 7:11:46 PM
    ```
    
    So, to summarize: I agree that there are a bunch of pain points in the current build process.
Day-to-day, though, it hasn't affected me that much, since I'll usually run `sbt/sbt clean
package assembly` and `export SPARK_PREPEND_CLASSES=true` once at the beginning of the day,
then keep working in my SBT shell, where incremental recompilation means that I can make changes
in my IDE and see the failing test update (almost) instantly.
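    
    In short, the daily workflow looks roughly like this (using the suite from above as the example):
    
    ```
    export SPARK_PREPEND_CLASSES=true
    sbt/sbt clean package assembly
    sbt/sbt
    ```
    
    and then, from the SBT shell:
    
    ```
    > project core
    > ~test-only *FileServerSuite
    ```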

