spark-reviews mailing list archives

From steveloughran <...@git.apache.org>
Subject [GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Date Thu, 02 Nov 2017 11:21:04 GMT
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148503478
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
---
    @@ -50,28 +53,34 @@
     
       /**
        * Creates a writer factory which will be serialized and sent to executors.
    +   *
    +   * If this method fails (by throwing an exception), the action would fail and no Spark
    +   * job was submitted.
        */
       DataWriterFactory<Row> createWriterFactory();
     
       /**
       * Commits this writing job with a list of commit messages. The commit messages are
       * collected from
    -   * successful data writers and are produced by {@link DataWriter#commit()}. If this method
    -   * fails(throw exception), this writing job is considered to be failed, and
    -   * {@link #abort(WriterCommitMessage[])} will be called. The written data should only
    -   * be visible to data source readers if this method succeeds.
    +   * successful data writers and are produced by {@link DataWriter#commit()}.
    +   *
    +   * If this method fails (by throwing an exception), this writing job is considered to
    +   * have failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of
    +   * the destination is undefined and {@link #abort(WriterCommitMessage[])} may not be
    +   * able to deal with it.
        *
       * Note that, one partition may have multiple committed data writers because of
       * speculative tasks. Spark will pick the first successful one and get its commit
       * message. Implementations should be
    --- End diff --
    
    Having >1 committed writer for a partition is a failure of the protocol. Speculation & failure handling should allow >1 ready-to-commit writer, but only one should actually commit. That's where the stuff about writers reporting ready-to-commit & the driver tracking the state of active tasks comes in.
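    
    The arbitration described above can be sketched like this (a minimal illustration with hypothetical names, not Spark's actual `OutputCommitCoordinator` API): the driver records the first attempt granted permission to commit each partition, and refuses every later attempt, which must then abort.
    
    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    
    // Hypothetical driver-side commit arbitration; names are illustrative.
    // The first attempt to ask for a partition wins the right to commit;
    // any later attempt for the same partition is refused and must abort.
    public class CommitCoordinator {
        private final Map<Integer, Integer> winners = new ConcurrentHashMap<>();
    
        /** Returns true iff this attempt wins the right to commit the partition. */
        public boolean canCommit(int partition, int attempt) {
            Integer prev = winners.putIfAbsent(partition, attempt);
            return prev == null || prev == attempt;
        }
    }
    ```
    
    With two speculative attempts for partition 0, only the first caller gets `true`; re-asking by the winner stays `true`, so a lost ack can be retried safely on the driver side.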
    
    What if a commit fails, times out, or raises some other error? Hadoop FileOutputFormat v1 commit can actually recover from a failed task commit; v2 can't (assumption: rename() is atomic and will fail if the dest path exists). Hence the `OutputCommitter.isRecoverySupported()` probe to tell the driver whether or not it can recover from a task commit which is perceived as having failed.
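    
    The rename assumption can be illustrated with local-filesystem semantics (`java.nio.file.Files.move` without `REPLACE_EXISTING` fails if the destination exists, which is the property being assumed of the FileSystem rename; HDFS differs in detail, so this is a sketch, not a real committer):
    
    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    
    // Sketch of the assumed rename semantics: a task-commit rename succeeds
    // only if nothing has been committed to the destination yet, so the first
    // attempt wins and a duplicate commit fails instead of clobbering output.
    // Local-filesystem illustration only; real committers target a FileSystem.
    public class AtomicRenameCommit {
        /** Moves the attempt dir to the committed location; false if another attempt won. */
        public static boolean tryCommit(Path attemptDir, Path committedDir) {
            try {
                // Without REPLACE_EXISTING, Files.move throws if committedDir exists.
                Files.move(attemptDir, committedDir);
                return true;
            } catch (IOException alreadyCommitted) {
                return false;
            }
        }
    }
    ```
    
    Under this model a second speculative attempt's commit fails cleanly, which is exactly the property a driver needs before it can treat "retry the task" as safe.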


---


