spark-reviews mailing list archives

From GitBox <>
Subject [GitHub] [spark] advancedxy commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed
Date Wed, 18 Sep 2019 08:05:42 GMT
advancedxy commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result
when an application was killed
   > @advancedxy can you give a completed proposal for it?
   All right, I think the requirements can be split into two parts:
   1. Support concurrent writes to different locations (partitions).
       This is achieved by using a different output path for each write:
       * For `dynamicPartitionOverwrite`, the output could be the staging dir (the current solution
of #25739), which is unique to each write.
       * For `dynamicPartitionOverwrite=false` and a partitioned table, the output in the
`OutputCommitter` could be `$table_output/static_part_key1=value1/static_part_key2=value2/...`.
Concurrent writes to partitions prefixed by different static partitions won't interfere with each
other. This could be extended in #25379.
       * For a non-partitioned table, there is only one output path, so concurrent writes are not supported.
   2. Detect concurrent writes to the same location and fail fast.
       This can be achieved during the `setupJob` stage. We can check for the existence of the output
path, as `FileOutputFormat` does. If the output path already exists, it must
have been created by another concurrent write job or left behind by a previous failed/killed job. We can throw
an exception listing the possible reasons and fail the current job. Of course, we cannot simply
check the output path passed to JobConf, as $table_output should already be present (unless the table is
being created for the first time). $table_output/_temporary/$app_attempt_num could be a good candidate.
      One more thing to do in Spark: Spark should infer the YARN app attempt number when running
in YARN mode. Currently, the app attempt number is always 0 when writing.
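
   To make the two parts concrete, here is a minimal sketch of the idea, not the actual Spark or Hadoop code. It uses Python's `pathlib` on the local filesystem as a stand-in for the Hadoop `FileSystem` API, and the names `resolve_output_path`, `setup_job` and `ConcurrentWriteError` are hypothetical. It assumes directory creation is atomic, which holds locally but would need to be verified for the target filesystem.

```python
from pathlib import Path
from typing import Mapping, Optional


class ConcurrentWriteError(RuntimeError):
    """Raised when the job's marker directory already exists (hypothetical type)."""


def resolve_output_path(table_output: Path,
                        static_partitions: Mapping[str, str],
                        dynamic_partition_overwrite: bool,
                        staging_dir: Optional[Path] = None) -> Path:
    """Part 1: choose the directory this write treats as its output."""
    if dynamic_partition_overwrite:
        # Each write has its own staging dir, so writes cannot collide.
        if staging_dir is None:
            raise ValueError("dynamic partition overwrite needs a staging dir")
        return staging_dir
    # Static prefix: $table_output/static_part_key1=value1/...
    # With no static partitions (non-partitioned table) this is $table_output,
    # so every write to that table shares a single output path.
    path = table_output
    for key, value in static_partitions.items():
        path = path / f"{key}={value}"
    return path


def setup_job(output_path: Path, app_attempt_num: int) -> Path:
    """Part 2: fail fast if <output>/_temporary/<attempt> already exists."""
    marker = output_path / "_temporary" / str(app_attempt_num)
    # The output path itself may legitimately exist, so only the marker
    # directory is checked, not the output path.
    marker.parent.mkdir(parents=True, exist_ok=True)
    try:
        marker.mkdir()  # raises FileExistsError if the marker is already there
    except FileExistsError as exc:
        raise ConcurrentWriteError(
            f"{marker} already exists: another job may be writing to "
            f"{output_path}, or a previous failed/killed job left it behind"
        ) from exc
    return marker
```

   In this sketch, two writes with different static partition prefixes resolve to different output paths and get separate marker directories, while a second `setup_job` call on the same path with the same attempt number raises `ConcurrentWriteError`.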
   I believe the proposed approach should cover both concurrent writes and the case in this PR. WDYT
@cloud-fan, @turboFei and @wangyum?
