spark-user mailing list archives

From: Juho Autio <juho.au...@rovio.com>
Subject: Question about SaveMode.Ignore behaviour
Date: Thu, 09 May 2019 13:28:31 GMT
Does Spark handle 'ignore' mode at the file level or at the partition level?


My code is like this:

    df.write \
        .option('mapreduce.fileoutputcommitter.algorithm.version', '2') \
        .mode('ignore') \
        .partitionBy('p') \
        .orc(target_path)

When I use mode('append'), my job sometimes fails
with FileAlreadyExistsException when a failed task is retried. So I would
like to skip writing if a file with the same name already exists, and write
the file if it doesn't, of course.
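
To be concrete, this is roughly the manual check I'd like to avoid doing
myself (just a sketch, assuming a SparkSession named spark; the _jvm access
is PySpark-internal, and it only checks the whole target_path rather than
individual files):

    # Skip the write by hand if the target path already exists. This checks
    # the root path only, which is coarser than the per-file behaviour I'd want.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    path = spark.sparkContext._jvm.org.apache.hadoop.fs.Path(target_path)
    fs = path.getFileSystem(hadoop_conf)
    if not fs.exists(path):
        df.write \
            .mode('append') \
            .partitionBy('p') \
            .orc(target_path)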


Here's an example scenario that I'm not sure about:

Let's say I have parallelism=2, and there's data in both splits, all going to
the same output partition. So the expected result after the job has run is
that two files have been created:

target_path/p=1/output-abc-1.orc
target_path/p=1/output-abc-2.orc
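
(To make the scenario concrete, roughly this kind of setup; the column names
and values are just for illustration:)

    # Two splits (parallelism=2), all rows landing in the same output
    # partition p=1, so each task writes its own ORC file under target_path/p=1/.
    df = spark.createDataFrame(
        [(1, 'a'), (1, 'b')],
        ['p', 'value']
    ).repartition(2)

    df.write \
        .mode('ignore') \
        .partitionBy('p') \
        .orc(target_path)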

I know that this works in the normal case. But is there something special
to know in case of any failed task attempts?

For example, consider that the first task tries to write, but fails for some
reason, even though the file is successfully created at:

target_path/p=1/output-abc-1.orc

Then the task is retried, but it turns out that actually the file was
already written. So writing it is ignored this time. All good so far.

Then there's another task that should write the 2nd file (out of 2):

target_path/p=1/output-abc-2.orc

Is this file written, or would it be ignored because target_path/p=1/
already exists?


Extract from Spark docs:

> Ignore mode means that when saving a DataFrame to a data source, if data
> already exists, the save operation is expected to not save the contents of
> the DataFrame and to not change the existing data.

So here the meaning of "data" is a bit ambiguous. I've read that for SQL
table writes with SaveMode.Ignore, the write is skipped entirely if the
table already exists. Does it have a different meaning for a regular
df.write?
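
For reference, these are the two call forms I'm comparing (the table name is
just an example):

    # Table-level ignore: as I understand it, the whole write is skipped if
    # the table already exists.
    df.write.mode('ignore').saveAsTable('my_table')

    # Path-based ignore (my case): it's unclear to me whether the unit being
    # checked is the root path, the partition directory, or the individual file.
    df.write.mode('ignore').partitionBy('p').orc(target_path)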


Thanks!
