spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Loughran <>
Subject Re: Spark <--> S3 flakiness
Date Thu, 18 May 2017 09:58:34 GMT

On 18 May 2017, at 05:29,<> wrote:

Steve, just to clarify:

"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on
high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random
option. "

Are you talking about the hadoop-aws lib or hadoop itself.  I see that spark is currently
only pre-built against hadoop 2.7.

all the hadoop JARs. It's a big move, and really I'd hold off for it except in the special
case : spark standalone on my desktop

Most of our failures are on write, the other fix I've seen advertised has been: "fileoutputcommitter.algorithm.version=2"

this eliminates the big rename() in job commit, renaming the work of individual tasks at the
end of each task commit.

It doesn't do anything for problems writing data, and it still has a fundamental flaw: to
rename everything in a "directory tree", you need to be able to list all objects under a path,
which utterly depends on consistent directory listings. Amazon S3 doesn't offer that: you
can create a file, then list the bucket *and not see the file*. Similarly, after deletion
it may be listed, but not be there any more. Without that consistent listing, you don't get
reliable renames, hence output.

It's possible that you may not even notice the fact that data hasn't been copied over.

Ryan's committer avoids this problem by using the local filesystem and HDFS cluster as the
consistent stores, and using uncompleted S3A multipart uploads to eliminate the rename at
the end

see also:

Still doing some reading and will start testing in the next day or so.



On 17 May 2017 at 03:19, Steve Loughran <<>>

On 17 May 2017, at 06:00,<> wrote:

Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!

FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance
reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random

That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions of CDH, even
if their docs don't mention it

On 16 May 2017 at 10:10, Steve Loughran <<>>

On 11 May 2017, at 06:07,<> wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps
and final output of parquet files.

Please don't, not without a committer specially written to work against S3 in the presence
of failures.You are at risk of things going wrong and you not even noticing.

The only one that I trust to do this right now is;

see also :

We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run nightly... And a fairly
random but small-ish percentage of them fail with the following classes of errors.

S3 write errors

"ERROR Utils: Aborting task Status Code: 404, AWS Service: Amazon S3,
AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error Message: Not Found, S3 Extended Request
ID: BlaBlahEtc="

"Py4JJavaError: An error occurred while calling o43.parquet.
: Status Code: 0, AWS Service:
null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could
not be deleted, S3 Extended Request ID: null"

S3 Read Errors:

[Stage 1:=================================================>       (27 + 4) / 31]17/05/10
16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11) Connection reset
at org.apache.http.conn.BasicManagedEntity.streamClosed(
at org.apache.http.conn.EofSensorInputStream.checkClose(
at org.apache.http.conn.EofSensorInputStream.close(
at org.apache.hadoop.fs.s3a.S3AInputStream.close(

We have literally tons of logs we can add but it would make the email unwieldy big.  If it
would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:

  *   spark-2.1.0-bin-hadoop2.7
  *   '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to play with. In
particular, in a close() call it reads to the end of the stream, which is a performance killer
on large files. That stack trace you see is from that same phase of operation, so should go
away too.

Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will probably cause
link errors.

Also: make sure Joda time >= 2.8.1 for Java 8

If you go up to 2.8.0, and you still see the errors, file something against HADOOP in JIRA

Given the stack overflow / googling I've been doing I know we're not the only org with these
issues but I haven't found a good set of solutions in those spaces yet.


Gary Lucas

View raw message