On 18 May 2017, at 05:29, lucas.gary@gmail.com wrote:

Steve, just to clarify:

"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option. "

Are you talking about the hadoop-aws lib or hadoop itself.  I see that spark is currently only pre-built against hadoop 2.7.

all the hadoop JARs. It's a big move, and really I'd hold off for it except in the special case : spark standalone on my desktop

Most of our failures are on write, the other fix I've seen advertised has been: "fileoutputcommitter.algorithm.version=2" 

this eliminates the big rename() in job commit, renaming the work of individual tasks at the end of each task commit.

It doesn't do anything for problems writing data, and it still has a fundamental flaw: to rename everything in a "directory tree", you need to be able to list all objects under a path, which utterly depends on consistent directory listings. Amazon S3 doesn't offer that: you can create a file, then list the bucket *and not see the file*. Similarly, after deletion it may be listed, but not be there any more. Without that consistent listing, you don't get reliable renames, hence output.

It's possible that you may not even notice the fact that data hasn't been copied over. 

Ryan's committer avoids this problem by using the local filesystem and HDFS cluster as the consistent stores, and using uncompleted S3A multipart uploads to eliminate the rename at the end


see also: https://www.youtube.com/watch?v=8F2Jqw5_OnI&feature=youtu.be

Still doing some reading and will start testing in the next day or so.



On 17 May 2017 at 03:19, Steve Loughran <stevel@hortonworks.com> wrote:

On 17 May 2017, at 06:00, lucas.gary@gmail.com wrote:

Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!

FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option. 

That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions of CDH, even if their docs don't mention it

On 16 May 2017 at 10:10, Steve Loughran <stevel@hortonworks.com> wrote:

On 11 May 2017, at 06:07, lucas.gary@gmail.com wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files.  

Please don't, not without a committer specially written to work against S3 in the presence of failures.You are at risk of things going wrong and you not even noticing.

The only one that I trust to do this right now is; https://github.com/rdblue/s3committer

We're running into the following issues on a semi regular basis: 
* These are intermittent errors, IE we have about 300 jobs that run nightly... And a fairly random but small-ish percentage of them fail with the following classes of errors.

S3 write errors

"ERROR Utils: Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
"Py4JJavaError: An error occurred while calling o43.parquet.
: com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could not be deleted, S3 Extended Request ID: null"

S3 Read Errors:

[Stage 1:=================================================>       (27 + 4) / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)

We have literally tons of logs we can add but it would make the email unwieldy big.  If it would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:
  • spark-2.1.0-bin-hadoop2.7
  • '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to play with. In particular, in a close() call it reads to the end of the stream, which is a performance killer on large files. That stack trace you see is from that same phase of operation, so should go away too.

Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will probably cause link errors. 

Also: make sure Joda time >= 2.8.1 for Java 8

If you go up to 2.8.0, and you still see the errors, file something against HADOOP in JIRA

Given the stack overflow / googling I've been doing I know we're not the only org with these issues but I haven't found a good set of solutions in those spaces yet.


Gary Lucas