On 18 May 2017, at 05:29, email@example.com wrote:
Steve, just to clarify:
"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option."
Are you talking about the hadoop-aws lib or Hadoop itself? I see that Spark is currently only pre-built against Hadoop 2.7.
Most of our failures are on write; the other fix I've seen advertised is "fileoutputcommitter.algorithm.version=2".
Still doing some reading and will start testing in the next day or so.
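For anyone following along, here is a rough sketch of how the committer setting quoted above might be passed to Spark. It assumes the full Hadoop property name is mapreduce.fileoutputcommitter.algorithm.version and relies on Spark forwarding any "spark.hadoop."-prefixed property to the Hadoop configuration. The helper names are made up for illustration. Note that this only sets the option; it does not address the commit-safety concerns raised elsewhere in this thread.

```python
# Sketch only: helper names are hypothetical, not part of Spark or Hadoop.

def committer_v2_conf():
    """Spark property selecting version 2 of the FileOutputCommitter algorithm."""
    # Assumption: the short name quoted in the thread expands to this key.
    return {"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2"}


def to_submit_flags(props):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    flags = []
    for key in sorted(props):
        flags += ["--conf", "%s=%s" % (key, props[key])]
    return flags


if __name__ == "__main__":
    # Prints the flags on one line, ready to paste after "spark-submit".
    print(" ".join(to_submit_flags(committer_v2_conf())))
```

The same dict could instead be applied through SparkConf or SparkSession.builder.config() inside the job.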
On 17 May 2017 at 03:19, Steve Loughran <firstname.lastname@example.org> wrote:
On 17 May 2017, at 06:00, email@example.com wrote:
Steve, thanks for the reply. Digging through all the documentation now.
FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the fs.s3a.experimental.fadvise=random option.
That's in Apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions of CDH, even if their docs don't mention it.
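A minimal sketch of how the fadvise option above might be set from PySpark, assuming a Hadoop 2.8+ S3A client is on the classpath. One caveat: the message writes the key as fs.s3a.experimental.fadvise, while the Hadoop S3A documentation, as far as I know, spells it fs.s3a.experimental.input.fadvise, so the key is a parameter here and should be checked against your Hadoop version. The function name and the policy list are illustrative assumptions.

```python
# Sketch only: s3a_read_conf is a hypothetical helper, not a Spark/Hadoop API.

# Input policies believed to be accepted by the Hadoop 2.8 S3A client.
FADVISE_POLICIES = ("normal", "sequential", "random")


def s3a_read_conf(policy="random", key="fs.s3a.experimental.input.fadvise"):
    """Return Spark properties that select an S3A input (seek) policy.

    Spark copies every "spark.hadoop.*" property into the Hadoop
    Configuration, which is how the S3A client gets to see the option.
    """
    if policy not in FADVISE_POLICIES:
        raise ValueError("unknown fadvise policy: %r" % (policy,))
    return {"spark.hadoop." + key: policy}


# Usage (requires pyspark, so left as comments):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder
# for k, v in s3a_read_conf().items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```

The "random" policy is the one aimed at columnar formats such as Parquet, where the reader seeks around the file rather than streaming it end to end.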
On 16 May 2017 at 10:10, Steve Loughran <firstname.lastname@example.org> wrote:
On 11 May 2017, at 06:07, email@example.com wrote:
Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files.
Please don't, not without a committer specially written to work against S3 in the presence of failures. You are at risk of things going wrong and you not even noticing.
The only one that I trust to do this right now is: https://github.com/rdblue/
We're running into the following issues on a semi-regular basis. These are intermittent errors, i.e. we have about 300 jobs that run nightly, and a fairly random but small-ish percentage of them fail with the following classes of errors.
S3 write errors
"ERROR Utils: Aborting task
l.AmazonS3Exception: Status Code: 404, AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
"Py4JJavaError: An error occurred while calling o43.parquet.
l.MultiObjectDeleteException: Status Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error Message: One or more objects could not be deleted, S3 Extended Request ID: null"
S3 Read Errors:
17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 11)
java.net.SocketException: Connection reset
ctSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
We have literally tons of logs we can add, but it would make the email unwieldy. If it would be helpful I'll drop them in a pastebin or something.
Our config is along the lines of:
- '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to play with. In particular, in a close() call it reads to the end of the stream, which is a performance killer on large files. That stack trace you see is from that same phase of operation, so should go away too.
Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will probably cause link errors.
Also: make sure Joda time >= 2.8.1 for Java 8
If you go up to 2.8.0, and you still see the errors, file something against HADOOP in JIRA
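As a sketch of the advice above, the --packages line could be rebuilt with matching versions, here hadoop-aws 2.7.3 paired with AWS SDK 1.7.4 as stated in this reply. It assumes the original config string is fed to PySpark through the PYSPARK_SUBMIT_ARGS environment variable; the helper names are hypothetical, and the Joda-Time requirement is not handled here.

```python
import os

# Sketch only: helper names are hypothetical.


def s3a_packages(hadoop_version="2.7.3", aws_sdk_version="1.7.4"):
    """Maven coordinates for the S3A client and the AWS SDK it was built against."""
    return ",".join([
        "com.amazonaws:aws-java-sdk:%s" % aws_sdk_version,
        "org.apache.hadoop:hadoop-aws:%s" % hadoop_version,
    ])


def pyspark_submit_args(packages):
    """Build the argument string in the same shape as the config quoted above."""
    return "--packages %s pyspark-shell" % packages


def configure_env(environ=None):
    """Set PYSPARK_SUBMIT_ARGS; call before the first SparkContext is created."""
    if environ is None:
        environ = os.environ
    environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(s3a_packages())
    return environ["PYSPARK_SUBMIT_ARGS"]
```

Keeping the two versions in one function makes it harder to bump hadoop-aws without also revisiting the SDK version, which is the mismatch warned about above.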
Given the Stack Overflow / googling I've been doing, I know we're not the only org with these issues, but I haven't found a good set of solutions in those spaces yet.