From: Steve Loughran
To: Vadim Semenov
CC: "everett@nuna.com", user@spark.apache.org
Subject: Re: Spark, S3A, and 503 SlowDown / rate limit issues
Date: Thu, 6 Jul 2017 14:08:51 +0000

On 5 Jul 2017, at 14:40, Vadim Semenov <vadim.semenov@datadoghq.com> wrote:

Are you sure that you're using S3A?
Because EMR says that they do not support S3A:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

> Amazon EMR does not currently support use of the Apache Hadoop S3A file system.
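One quick way to check which filesystem class actually serves the s3a:// scheme is the sketch below (assumptions: an active SparkSession named `spark` and an arbitrary bucket name):

    import java.net.URI
    import org.apache.hadoop.fs.FileSystem

    // Sketch: prints the class registered for the s3a:// scheme, e.g.
    // org.apache.hadoop.fs.s3a.S3AFileSystem on stock Hadoop, or something
    // else if EMR has remapped the scheme.
    val fs = FileSystem.get(new URI("s3a://some-bucket/"),
      spark.sparkContext.hadoopConfiguration)
    println(fs.getClass.getName)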

I think that the HEAD requests come from the `createBucketIfNotExists` call in the AWS S3 library, which checks whether the bucket exists every time you do a PUT request, i.e. it issues a HEAD request.

You can disable that by setting `fs.s3.buckets.create.enabled` to `false`:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-upload-s3.html
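A minimal sketch of passing that property to a job through Spark's `spark.hadoop.` prefix (setting it in the EMR `emrfs-site` classification is another option; the property name is the one quoted above):

    import org.apache.spark.sql.SparkSession

    // Sketch only: forwards the EMRFS property mentioned above into the
    // Hadoop configuration via Spark's spark.hadoop.* prefix.
    val spark = SparkSession.builder()
      .appName("disable-bucket-existence-check")
      .config("spark.hadoop.fs.s3.buckets.create.enabled", "false")
      .getOrCreate()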



Yeah, I'd like to see the stack traces before blaming S3A and the ASF codebase.

One thing I do know is that the shipping S3A client doesn't have any explicit handling of 503/retry events; that work is tracked in https://issues.apache.org/jira/browse/HADOOP-14531

There is some retry logic in bits of the AWS SDK related to file upload: that may log and retry, but in all the operations listing files, getting their details, etc., there is no resilience to throttling.
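For illustration only (this is not code from S3A or the AWS SDK), the kind of backoff-and-retry resilience being discussed might look like the sketch below; `s3Client` and the wrapped call are hypothetical:

    import com.amazonaws.services.s3.model.AmazonS3Exception

    // Hypothetical sketch: retry with exponential backoff around a call that
    // may be throttled with 503 SlowDown. Not what S3A ships today.
    def withBackoff[T](attemptsLeft: Int = 5, delayMs: Long = 200)(op: => T): T =
      try op
      catch {
        case e: AmazonS3Exception if e.getStatusCode == 503 && attemptsLeft > 1 =>
          Thread.sleep(delayMs)                           // back off before retrying
          withBackoff(attemptsLeft - 1, delayMs * 2)(op)  // double the delay each attempt
      }

    // e.g. val listing = withBackoff() { s3Client.listObjectsV2("my-bucket") }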

If it is surfacing against s3a, there isn't anything which can immediately be done to fix it, other than "spread your data around more buckets". Do attach the stack trace you get under https://issues.apache.org/jira/browse/HADOOP-14381 though: I'm about half-way through the resilience code (& the fault injection needed to test it). The more places where I can see problems arise, the more confident I can be that those codepaths will be resilient.


On Thu, Jun 29, 2017 at 4:56 PM, Everett Anderson <everett@nuna.com.invalid> wrote:
Hi,

We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O from/to S3 from our Spark jobs. We set mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted S3 buckets.
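A rough sketch of that kind of session setup (the server-side-encryption property name is an assumption on our side; the committer setting is the one described above):

    import org.apache.spark.sql.SparkSession

    // Sketch of the configuration described above: v2 file output committer
    // plus S3A server-side encryption (SSE-S3; exact property name assumed).
    val spark = SparkSession.builder()
      .appName("s3a-direct-io-job")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
      .getOrCreate()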

This has been working fine for us but, perhaps because we've been running more jobs in parallel, we've started getting errors like:

Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SlowDown, AWS Error Message: Please reduce your request rate., S3 Extended Request ID: ...

We enabled CloudWatch S3 request metrics for one of our buckets and I was a little alarmed to see spikes of over 800k S3 requests over a minute or so, with the bulk of them being HEAD requests.

We read and write Parquet files, and most tables have around 50 shards/parts, though some have up to 200. I imagine there's additional parallelism when reading a single Parquet shard, as well.
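A hypothetical shape for such a job (bucket, paths, and column names invented; reuses the `spark` session sketched above). Each task that opens a Parquet part file does its own getFileStatus/open calls through S3A, which surface as HEAD (and list) requests, so many concurrent tasks can add up to a large request rate:

    // Hypothetical job shape; paths and columns are invented.
    val events = spark.read.parquet("s3a://my-bucket/warehouse/events/")  // ~50-200 part files
    val counts = events.groupBy("event_type").count()                     // placeholder transform
    counts.write.mode("overwrite").parquet("s3a://my-bucket/warehouse/event_counts/")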

Has anyone else encountered this? How did you solve it?

I'd sure prefer to avoid copying all our data in and out of HDFS for each job, if possible.

Thanks!


