We're running Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR, and our Spark jobs read from and write to S3 directly via S3A. We set mapreduce.fileoutputcommitter.algorithm.version=2 and use encrypted S3 buckets.
This had been working fine for us, but recently, perhaps because we've been running more jobs in parallel, we've started getting errors like:
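For reference, this is roughly how the committer setting above reaches a job. It's a minimal sketch: Spark forwards any property prefixed with `spark.hadoop.` to the Hadoop configuration, and the helper name `to_spark_submit_args` is made up for illustration. The encryption setup is left out because it depends on how the buckets are configured.

```python
# Hadoop properties are passed to Spark jobs via the "spark.hadoop." prefix.
# Only the committer setting from the post is listed; encryption is omitted.
HADOOP_SETTINGS = {
    "mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def to_spark_submit_args(hadoop_settings):
    """Render Hadoop properties as spark-submit --conf arguments.

    Hypothetical helper for illustration, not part of Spark.
    """
    args = []
    for key, value in sorted(hadoop_settings.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted after "spark-submit".
    print(" ".join(to_spark_submit_args(HADOOP_SETTINGS)))
```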
Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: SlowDown, AWS Error Message: Please reduce your request rate., S3 Extended Request ID: ...
We enabled CloudWatch S3 request metrics for one of our buckets, and I was a little alarmed to see spikes of over 800k S3 requests within a minute or so, the bulk of them HEAD requests.
We read and write Parquet files, and most tables have around 50 shards/parts, though some have up to 200. I imagine there's additional parallelism when reading a shard in Parquet, though.
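To put those numbers in perspective, here's a back-of-the-envelope calculation. It's only a rough sketch: it treats the spike as exactly 800k requests in exactly 60 seconds, and it assumes (unrealistically, since several jobs run in parallel) that the whole spike came from a single table, so the per-part figures are upper bounds. The function names are just for illustration.

```python
SPIKE_REQUESTS = 800_000  # "over 800k" from the CloudWatch metrics
SPIKE_SECONDS = 60        # "a minute or so", taken as exactly one minute


def average_rate(requests, seconds):
    """Average requests per second over the spike window."""
    return requests / seconds


def requests_per_part(requests, parts):
    """Requests per part if a single table caused the whole spike."""
    return requests / parts


if __name__ == "__main__":
    rate = average_rate(SPIKE_REQUESTS, SPIKE_SECONDS)
    print(f"average rate: {rate:,.0f} requests/s")
    # 50 parts is typical for our tables, 200 is the largest.
    for parts in (50, 200):
        per_part = requests_per_part(SPIKE_REQUESTS, parts)
        print(f"{parts} parts: {per_part:,.0f} requests per part")
```

Even with 200 parts that comes to thousands of requests per part, which seems like far more than reading or writing the files themselves should need.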
Has anyone else encountered this? How did you solve it?
I'd sure prefer to avoid copying all our data in and out of HDFS for each job, if possible.