hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13868) S3A should configure multi-part copies and uploads separately
Date Tue, 06 Dec 2016 17:36:58 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Mackrory updated HADOOP-13868:
-----------------------------------
    Description: 
I've been looking at a big performance regression when writing to S3 from Spark that appears
to have been introduced with HADOOP-12891.

In the Amazon SDK, the default threshold for multi-part copies is 320x the threshold for multi-part
uploads (and the block size is 20x bigger), so I don't think it's necessarily wise for us
to have them be the same.
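
For reference, a quick sketch of where those ratios come from, assuming the AWS SDK for Java
TransferManagerConfiguration defaults as I read them (16 MB upload threshold, 5 GB copy threshold,
5 MB upload part size, 100 MB copy part size). These numbers are for illustration only and aren't
pulled from the Hadoop code:

{code:java}
// Illustration only: assumed AWS SDK TransferManager defaults (my reading), showing
// how the 320x and 20x ratios mentioned above fall out of them.
public class SdkDefaultRatios {
  public static void main(String[] args) {
    final long MB = 1024L * 1024L;
    final long uploadThreshold = 16 * MB;         // assumed default multipart upload threshold
    final long copyThreshold   = 5L * 1024 * MB;  // assumed default multipart copy threshold (5 GB)
    final long uploadPartSize  = 5 * MB;          // assumed default minimum upload part size
    final long copyPartSize    = 100 * MB;        // assumed default copy part size (104857600 bytes)

    System.out.println("copy vs. upload threshold: " + (copyThreshold / uploadThreshold) + "x"); // 320x
    System.out.println("copy vs. upload part size: " + (copyPartSize / uploadPartSize) + "x");   // 20x
  }
}
{code}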

I did some quick tests, and the sweet spot where multi-part copies start being faster seems to
be around 512 MB. The difference wasn't as significant, but using 104857600 (100 MB, Amazon's
default) for the block size was also slightly better.

I propose we do the following, although they're independent decisions:

(1) Split the configuration. Ideally, I'd like to have fs.s3a.multipart.copy.threshold and
fs.s3a.multipart.upload.threshold (and corresponding properties for the block size); see the
sketch after this list. But then there's the question of what to do with the existing
fs.s3a.multipart.* properties. Deprecate them? Or keep them as a short-hand for configuring
both, overridden by the more specific properties?

(2) Consider increasing the default values. In my tests, 256 MB seemed to be where multipart
uploads came into their own, and 512 MB was where multipart copies started outperforming the
alternative. Would be interested to hear what other people have seen.
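
To make (1) and (2) concrete, here is a minimal sketch of what the split settings could look
like from a Hadoop Configuration. Note that the .copy.threshold and .upload.threshold keys are
only the names proposed above (they don't exist yet), and the values are just the rough numbers
from my tests:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: the two *.threshold keys below are the proposed names from (1),
// not existing properties; the values are the rough numbers from my tests in (2).
public class S3aMultipartSplitSketch {
  public static void main(String[] args) {
    final long MB = 1024L * 1024L;
    Configuration conf = new Configuration();
    conf.setLong("fs.s3a.multipart.upload.threshold", 256 * MB); // uploads started winning around 256 MB
    conf.setLong("fs.s3a.multipart.copy.threshold", 512 * MB);   // copies started winning around 512 MB
    conf.setLong("fs.s3a.multipart.size", 100 * MB);             // 104857600, Amazon's default block size
  }
}
{code}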

  was:
I've been looking at a big performance regression when writing to S3 from Spark that appears
to have been introduced with HADOOP-12891.

In the Amazon SDK, the default threshold for multi-part copies is 320x the threshold for multi-part
uploads (and the block size is 20x bigger), so I don't think it's wise for us 

I did some quick tests and it seems to me the sweet spot when multi-part copies start being
faster is around 512MB. It wasn't as significant, but using 104857600 (Amazon's default) for
the blocksize was also slightly better.

I propose we do the following, although they're independent.

(1) Split the configuration. Ideally, I'd like to have fs.s3a.multipart.copy.threshold and
fs.s3a.multipart.upload.threshold (and corresponding properties for the block size). But then
there's the question of what to do with the existing fs.s3a.multipart.* properties. Deprecation?
Leave it as a short-hand for configuring both (that's overridden by the more specific properties?).

(2) Consider increasing the default values. In my tests, 256 MB seemed to be where multipart
uploads came into their own, and 512 MB was where multipart copies started outperforming the
alternative. Would be interested to hear what other people have seen.


> S3A should configure multi-part copies and uploads separately
> -------------------------------------------------------------
>
>                 Key: HADOOP-13868
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13868
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.7.0, 3.0.0-alpha1
>            Reporter: Sean Mackrory
>            Assignee: Sean Mackrory
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

