hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-15192) S3A listStatus excessively slow -hurts Spark job partitioning
Date Fri, 26 Jan 2018 22:23:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Steve Loughran updated HADOOP-15192:
    Environment: Amazon EMR

> S3A listStatus excessively slow -hurts Spark job partitioning
> -------------------------------------------------------------
>                 Key: HADOOP-15192
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15192
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.7.3
>         Environment: Amazon EMR
>            Reporter: Michel Lemay
>            Priority: Minor
>             Fix For: 2.8.0
> Symptoms:
>  - CloudWatch Metrics for S3 showing an unexpectedly large number of 4xx errors in our
>  - Performance when listing files recursively is abysmal (15 minutes on our bucket, compared
to less than 2 minutes using the CLI `aws s3 ls`)
> Analysis:
>  - In CloudTrail logs for this bucket, we found that it generates one 404 (NoSuchKey)
error per folder listed recursively.
>  - Spark recursively calls FileSystem::listStatus (the S3AFileSystem implementation from
hadoop-aws:2.7.3), which in turn calls getFileStatus to determine whether the path is a directory.
>  - It turns out that this call to getFileStatus yields a 404 when the path is a directory
but does not end with a slash. It then retries with the slash appended (incurring one extra,
unneeded call to S3).
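
The lookup order described in the analysis can be modelled with a small toy class. This is only a sketch, not the actual S3AFileSystem code: the class `ProbeSketch`, its in-memory key set, and the assumption that every directory has a marker object whose key ends in a slash are all hypothetical and exist only for illustration.

```java
import java.util.TreeSet;

/** Toy model of the probe order described above; not the real S3AFileSystem logic. */
class ProbeSketch {
    // Hypothetical in-memory stand-in for a bucket: a sorted set of object keys.
    final TreeSet<String> keys = new TreeSet<>();
    // Number of simulated 404 (NoSuchKey) responses seen so far.
    int notFound = 0;

    ProbeSketch(String... initial) {
        for (String k : initial) {
            keys.add(k);
        }
    }

    /** Simulated lookup of one exact key; a miss is counted as a 404. */
    boolean head(String key) {
        if (keys.contains(key)) {
            return true;
        }
        notFound++;
        return false;
    }

    /** Classifies a path given without a trailing slash, as a normalized Path would be. */
    String probe(String path) {
        if (head(path)) {
            return "file";          // first request: the key exactly as given
        }
        if (head(path + "/")) {
            return "directory";     // second request: retry with the slash appended
        }
        return "missing";
    }

    public static void main(String[] args) {
        ProbeSketch bucket = new ProbeSketch(
                "data/", "data/year=2018/", "data/year=2018/part-0.parquet");
        for (String p : new String[] {"data", "data/year=2018"}) {
            System.out.println(p + " -> " + bucket.probe(p));
        }
        // One miss per directory probed, matching the one-404-per-folder observation.
        System.out.println("404 count: " + bucket.notFound);
    }
}
```

In this model every directory costs exactly one failed lookup before the successful one, so the number of 404s grows linearly with the number of folders visited.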
> Questions:
>  - Why is this trailing slash removed in the first place? (The Hadoop Path class normalizes
paths by removing trailing slashes when constructed.)
>  - S3AFileSystem::listStatus needs to know whether the path is a directory. However, it is
a common usage pattern to already have that FileStatus object in hand when recursively listing
files, so the extra lookup incurs an unneeded performance penalty. The base FileSystem class could
offer an optimized API that uses this assumption (or fix listLocatedStatus(recursive=true)'s
unoptimized call to listStatus).
>  - I might be wrong on this last bullet, but I think the S3 object API will fetch every object
under a prefix (not just the current level) and filter them out. If that is the case, there
should be opportunities for an efficient recursive listStatus implementation for S3 using
paginated calls on the top-level folder only.
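
The idea in the last bullet, a recursive listing built from paginated calls against the top-level prefix only, might look roughly like the sketch below. It relies on the fact that an S3 list request without a delimiter returns every key under the prefix, one page at a time. Everything else is an assumption for illustration: `FlatListingSketch`, `listPage`, and the marker-based paging are hypothetical stand-ins over an in-memory key set, not S3 or Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a flat, paginated prefix listing; not the S3 or Hadoop API. */
class FlatListingSketch {
    // Hypothetical in-memory bucket: object keys kept in sorted order.
    final TreeSet<String> keys = new TreeSet<>();
    // Number of simulated list requests issued.
    int requests = 0;

    FlatListingSketch(String... initial) {
        for (String k : initial) {
            keys.add(k);
        }
    }

    /** One simulated list call: up to pageSize keys under prefix, after marker (null = start). */
    List<String> listPage(String prefix, String marker, int pageSize) {
        requests++;
        List<String> page = new ArrayList<>();
        String start = (marker == null) ? prefix : marker;
        boolean inclusive = (marker == null);
        for (String k : keys.tailSet(start, inclusive)) {
            // Keys sharing a prefix are contiguous in sorted order, so stop at the first outsider.
            if (!k.startsWith(prefix) || page.size() == pageSize) {
                break;
            }
            page.add(k);
        }
        return page;
    }

    /** Recursive listing using only paginated calls on the top-level prefix. */
    List<String> listRecursive(String prefix, int pageSize) {
        List<String> all = new ArrayList<>();
        String marker = null;
        while (true) {
            List<String> page = listPage(prefix, marker, pageSize);
            all.addAll(page);
            if (page.size() < pageSize) {
                return all;                       // short page: nothing left to fetch
            }
            marker = page.get(page.size() - 1);   // resume after the last key seen
        }
    }

    public static void main(String[] args) {
        FlatListingSketch bucket = new FlatListingSketch(
                "t/a=1/f1", "t/a=1/f2", "t/a=2/f3", "u/x");
        System.out.println(bucket.listRecursive("t/", 2));
        System.out.println("list requests: " + bucket.requests);
    }
}
```

In this model the number of requests depends on the number of objects divided by the page size, not on the number of folders, and no per-directory probe is needed.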
> Note: all this is in the context of Spark jobs reading hundreds of thousands of Parquet
files organized and partitioned hierarchically as recommended. Every time we read them, Spark
recursively lists all files and folders to discover the partitions (folder names).
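
Since partitions are just folder names, they could in principle be recovered from a flat list of file keys without probing each folder. The sketch below shows one way to do that. The class `PartitionSketch`, the `name=value` folder naming, and the requirement that `root` ends with a slash are assumptions made for the example, and this is not how Spark itself discovers partitions.

```java
import java.util.TreeSet;

/** Toy partition discovery from flat file keys; an illustration, not Spark's logic. */
class PartitionSketch {
    /**
     * Collects every distinct folder below root that contains one of the given files.
     * root is assumed to end with '/'.
     */
    static TreeSet<String> partitionDirs(String root, String... fileKeys) {
        TreeSet<String> dirs = new TreeSet<>();
        for (String key : fileKeys) {
            if (!key.startsWith(root)) {
                continue;   // ignore keys outside the table root
            }
            int slash = key.lastIndexOf('/');
            // Walk up from the file's parent folder towards the root, recording each level.
            while (slash >= root.length()) {
                dirs.add(key.substring(0, slash));
                slash = key.lastIndexOf('/', slash - 1);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical layout with name=value partition folders.
        System.out.println(partitionDirs("t/",
                "t/y=2018/m=01/a.parquet",
                "t/y=2018/m=02/b.parquet"));
    }
}
```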
