hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13430) Optimize and fix getFileStatus in S3A
Date Wed, 27 Jul 2016 10:35:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395406#comment-15395406 ]

Steve Loughran commented on HADOOP-13430:

(I've just moved this under the S3A phase III JIRA, i.e. the stuff for Hadoop 2.9)

Regarding the feature, yes, we need it. You can see from the metrics we're collecting how
expensive getFileStatus is.

FWIW I did play with this, reordering the operations, but things didn't work so I didn't
create a JIRA. That's "didn't work" in the sense of failed assertions rather than performance
problems, so probably a bug in my edit.

There are a couple of other optimisation points to consider too:

# Sometimes S3A checks internally for directories (e.g. mkdirs). It may be able to use some
knowledge of what is above/below in the tree to make better decisions, or at least look for less information.
Example: if looking to see if there is a fake directory, there's no need to look for a non-fake one.

# Sometimes the getFileStatus call is followed immediately (if the path is a directory) by a listStatus call.
Examples: rename(), delete(). In these situations, we ought to be able to ask for a bigger
list in getFileStatus and feed the result straight into the next stage of the work. We'd
get a bigger result back from that first list, but a whole list call could be eliminated.
But that strategy of dropping the delimiter is potentially dangerous; it depends on
which call is happening.
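
As a rough illustration of the second point, here is a toy Python model (not S3A code) that counts simulated LIST requests. ToyStore, children_two_step and children_one_step are hypothetical names, the 1000-key page size is an assumption, and paging beyond a single page is ignored.

```python
class ToyStore:
    """In-memory stand-in for a bucket: a set of keys plus a request counter."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.requests = 0

    def list(self, prefix, max_keys):
        # One simulated LIST with no delimiter: every key under the prefix,
        # in sorted order, truncated to max_keys.
        self.requests += 1
        return sorted(k for k in self.keys if k.startswith(prefix))[:max_keys]


def children_two_step(store, key):
    """Probe with max-keys=1, then issue a second LIST for the real work."""
    prefix = key + "/"
    if not store.list(prefix, 1):
        return []
    return store.list(prefix, 1000)


def children_one_step(store, key):
    """Probe with a full page and reuse that page as the listing."""
    return store.list(key + "/", 1000)
```

Because the toy list has no delimiter, it returns every descendant of the prefix. That suits a recursive delete or rename, but it would not suit a non-recursive listing, which is one way to read the caveat about dropping the delimiter.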

> Optimize and fix getFileStatus in S3A
> -------------------------------------
>                 Key: HADOOP-13430
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13430
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steven K. Wong
>            Priority: Minor
> Currently, S3AFileSystem.getFileStatus(Path f) sends up to 3 requests to S3 when pathToKey(f)
> = key = "foo/bar" is a directory:
> 1. HEAD key=foo/bar [continue if not found]
> 2. HEAD key=foo/bar/ [continue if not found]
> 3. LIST prefix=foo/bar/ delimiter=/ max-keys=1
> My experience (and generally true, I reckon) is that almost all directories are nonempty
> directories without a "fake directory" file (e.g. "foo/bar/"). Under this condition, request
> #2 is mostly unhelpful; it only slows down getFileStatus. Therefore, I propose swapping the
> order of requests #2 and #3.
> Furthermore, when key = "foo/bar" is a nonempty directory that contains a "fake directory"
> file (in addition to actual files), getFileStatus currently returns an S3AFileStatus with
> isEmptyDirectory=true, which is wrong. Swapping will fix this. The swapped LIST request will
> use max-keys=2 to determine isEmptyDirectory correctly. The swapped HEAD request will be skipped
> if the directory is empty. (Removing the delimiter from the LIST request should make the logic
> a little simpler than otherwise.)
> Note that key = "foo/bar/" has the same problem with isEmptyDirectory. To fix it, I propose
> skipping request #1 when key ends with "/". The price is this will, for an empty directory,
> replace a HEAD request with a LIST request that's generally more taxing on S3.
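
The request orderings described in the issue can be compared with a toy Python model. This is only a sketch under stated assumptions, not the S3AFileSystem implementation: ToyStore, Status, current_probe and proposed_probe are hypothetical names, the store is a strongly consistent in-memory set, and the delimiter is left out of the current-order LIST because it does not change whether anything is found.

```python
from dataclasses import dataclass


class ToyStore:
    """In-memory stand-in for a bucket: a set of keys plus a request counter."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.requests = 0

    def head(self, key):
        self.requests += 1
        return key in self.keys

    def list(self, prefix, max_keys):
        # Simulated LIST with no delimiter, sorted and truncated to max_keys.
        self.requests += 1
        return sorted(k for k in self.keys if k.startswith(prefix))[:max_keys]


@dataclass
class Status:
    is_dir: bool
    is_empty_dir: bool = False


def current_probe(store, key):
    """Order described in the issue: HEAD key, HEAD key/, LIST key/."""
    if store.head(key):
        return Status(is_dir=False)
    if store.head(key + "/"):
        # A marker object is taken to mean "empty", even if files exist.
        return Status(is_dir=True, is_empty_dir=True)
    if store.list(key + "/", 1):
        return Status(is_dir=True, is_empty_dir=False)
    return None


def proposed_probe(store, key):
    """Proposed order: HEAD key (skipped for "key/"), LIST max-keys=2, HEAD key/."""
    if not key.endswith("/") and store.head(key):
        return Status(is_dir=False)
    prefix = key if key.endswith("/") else key + "/"
    found = store.list(prefix, 2)
    if found:
        # Empty only if the marker object is the sole key under the prefix.
        return Status(is_dir=True, is_empty_dir=(found == [prefix]))
    # Kept to mirror the swapped order; it cannot succeed in this toy store.
    if store.head(prefix):
        return Status(is_dir=True, is_empty_dir=True)
    return None
```

In this model a nonempty directory without a marker object costs three requests in the current order and two in the proposed one, and a directory holding both a marker and real files is no longer reported as empty.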
