hadoop-common-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9981) Listing in RawLocalFileSystem is inefficient
Date Wed, 25 Sep 2013 01:59:03 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777066#comment-13777066 ]

Andrew Wang commented on HADOOP-9981:

Hi Colin, thanks for the patch. Review as follows:

Nitty: lines longer than 80 chars:
        if ((componentIdx < components.size() - 1) && (!globFilter.hasPattern()))
            FileStatus childStatus = getFileStatus(new Path(candidate.getPath(), component));

* Let's use @Ignore annotations on the tests instead of removing them; I assume we want to
add them back in eventually?
* I think we have an existing bug in the paths of the returned FileStatus. When going through
a glob, it sets the path to the built-up path, which can include symlinks, while for a non-glob
it's using {{getFileStatus}}, which has a resolved path. I'm pretty sure FileStatus objects are
supposed to have a resolved path. This is complicated by how PathFilter still needs to compare
against the complete built-up path; maybe we could do something like:
        if (filter.accept(new Path(prefix, status.getPath().getName()))) {
* Our symlink resolution right now is inconsistent: listStatus does not resolve results, getFileStatus
does. Shouldn't this be getFileLinkStatus? Or are we waiting to fix this again in HDFS-9877
when it gets recommitted? I know HADOOP-9972 with the new APIs is coming down the pipe, so
I just wanted to bring this up.
* I'd like to see tests that would have caught these correctness concerns: that resolved paths
are returned correctly (with and without a wildcard), that PathFilters are matching against
built-up paths as expected (with and without wildcards), and the looping {{/a/b -> ..}}
symlink case you mentioned in a comment. Whether it's a terminal or intermediate wildcard
also matters here. There are unfortunately a lot of edge cases.
* Also noticed that we have a little duplication in TestGlobPaths: {{trueFilter}} is the same
as {{AcceptAllFilter}}. {{AcceptPathsEndingInZ}} is also only used in the removed test.
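
To make the PathFilter suggestion above concrete, here is a minimal, self-contained sketch
of the idea. It is not Hadoop code and not what the patch does: {{Status}}, {{name}}, {{join}}
and {{applyFilter}} are hypothetical stand-ins, plain strings replace Path, and a
java.util.function.Predicate replaces PathFilter. It only shows the filter seeing the
built-up path while the returned status keeps its resolved path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class GlobFilterSketch {
    /** Hypothetical stand-in for FileStatus: carries only the resolved path. */
    static final class Status {
        final String resolvedPath;
        Status(String resolvedPath) { this.resolvedPath = resolvedPath; }
    }

    /** Last path component, e.g. "/data/real/part-0" gives "part-0". */
    static String name(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }

    /** Joins a built-up prefix and a child name with exactly one separator. */
    static String join(String prefix, String child) {
        return prefix.endsWith("/") ? prefix + child : prefix + "/" + child;
    }

    /**
     * Tests the filter against prefix + child name (the built-up path, which
     * may contain symlinks) but returns the statuses untouched, so callers
     * still see resolved paths.
     */
    static List<Status> applyFilter(String prefix, List<Status> children,
                                    Predicate<String> filter) {
        List<Status> accepted = new ArrayList<>();
        for (Status child : children) {
            String builtUp = join(prefix, name(child.resolvedPath));
            if (filter.test(builtUp)) {
                accepted.add(child);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Assumption for the example: /link is a symlink to /data/real.
        List<Status> children = Arrays.asList(
            new Status("/data/real/part-0"),
            new Status("/data/real/part-1"));
        // The filter is written in terms of the path the user asked for.
        Predicate<String> filter = p -> p.equals("/link/part-0");
        for (Status s : applyFilter("/link", children, filter)) {
            System.out.println(s.resolvedPath); // prints /data/real/part-0
        }
    }
}
```

The point of the split is that a filter written against the user-supplied path keeps
matching, even though the path stored in each returned status is the resolved one.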
> Listing in RawLocalFileSystem is inefficient
> --------------------------------------------
>                 Key: HADOOP-9981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9981
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Kihwal Lee
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HADOOP-9981.001.patch, HADOOP-9981.002.patch
> After HADOOP-9652, listStatus() or globStatus() calls against a local file system directory
> are very slow.  A user was loading data from the local file system to Hive and it took about 30
> seconds. The same operation took less than a second pre-HADOOP-9652.
> The input path had many other files besides the input files, and strace showed a fork
> & exec of stat against each and every one of them. jstack confirmed that this was being
> done from getNativeFileLinkStatus().

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
