hadoop-common-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9981) Listing in RawLocalFileSystem is inefficient
Date Wed, 25 Sep 2013 01:59:03 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777066#comment-13777066 ]

Andrew Wang commented on HADOOP-9981:
-------------------------------------

Hi Colin, thanks for the patch. Review as follows:

Nitty: lines longer than 80 chars:
{code}
        if ((componentIdx < components.size() - 1) && (!globFilter.hasPattern())) {
...
            FileStatus childStatus = getFileStatus(new Path(candidate.getPath(), component));
{code}

* Let's use @Ignore annotations on the tests instead of removing them; I assume we want to
add them back in eventually?
* I think we have an existing bug in the paths of the returned FileStatus objects. When going
through a glob, the path is set to the built-up path, which can include symlinks, while the
non-glob case uses {{getFileStatus}}, which returns a resolved path. I'm pretty sure a FileStatus
is supposed to have a resolved path. This is complicated by the fact that PathFilter still needs
to compare against the complete built-up path; maybe we could do something like:
{code}
if (filter.accept(new Path(prefix, status.getPath().getName()))) {
{code}
* Our symlink resolution right now is inconsistent: listStatus does not resolve results, getFileStatus
does. Shouldn't this be getFileLinkStatus? Or are we waiting to fix this again in HDFS-9877
when it gets recommitted? I know HADOOP-9972 with the new APIs is coming down the pipe, so
I just wanted to bring this up.
* I'd like to see tests that would have caught these correctness concerns: that resolved paths
are returned correctly (with and without a wildcard), that PathFilters are matching against
built-up paths as expected (with and without wildcards), and the looping {{/a/b -> ..}}
symlink case you mentioned in a comment. Whether it's a terminal or intermediate wildcard
also matters here. There are unfortunately a lot of edge cases.
* Also noticed that we have a little duplication in TestGlobPaths: {{trueFilter}} is the same
as {{AcceptAllFilter}}. {{AcceptPathsEndingInZ}} is also only used in the removed test.
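
To make the PathFilter point above concrete, here is a minimal, Hadoop-free sketch of the idea: the returned status keeps its resolved path, while the filter is checked against the built-up path (prefix plus final name). Everything in it is an assumption for illustration only: {{GlobFilterSketch}}, {{Status}}, {{builtUpPath}} and {{accepts}} are hypothetical names, {{java.nio.file.Path}} and {{Predicate}} stand in for Hadoop's Path and PathFilter, and {{/link}} is assumed to be a symlink to {{/real/dir}}. It is not the code in the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Predicate;

// Illustrative sketch only; none of these names exist in Hadoop.
final class GlobFilterSketch {

    // Hypothetical stand-in for FileStatus: carries only the resolved path.
    static final class Status {
        final Path resolvedPath;

        Status(Path resolvedPath) {
            this.resolvedPath = resolvedPath;
        }
    }

    // Rebuilds the path as the caller spelled it: the built-up prefix
    // (which may contain symlinks) plus the final name of the resolved path.
    static Path builtUpPath(Path prefix, Status status) {
        return prefix.resolve(status.resolvedPath.getFileName());
    }

    // Applies the filter to the built-up path, not the resolved one,
    // leaving status.resolvedPath untouched for the caller.
    static boolean accepts(Predicate<Path> filter, Path prefix, Status status) {
        return filter.test(builtUpPath(prefix, status));
    }

    public static void main(String[] args) {
        // Assumption: /link is a symlink to /real/dir.
        Path prefix = Paths.get("/link");
        Status status = new Status(Paths.get("/real/dir/file.txt"));

        // A filter written in terms of the path the user asked for.
        Predicate<Path> underLink = p -> p.startsWith(Paths.get("/link"));

        System.out.println("filter on built-up path: "
            + accepts(underLink, prefix, status));
        System.out.println("filter on resolved path: "
            + underLink.test(status.resolvedPath));
    }
}
```

In this sketch the filter only matches when it sees the built-up path, which is the behavior the tests requested above would need to pin down for both the wildcard and non-wildcard cases.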
                
> Listing in RawLocalFileSystem is inefficient
> --------------------------------------------
>
>                 Key: HADOOP-9981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9981
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Kihwal Lee
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HADOOP-9981.001.patch, HADOOP-9981.002.patch
>
>
> After HADOOP-9652, listStatus() or globStatus() calls against a local file system directory
> are very slow.  A user was loading data from local file system to Hive and it took about 30
> seconds. The same operation took less than a second pre-HADOOP-9652.
> The input path had many other files besides the input files, and strace showed a fork
> & exec of stat against each and every one of them. jstack confirmed that this was being
> done from getNativeFileLinkStatus().

