hadoop-common-issues mailing list archives

From "Sumit Kumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10634) Add recursive list apis to FileSystem to give implementations an opportunity for optimization
Date Mon, 02 Jun 2014 16:35:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015522#comment-14015522 ]

Sumit Kumar commented on HADOOP-10634:

That was a great suggestion, [~stevel@apache.org], and thanks for clarifying the purpose of the
listLocatedStatus APIs; it was confusing when I started working on this patch. I've updated the
patch for MAPREDUCE-5907 to use these iterator-based APIs, which should address the memory concerns.

I'm still going through HADOOP-10400. At a high level it's a great enhancement, but I have a few
notes that I will share in a day or two (still going through the patch).

> Add recursive list apis to FileSystem to give implementations an opportunity for optimization
> ---------------------------------------------------------------------------------------------
>                 Key: HADOOP-10634
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10634
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>         Attachments: HADOOP-10634.patch
> Currently, several code paths in Hadoop use recursive listing to discover files and folders
> under a given path. For example, FileInputFormat (both the mapreduce and mapred implementations)
> does this while calculating splits. However, the listing is done level by level:
> to discover files under /foo/bar, the code first lists /foo/bar to get its
> immediate children, then makes the same call on each of those children to discover
> their immediate children, and so on. This doesn't scale well for fs implementations like s3,
> because every listStatus call ends up being a web service call to s3. When a large
> number of files is considered for input, this makes the getSplits() call slow.
> This patch adds a new set of recursive list APIs that give the s3 fs implementation an
> opportunity to optimize. The behavior remains the same for other implementations: a default
> implementation is provided, so they don't have to implement anything new. For s3, however,
> a simple change (as shown in the patch) improves listing performance.
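
The idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in HADOOP-10634.patch: `SimpleFs`, `FlatKeyFs`, `ListingDemo`, `listFilesRecursive` and the `requests` counter are hypothetical names, and the key store is an in-memory stand-in rather than s3. The interface supplies a default recursive listing that walks level by level, and the flat key store overrides it with a single prefix scan. A real object store would typically page its results, so the request count there would not be exactly one.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical interface for illustration; not Hadoop's FileSystem API
// and not the code in HADOOP-10634.patch.
interface SimpleFs {
    // Immediate children of dir; names ending in "/" denote directories.
    List<String> listStatus(String dir);

    // Default recursive listing: walks level by level through listStatus,
    // so implementations that do not override it keep today's behavior.
    default List<String> listFilesRecursive(String dir) {
        List<String> files = new ArrayList<>();
        for (String child : listStatus(dir)) {
            if (child.endsWith("/")) {
                files.addAll(listFilesRecursive(child));
            } else {
                files.add(child);
            }
        }
        return files;
    }
}

// Hypothetical flat key store: objects are keys, directories are only prefixes.
class FlatKeyFs implements SimpleFs {
    private final TreeSet<String> keys = new TreeSet<>();
    int requests = 0; // each increment models one remote list request

    FlatKeyFs(List<String> objectKeys) {
        keys.addAll(objectKeys);
    }

    @Override
    public List<String> listStatus(String dir) {
        requests++;
        Set<String> children = new LinkedHashSet<>();
        for (String key : keys.tailSet(dir, true)) {
            if (!key.startsWith(dir)) break; // left the prefix range
            String rest = key.substring(dir.length());
            if (rest.isEmpty()) continue; // skip a marker key equal to dir
            int slash = rest.indexOf('/');
            children.add(slash < 0 ? key : dir + rest.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    // Optimized override: one scan over every key under the prefix.
    @Override
    public List<String> listFilesRecursive(String dir) {
        requests++;
        List<String> files = new ArrayList<>();
        for (String key : keys.tailSet(dir, true)) {
            if (!key.startsWith(dir)) break;
            if (!key.endsWith("/")) files.add(key);
        }
        return files;
    }
}

class ListingDemo {
    public static void main(String[] args) {
        FlatKeyFs fs = new FlatKeyFs(List.of(
            "/foo/bar/a/part-1", "/foo/bar/b/part-2", "/foo/bar/part-0"));

        // Wrapping listStatus alone yields the default level-by-level traversal.
        SimpleFs levelByLevel = fs::listStatus;
        int found = levelByLevel.listFilesRecursive("/foo/bar/").size();
        System.out.println("default:  files=" + found + " requests=" + fs.requests);

        fs.requests = 0;
        found = fs.listFilesRecursive("/foo/bar/").size();
        System.out.println("override: files=" + found + " requests=" + fs.requests);
    }
}
```

In this toy setup the default traversal issues one request per directory (three for /foo/bar/ and its two subdirectories), while the override finds the same three files with one request.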

This message was sent by Atlassian JIRA
