hadoop-mapreduce-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
Date Wed, 12 Apr 2017 12:13:41 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965716#comment-15965716 ]

Steve Loughran commented on MAPREDUCE-5907:
-------------------------------------------

I don't know anyone looking at it.

It's an out-of-date patch that combines optimisations in the FS code and the S3N and HAR FS
implementations with matching changes in the MR code.

If the changes to the mapreduce module can go in today, using the existing
{{FileSystem.listFiles(path, recursive)}} call, then it'll be straightforward: that's the only
bit which needs review and merge. S3A already handles that recursive listing very efficiently,
and the other object stores can be brought up to speed.

If we need changes to the FS, well, I'm not against them (there are definite inconsistencies
there), but it's a more serious change: the HDFS team will need to look at it, and we'll need
changes to the FS spec, contract tests, etc. Lots of work, and so harder to get in.

Why not try applying just the MR changes and see what happens?

> Improve getSplits() performance for fs implementations that can utilize performance gains
from recursive listing
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5907
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: client
>    Affects Versions: 2.4.0
>            Reporter: Sumit Kumar
>            Assignee: Sumit Kumar
>              Labels: BB2015-05-TBR
>         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, MAPREDUCE-5907.patch
>
>
> FileInputFormat (both the mapreduce and mapred implementations) lists input directories recursively
when calculating splits, but does so level by level. To discover the files under /foo/bar, it
lists /foo/bar first to get the immediate children, then makes the same call on each of those
children to discover their immediate children, and so on. This doesn't scale well for
object-store-based fs implementations such as s3 and swift, because every listStatus call ends up
being a web service call to the backend. When a large number of files are considered for input,
this makes the getSplits() call slow.
> This patch adds a new set of recursive list APIs that give fs implementations the opportunity
to optimize. Behavior remains the same for other implementations: a default implementation is
provided, so they don't have to implement anything new. For object-store-based fs implementations,
a simple change that sets the recursive flag to true (as shown in the patch) improves listing
performance.
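
To make the cost difference described above concrete, here is a minimal, self-contained Java sketch. It is not taken from the patch and does not use the Hadoop API: ListingSketch, its in-memory key map, and the remoteCalls counter are all hypothetical stand-ins. listStatus() models one web service call per directory, and listFilesRecursive() models a store that can answer a recursive listing (in the spirit of FileSystem.listFiles(path, true)) with a single prefix scan. Result paging, which real object stores apply, is ignored here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy object store (hypothetical): flat, sorted keys plus a counter of simulated remote calls. */
final class ListingSketch {
    // Keys are full object paths; values are object sizes in bytes.
    private final TreeMap<String, Long> objects = new TreeMap<>();
    int remoteCalls = 0;

    void put(String key, long size) {
        objects.put(key, size);
    }

    /** All entries whose key starts with dir + "/" (the key equal to the prefix is excluded). */
    private NavigableMap<String, Long> under(String prefix) {
        return objects.subMap(prefix, false, prefix + Character.MAX_VALUE, false);
    }

    private static String asPrefix(String dir) {
        return dir.endsWith("/") ? dir : dir + "/";
    }

    /** One simulated call: immediate children only. Directories carry a trailing '/'. */
    List<String> listStatus(String dir) {
        remoteCalls++;
        String prefix = asPrefix(dir);
        TreeSet<String> children = new TreeSet<>();
        for (String key : under(prefix).keySet()) {
            int slash = key.indexOf('/', prefix.length());
            // A deeper key contributes only its first path component, as a directory.
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** One simulated call: every file under dir, at any depth. */
    List<String> listFilesRecursive(String dir) {
        remoteCalls++;
        return new ArrayList<>(under(asPrefix(dir)).keySet());
    }

    /** Level-by-level discovery: one listStatus call per directory visited. */
    List<String> walkLevelByLevel(String dir) {
        List<String> files = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(dir);
        while (!pending.isEmpty()) {
            for (String child : listStatus(pending.poll())) {
                if (child.endsWith("/")) {
                    pending.add(child);
                } else {
                    files.add(child);
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        ListingSketch store = new ListingSketch();
        store.put("/foo/bar/4", 10L);
        store.put("/foo/bar/a/1", 10L);
        store.put("/foo/bar/a/2", 10L);
        store.put("/foo/bar/b/c/3", 10L);

        int found = store.walkLevelByLevel("/foo/bar").size();
        System.out.println("level-by-level: " + found + " files, " + store.remoteCalls + " calls");

        store.remoteCalls = 0;
        found = store.listFilesRecursive("/foo/bar").size();
        System.out.println("recursive: " + found + " files, " + store.remoteCalls + " calls");
    }
}
```

With the four sample keys, the level-by-level walk visits /foo/bar, a/, b/ and b/c/, so it makes four simulated calls, while the recursive listing finds the same four files in one. In this toy model the number of calls grows with the number of directories rather than the number of files, which is the scaling problem the description points at.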



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

