hadoop-mapreduce-issues mailing list archives

From "Ivan Bella (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4815) Speed up FileOutputCommitter#commitJob for many output files
Date Mon, 16 Mar 2015 19:24:41 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363745#comment-14363745 ]

Ivan Bella commented on MAPREDUCE-4815:

[~l201514] I work with Dave. After reviewing the latest patch, it appears that the basic
problem we reported still exists.  If mergePaths is called from two reducers at the same
time and both have output files in the same subdirectory (the subdirectory is the key here),
we get the situation described previously.  For example, if the working directory is dir.1
and two reducers are writing files to dir.1/dir.2, then one of the reducers will
non-deterministically create dir.1/dir.2/dir.2.  In addition, we saw files placed in
dir.1/dir.2/dir.2 that should have been moved into dir.1/dir.2.  We will deploy this patch
and confirm.
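
The interleaving described above can be sketched with a toy in-memory model. This is
only an illustration under stated assumptions, not the actual FileOutputCommitter code:
ToyFs, racyInterleaving, and the tmp/a and tmp/b source paths are hypothetical names, and
the model assumes that a rename onto an existing directory moves the source inside that
directory. Each "reducer" checks whether dir.1/dir.2 exists and then renames its own dir.2
into place; when both checks run before either rename, the second rename lands one level
too deep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class MergePathsRace {

    /** Hypothetical toy file system that tracks directory paths only. */
    static final class ToyFs {
        final Set<String> dirs = new TreeSet<>();

        boolean exists(String path) {
            return dirs.contains(path);
        }

        /**
         * Assumed rename semantics: if dst already exists as a directory,
         * src is moved inside it. Returns the path src ended up at.
         */
        String rename(String src, String dst) {
            String name = src.substring(src.lastIndexOf('/') + 1);
            String target = exists(dst) ? dst + "/" + name : dst;
            // Collect src and its descendants first, then move them.
            List<String> moved = new ArrayList<>();
            for (String p : dirs) {
                if (p.equals(src) || p.startsWith(src + "/")) {
                    moved.add(p);
                }
            }
            for (String p : moved) {
                dirs.remove(p);
                dirs.add(target + p.substring(src.length()));
            }
            return target;
        }
    }

    /** Replays one bad interleaving of two reducers merging into dir.1/dir.2. */
    static String racyInterleaving() {
        ToyFs fs = new ToyFs();
        fs.dirs.add("dir.1");
        fs.dirs.add("tmp/a/dir.2"); // hypothetical output location of reducer A
        fs.dirs.add("tmp/b/dir.2"); // hypothetical output location of reducer B
        String dst = "dir.1/dir.2";

        // Both reducers check for the destination before either one renames.
        boolean aSeesDst = fs.exists(dst);
        boolean bSeesDst = fs.exists(dst);

        // Reducer A renames first and creates dir.1/dir.2.
        if (!aSeesDst) {
            fs.rename("tmp/a/dir.2", dst);
        }
        // Reducer B acts on its stale check; dst now exists, so the
        // rename moves B's directory inside it.
        return !bSeesDst ? fs.rename("tmp/b/dir.2", dst) : dst;
    }

    public static void main(String[] args) {
        System.out.println(racyInterleaving()); // prints dir.1/dir.2/dir.2
    }
}
```

In this model the problem is the gap between the existence check and the rename, which is
why the outcome depends on timing.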

> Speed up FileOutputCommitter#commitJob for many output files
> ------------------------------------------------------------
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>              Labels: perfomance
>             Fix For: 2.7.0
>         Attachments: MAPREDUCE-4815.v10.patch, MAPREDUCE-4815.v11.patch, MAPREDUCE-4815.v12.patch,
> MAPREDUCE-4815.v13.patch, MAPREDUCE-4815.v14.patch, MAPREDUCE-4815.v15.patch, MAPREDUCE-4815.v16.patch,
> MAPREDUCE-4815.v17.patch, MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch,
> MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
> If a job generates many files to commit then the commitJob method call at the end of
> the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks
> commit directly to the final output directory as they were completing and commitJob had very
> little to do.  The commit work was processed in parallel and overlapped the processing of
> outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.