hadoop-mapreduce-issues mailing list archives

From "Gera Shegalov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
Date Wed, 21 Jan 2015 09:14:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285390#comment-14285390 ]

Gera Shegalov commented on MAPREDUCE-4815:
------------------------------------------

Thanks for the latest patch, [~l201514]! Some comments/questions:

1. Since we are changing the behavior and not the API, we can have a property
{{mapreduce.fileoutputcommitter.algorithm.version}} with two values:
1: the old behavior. This should be the default unless we have solved the upgrade in an efficient,
bullet-proof manner.
2: the new proposed design.

Why is the flag for the new behavior not initialized when
{{FileOutputCommitter#FileOutputCommitter(Path, TaskAttemptContext)}} is used?
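
To illustrate, here is a minimal sketch of reading the property in one helper that every constructor could call, so that the flag is never left uninitialized. It is not code from the patch: the class name, the helper name, and the plain {{Map}} standing in for the job configuration are all hypothetical, and the default of 1 simply follows the suggestion above.

```java
import java.util.HashMap;
import java.util.Map;

public class CommitterVersionConfig {
    // Property name as proposed in this comment.
    static final String ALGORITHM_VERSION_KEY =
        "mapreduce.fileoutputcommitter.algorithm.version";
    // Assumption: default to 1 (the old behavior), as suggested above.
    static final int DEFAULT_ALGORITHM_VERSION = 1;

    /**
     * Reads the algorithm version from a plain map that stands in for the
     * job configuration (hypothetical helper, not the patch's code).
     */
    static int readAlgorithmVersion(Map<String, String> conf) {
        String raw = conf.get(ALGORITHM_VERSION_KEY);
        int version = (raw == null)
            ? DEFAULT_ALGORITHM_VERSION
            : Integer.parseInt(raw.trim());
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException(
                "Only versions 1 and 2 are supported, got: " + version);
        }
        return version;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Property not set: falls back to the default.
        System.out.println(readAlgorithmVersion(conf));
        // Property set explicitly to the new design.
        conf.put(ALGORITHM_VERSION_KEY, "2");
        System.out.println(readAlgorithmVersion(conf));
    }
}
```

Rejecting unknown values early keeps a typo in the job configuration from silently selecting one of the two code paths.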

There is only a minor difference between {{runOldCommitJob}} and {{runNewCommitJob}}: the new
one skips the lengthy copy loop. There is therefore no need to duplicate the code; enclose the
copy loop in an {{if (version == 1)}} check instead. I think similar checks are sufficient for
{{commit/recoverTask}} as well.
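
A rough, self-contained sketch of what a single version-gated code path could look like is given here. It is an in-memory simulation, not the patch: the class, its fields, and its methods are hypothetical, lists of file names stand in for directories, and the version 2 branch of {{commitTask}} (publishing straight to the final output) is an assumption modeled on the 1.x behavior described in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class VersionedCommitSketch {
    private final int version;
    // Stand-in for the committed task paths that version 1 merges at job commit.
    private final Map<String, List<String>> committedTaskFiles = new TreeMap<>();
    // Stand-in for the final output directory.
    private final List<String> finalOutput = new ArrayList<>();

    VersionedCommitSketch(int version) {
        if (version != 1 && version != 2) {
            throw new IllegalArgumentException("Unsupported version: " + version);
        }
        this.version = version;
    }

    void commitTask(String taskId, List<String> files) {
        if (version == 1) {
            // Old behavior: park the task output until commitJob.
            committedTaskFiles.put(taskId, new ArrayList<>(files));
        } else {
            // Assumed new behavior: publish directly to the final output.
            finalOutput.addAll(files);
        }
    }

    void commitJob() {
        if (version == 1) {
            // The lengthy copy loop runs only for the old behavior.
            for (List<String> files : committedTaskFiles.values()) {
                finalOutput.addAll(files);
            }
            committedTaskFiles.clear();
        }
        // Steps common to both versions (cleanup and so on) would follow here.
    }

    /** Returns a sorted copy so results are comparable across versions. */
    List<String> finalOutput() {
        List<String> copy = new ArrayList<>(finalOutput);
        Collections.sort(copy);
        return copy;
    }

    /** Runs the same two-task job under the given version. */
    static List<String> run(int version) {
        VersionedCommitSketch committer = new VersionedCommitSketch(version);
        committer.commitTask("task_0", Arrays.asList("part-00000"));
        committer.commitTask("task_1", Arrays.asList("part-00001"));
        committer.commitJob();
        return committer.finalOutput();
    }

    public static void main(String[] args) {
        // Both versions should end up with the same final output.
        System.out.println(run(1).equals(run(2)));
    }
}
```

Driving the same job through {{run(1)}} and {{run(2)}} is also a cheap way to exercise one test body against both the old and the new logic.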

Code under the comment 
{code}
//for backwards compatibility after upgrade to the new fileOutputCommitter,
//check if there are any output left in committedTaskPath
{code} 
seems misplaced and should actually be under {{runNewRecoverTask}}. This upgrade scenario will
need a test. Likewise, the existing tests should be run under both the old and the new logic.

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the end of
> the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks
> commit directly to the final output directory as they were completing and commitJob had very
> little to do.  The commit work was processed in parallel and overlapped the processing of
> outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
