hadoop-mapreduce-issues mailing list archives

From "Gera Shegalov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
Date Tue, 23 Dec 2014 22:22:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257608#comment-14257608 ]

Gera Shegalov commented on MAPREDUCE-4815:

I think we should strive for a solution that does not create any sibling directories, as that
would surprise users and would mean that checkOutputSpec needs to be adjusted in every
derived class. I think we can modify the behavior of the FOC based on [~l201514]'s idea
but still use the existing directory structure for backward compatibility:

task attempts write as usual to $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/

# rename *all* files '$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/foo' to
# rename '$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID' to '$joboutput/_temporary/$appAttemptID/$taskID', which is the actual commit

# if '$joboutput/_temporary/$(appAttemptID - 1)/$taskID' exists: rename to '$joboutput/_temporary/$appAttemptID/$taskID'
# for backwards compatibility after upgrade to the new logic, check if there are any '$joboutput/_temporary/$appAttemptID/$taskID/foo' and rename them to '$joboutput/foo'

# blow away $joboutput/_temporary
# write $joboutput/_SUCCESS
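
The steps above can be sketched roughly as follows. This is only an illustration on the local filesystem using java.nio.file, not the Hadoop FileSystem API and not the actual patch. The class name CommitSketch, the helper names, and the grouping of the steps into commitTask / recoverTask / commitJob are assumptions made for the example. The first list item is cut off in this message (its rename target is missing), so that step is left out of the sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the proposed directory layout; not Hadoop code.
final class CommitSketch {

    // $joboutput/_temporary/$appAttemptID
    static Path appAttemptDir(Path jobOutput, int appAttemptId) {
        return jobOutput.resolve("_temporary").resolve(Integer.toString(appAttemptId));
    }

    // Task commit: rename .../_temporary/$taskAttemptID to .../$taskID.
    static void commitTask(Path jobOutput, int appAttemptId,
                           String taskAttemptId, String taskId) throws IOException {
        Path appDir = appAttemptDir(jobOutput, appAttemptId);
        Path attemptDir = appDir.resolve("_temporary").resolve(taskAttemptId);
        Files.move(attemptDir, appDir.resolve(taskId));
    }

    // Recovery: carry over a task committed by the previous app attempt, if any.
    static void recoverTask(Path jobOutput, int appAttemptId, String taskId) throws IOException {
        Path previous = appAttemptDir(jobOutput, appAttemptId - 1).resolve(taskId);
        if (Files.exists(previous)) {
            Path current = appAttemptDir(jobOutput, appAttemptId).resolve(taskId);
            Files.createDirectories(current.getParent());
            Files.move(previous, current);
        }
    }

    // Job commit: move anything still under $taskID to $joboutput,
    // delete _temporary, then write _SUCCESS.
    static void commitJob(Path jobOutput, int appAttemptId) throws IOException {
        Path appDir = appAttemptDir(jobOutput, appAttemptId);
        if (Files.isDirectory(appDir)) {
            for (Path taskDir : children(appDir)) {
                boolean isAttemptParent = taskDir.getFileName().toString().equals("_temporary");
                if (isAttemptParent || !Files.isDirectory(taskDir)) {
                    continue;
                }
                for (Path file : children(taskDir)) {
                    Files.move(file, jobOutput.resolve(file.getFileName()));
                }
            }
        }
        deleteRecursively(jobOutput.resolve("_temporary"));
        Files.createFile(jobOutput.resolve("_SUCCESS"));
    }

    // Snapshot the directory listing so entries can be moved while looping.
    static List<Path> children(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.collect(Collectors.toList());
        }
    }

    // Delete children before parents by visiting paths in reverse order.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

In this sketch the per-task rename is a single directory move, so the only per-file work left in commitJob is the backwards-compatibility pass over '$taskID/foo'.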

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch,
>                      MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch
>
> If a job generates many files to commit then the commitJob method call at the end of
> the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks
> commit directly to the final output directory as they were completing and commitJob had very
> little to do.  The commit work was processed in parallel and overlapped the processing of
> outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.
