Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Fri, 6 Mar 2015 22:55:41 +0000 (UTC)
From: "Siqi Li (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12617132.1353516439000.25751.1425682541619@Atlassian.JIRA>
In-Reply-To: <JIRA.12617132.1353516439000@Atlassian.JIRA>
References: <JIRA.12617132.1353516439000@Atlassian.JIRA>
 <JIRA.12617132.1353516439368@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob
 can be very slow for jobs with many output files
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351086#comment-14351086 ] 

Siqi Li commented on MAPREDUCE-4815:
------------------------------------

I have updated the patch to remove the following documentation, which is unnecessary and may confuse users:
{code}
- After upgrading the file output committer version from 1 to 2,
-  all newly submitted jobs will use algorithm 2. However, during
-  the upgrade, if AM attempt of old jobs fails and restarts, the
-  new AM attempt will pick up algorithm 2 and try to recover the
-  task output from previous attempt. Algorithm 2 is able to handle
-  this case properly by moving all previously committed task files
-  to the final output directory.

-  Note: this doesn't support the rolling upgrade, it can only tolerate
-  full upgrade, after which the algorithm can be set to 2.
{code}
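
For anyone following the thread, here is a rough sketch of the difference between the two algorithms and of the recovery case described in the removed text. It is only a toy model on a local filesystem, not the actual FileOutputCommitter code: all function names, directory names, and file names are hypothetical, and it ignores task attempts, HDFS semantics, and failure handling.

```python
import os
import tempfile
from pathlib import Path


def make_task_output(work: Path, task: str, n_files: int) -> Path:
    """Create a hypothetical task working directory holding n_files output files."""
    task_dir = work / task
    task_dir.mkdir(parents=True)
    for i in range(n_files):
        (task_dir / f"{task}-part-{i}").write_text("data")
    return task_dir


def commit_task_v1(task_dir: Path, job_tmp: Path) -> None:
    """Algorithm 1 (sketch): park the task's output under the job temp directory."""
    os.rename(task_dir, job_tmp / task_dir.name)


def commit_job_v1(job_tmp: Path, final_dir: Path) -> int:
    """Algorithm 1 (sketch): one serial pass moves every committed file at job end."""
    moved = 0
    for task_dir in sorted(job_tmp.iterdir()):
        for f in sorted(task_dir.iterdir()):
            os.rename(f, final_dir / f.name)
            moved += 1
    return moved


def commit_task_v2(task_dir: Path, final_dir: Path) -> int:
    """Algorithm 2 (sketch): each task moves its files straight to the final directory."""
    moved = 0
    for f in sorted(task_dir.iterdir()):
        os.rename(f, final_dir / f.name)
        moved += 1
    return moved


def recover_v2(job_tmp: Path, final_dir: Path) -> int:
    """Restarted AM on algorithm 2 (sketch): move files that an earlier attempt
    committed under algorithm 1 into the final output directory."""
    return commit_job_v1(job_tmp, final_dir)


def run_demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dirs = {}
        for case in ("v1", "v2", "mixed"):
            for name in ("job_tmp", "final"):
                dirs[case, name] = root / case / name
                dirs[case, name].mkdir(parents=True)
        tasks = ["t0", "t1", "t2"]

        # Algorithm 1: all of the moving is deferred to the job commit.
        for t in tasks:
            commit_task_v1(make_task_output(root / "v1" / "work", t, 2), dirs["v1", "job_tmp"])
        v1_job_moves = commit_job_v1(dirs["v1", "job_tmp"], dirs["v1", "final"])

        # Algorithm 2: tasks do the moving, so nothing is left for the job commit.
        for t in tasks:
            commit_task_v2(make_task_output(root / "v2" / "work", t, 2), dirs["v2", "final"])
        v2_job_moves = commit_job_v1(dirs["v2", "job_tmp"], dirs["v2", "final"])

        # Upgrade case: t0 was committed under algorithm 1, then the AM restarts on 2.
        commit_task_v1(make_task_output(root / "mixed" / "work", "t0", 2), dirs["mixed", "job_tmp"])
        recovered = recover_v2(dirs["mixed", "job_tmp"], dirs["mixed", "final"])
        commit_task_v2(make_task_output(root / "mixed" / "work", "t1", 2), dirs["mixed", "final"])

        return {
            "v1_job_moves": v1_job_moves,
            "v2_job_moves": v2_job_moves,
            "v1_final": len(list(dirs["v1", "final"].iterdir())),
            "v2_final": len(list(dirs["v2", "final"].iterdir())),
            "recovered": recovered,
            "mixed_final": len(list(dirs["mixed", "final"].iterdir())),
        }


if __name__ == "__main__":
    print(run_demo())
```

In this model both algorithms end with the same files in the final directory; the difference is whether the moves happen in one serial pass at job commit or are spread across the task commits.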

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v10.patch, MAPREDUCE-4815.v11.patch, MAPREDUCE-4815.v12.patch, MAPREDUCE-4815.v13.patch, MAPREDUCE-4815.v14.patch, MAPREDUCE-4815.v15.patch, MAPREDUCE-4815.v16.patch, MAPREDUCE-4815.v17.patch, MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit, the commitJob method call at the end of the job can take minutes.  This is a performance regression from 1.x, where the tasks committed directly to the final output directory as they completed and commitJob had very little to do.  The commit work was processed in parallel and overlapped with the processing of outstanding tasks.  In 0.23/2.x, the commit is single-threaded and does not start until all tasks have completed.

