hadoop-mapreduce-issues mailing list archives

From "Siqi Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
Date Tue, 13 Jan 2015 19:17:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275733#comment-14275733 ]

Siqi Li commented on MAPREDUCE-4815:
------------------------------------

I have attached a patch v9 based on the design suggestions from Jason and Gera.

Also, I have run a number of performance test jobs, as follows:

1. Teragen job with 500 mappers 

             Job Execution Time   Job Commit Time
Old APIs     43 sec               31 sec
New APIs     31 sec               0.2 sec
Savings      ~38.7%

2. Teragen job with 5K mappers 

             Job Execution Time   Job Commit Time
Old APIs     6 min 8 sec          2 min
New APIs     4 min 10 sec         0.3 sec
Savings      ~33.3%

3. Teragen job with 20K mappers 

             Job Execution Time   Job Commit Time
Old APIs     23 min 45 sec        10 min
New APIs     15 min 36 sec        0.5 sec
Savings      ~33.3%

According to the tables above, the average time saving for the Teragen jobs is ~33.3%, and
the job commit time with the new API is almost instant, whereas with the old APIs it grows
linearly with the number of tasks. Note that these numbers were measured with the entire
cluster dedicated to this job; in a real scenario, the old-API commit time could be much
longer when the NNs are under heavy load.

In addition, the new APIs benefit most on large jobs with a short average task finish time.
Such jobs need relatively little time to run all of their tasks but, with the old APIs, spend
a long time committing, so commit accounts for a large portion of the overall job time.
With the new APIs the commit time is largely eliminated, hence the saving is substantial.

For long-running small jobs, the saving might be negligible, but it will not be worse than
the old APIs.
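For anyone who wants to try the new behavior once the patch lands, the design exposes the
new commit path behind a committer algorithm version setting; the exact property name below
reflects my understanding of the current patch and may change before commit, with the old
(v1) behavior remaining the default:

```xml
<!-- mapred-site.xml: opt in to the new commit algorithm.
     Property name assumes the version selector introduced by this patch. -->
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```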





> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch, MAPREDUCE-4815.v9.patch
>
>
> If a job generates many files to commit then the commitJob method call at the end of
> the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks
> commit directly to the final output directory as they were completing and commitJob had very
> little to do.  The commit work was processed in parallel and overlapped the processing of
> outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.
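To make the difference concrete, here is a rough sketch (plain Python with invented names,
not actual Hadoop code) of where the per-file rename work lands in each approach: in the
old path, commitJob performs one NameNode rename per output file, serially, after all tasks
finish; in the new path, each task renames its own files during commitTask, overlapping with
other running tasks, so commitJob has no per-file work left.

```python
def commit_v1(task_outputs):
    """Old behavior: commitJob serially renames every task's output files."""
    renames_in_commit_job = 0
    for files in task_outputs:
        for _ in files:
            renames_in_commit_job += 1  # one NN rename per output file, single-threaded
    return renames_in_commit_job

def commit_v2(task_outputs):
    """New behavior: renames happen task-side in commitTask, in parallel
    with other running tasks; commitJob does no per-file work."""
    for files in task_outputs:
        for _ in files:
            pass  # rename already done during commitTask
    return 0  # commitJob cost no longer scales with file count

# 20K mappers, one output file each: v1 does 20000 renames in commitJob, v2 does none.
outputs = [["part-%05d" % i] for i in range(20000)]
```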



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
