hadoop-mapreduce-issues mailing list archives

From "Dave Marion (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files
Date Wed, 10 Dec 2014 18:48:15 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241525#comment-14241525 ]

Dave Marion commented on MAPREDUCE-4815:

I think we might be seeing a side effect of patch #8. An output directory is sometimes
created one level below where it should be. For example, if we expect files in dir1/dir2,
there are times when we see dir1/dir2/dir2. I think the problem stems from mergePaths now
being called from commitTask, which allows a race condition when two tasks complete at the
same time. Specifically, it's the last case in mergePaths, when 'to' does not exist, so it
calls rename.

I traced this, hopefully correctly, to FSNamesystem.renameToInternal(), which has a nasty comment
about doing something that it shouldn't. It also appears to be what creates dir1/dir2/dir2. I think
this is a bug in FSNamesystem. For example, given:

from = /pathA/dir1/dir2
to = /pathB/dir1/dir2

What happens when two processes call fs.rename(from,to) at the same time?
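
One way to reason about the question is a toy model. The sketch below is not Hadoop code:
ToyNamespace, simulateRace and the task1/task2 source paths are hypothetical names made up for
illustration. It assumes the rename behavior described in the Hadoop FileSystem specification,
where renaming a directory onto an existing directory moves the source underneath the
destination, and it replays one possible interleaving of two tasks that both check the
destination before either one renames.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class RenameRaceSketch {

    // Toy in-memory namespace holding absolute directory paths only. NOT HDFS.
    static final class ToyNamespace {
        private final TreeSet<String> dirs = new TreeSet<>();

        // Create a directory and all of its ancestors.
        void mkdirs(String path) {
            String p = path;
            while (!p.isEmpty()) {
                dirs.add(p);
                p = p.substring(0, p.lastIndexOf('/'));
            }
        }

        boolean exists(String path) {
            return dirs.contains(path);
        }

        // Assumed semantics: if dst is an existing directory, src is moved
        // to dst/<name of src> instead of replacing dst.
        boolean rename(String src, String dst) {
            if (!exists(src)) {
                return false;
            }
            String target = dst;
            if (exists(dst)) {
                target = dst + "/" + src.substring(src.lastIndexOf('/') + 1);
                if (exists(target)) {
                    return false;
                }
            }
            String parent = target.substring(0, target.lastIndexOf('/'));
            if (!exists(parent)) {
                return false;
            }
            // Move src and everything below it to the target location.
            List<String> moved = new ArrayList<>();
            for (String d : dirs) {
                if (d.equals(src) || d.startsWith(src + "/")) {
                    moved.add(d);
                }
            }
            dirs.removeAll(moved);
            for (String d : moved) {
                dirs.add(target + d.substring(src.length()));
            }
            return true;
        }
    }

    // Replays one interleaving: both tasks see that 'to' is missing,
    // then both take the "does not exist, just rename" branch.
    static ToyNamespace simulateRace() {
        ToyNamespace ns = new ToyNamespace();
        String fromA = "/pathA/task1/dir1/dir2"; // hypothetical task output
        String fromB = "/pathA/task2/dir1/dir2"; // hypothetical task output
        String to = "/pathB/dir1/dir2";
        ns.mkdirs(fromA);
        ns.mkdirs(fromB);
        ns.mkdirs("/pathB/dir1");

        boolean taskASawTo = ns.exists(to);
        boolean taskBSawTo = ns.exists(to);
        if (!taskASawTo) {
            ns.rename(fromA, to); // creates /pathB/dir1/dir2
        }
        if (!taskBSawTo) {
            ns.rename(fromB, to); // 'to' now exists, so this lands below it
        }
        return ns;
    }

    public static void main(String[] args) {
        ToyNamespace ns = simulateRace();
        System.out.println("nested dir2 created: " + ns.exists("/pathB/dir1/dir2/dir2"));
    }
}
```

Under these assumptions the second rename succeeds rather than failing, and its directory ends
up at /pathB/dir1/dir2/dir2. If that matches what the NameNode does, the nesting would come from
the check and the rename not being atomic, not from two renames running at the same instant.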

> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, MAPREDUCE-4815.v5.patch,
>                      MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, MAPREDUCE-4815.v8.patch
>
> If a job generates many files to commit then the commitJob method call at the end of
> the job can take minutes.  This is a performance regression from 1.x, as 1.x had the tasks
> commit directly to the final output directory as they were completing and commitJob had very
> little to do.  The commit work was processed in parallel and overlapped the processing of
> outstanding tasks.  In 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.
