hadoop-mapreduce-issues mailing list archives

From "Jim Finnessy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1471) FileOutputCommitter does not safely clean up its temporary files
Date Wed, 10 Feb 2010 21:13:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832199#action_12832199 ]

Jim Finnessy commented on MAPREDUCE-1471:
-----------------------------------------

It is a pretty common use case for me, and I'm guessing for others, to write a number of files
to the same output directory from concurrent jobs.

For instance, counting the number of occurrences of a daily event for an entire month.

Setting up the directory structure:
.../2008/07/

All of the count files for that month end up in that directory.

When running concurrent jobs 1 and 2 with a modified version of SequenceFileOutputFormat
(one that allows me to name the output file what I want and to run multiple jobs with the
same working path), Hadoop creates the following temporary directories/files:
.../2008/07/_temporary/_attempt_local_0001_r_000000_0
.../2008/07/_temporary/_attempt_local_0002_r_000000_0

When job 1 completes, in order to clean up its temporary files, it removes
.../2008/07/_temporary/

This then blows away the temporary files for job 2.

I would say that normally this would not be a Hadoop problem, because of the way I extended
SequenceFileOutputFormat to allow multiple jobs to have the same working path so long as the
output file name is unique (which it is). However, it is the cleanupJob in FileOutputCommitter
that causes the problem, and since the committer in FileOutputFormat is private, I cannot
extend and replace the FileOutputCommitter with my own. Currently I have overridden the
getOutputCommitter(context) method in my own FileOutputFormat to work around this, but if the
class ever starts accessing the committer without going through that method, I'm in trouble again.
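
For reference, here is a minimal sketch of that workaround against the 0.20
org.apache.hadoop.mapreduce API. The class names (SharedDirSequenceFileOutputFormat,
SharedDirOutputCommitter) are made up for illustration, and the empty cleanupJob is a
deliberate simplification that leaks attempt directories from failed tasks:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SharedDirSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> {

  private OutputCommitter committer;

  @Override
  public synchronized OutputCommitter getOutputCommitter(
      TaskAttemptContext context) throws IOException {
    if (committer == null) {
      Path output = getOutputPath(context);
      committer = new SharedDirOutputCommitter(output, context);
    }
    return committer;
  }

  // Committer that skips the recursive delete of _temporary, since other
  // concurrent jobs sharing the output path may still have live attempt
  // directories under it.
  private static class SharedDirOutputCommitter extends FileOutputCommitter {
    SharedDirOutputCommitter(Path output, TaskAttemptContext context)
        throws IOException {
      super(output, context);
    }

    @Override
    public void cleanupJob(JobContext context) throws IOException {
      // Intentionally empty: do not delete <output>/_temporary here.
      // (Hypothetical simplification; failed-task attempt dirs are leaked.)
    }
  }
}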

So... I'd really appreciate it if either the committer field in FileOutputFormat were made
protected rather than private, so that client applications can override the FileOutputCommitter,
or if jobs only cleaned up the temporary files that they themselves created rather than
recursively deleting the top-level directory (_temporary).
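
To illustrate the second option, here is a hypothetical sketch of what a job-scoped cleanup
could look like. It assumes task attempts would also write under a per-job subdirectory of
_temporary, which is not what 0.20 does today, so this shows the shape of the fix rather than
a drop-in patch:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical committer sketching per-job cleanup: scope the temporary
// directory by job ID so cleanupJob deletes only this job's files.
public class JobScopedCleanupCommitter extends FileOutputCommitter {

  public JobScopedCleanupCommitter(Path output, TaskAttemptContext context)
      throws IOException {
    super(output, context);
  }

  @Override
  public void cleanupJob(JobContext context) throws IOException {
    Path output = FileOutputFormat.getOutputPath(context);
    // e.g. .../2008/07/_temporary/job_local_0001 -- only this job's subtree,
    // not the shared .../2008/07/_temporary directory itself
    Path jobTmp = new Path(new Path(output, "_temporary"),
        context.getJobID().toString());
    FileSystem fs = jobTmp.getFileSystem(context.getConfiguration());
    if (fs.exists(jobTmp)) {
      fs.delete(jobTmp, true);
    }
  }
}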

Thanks,
Jim



> FileOutputCommitter does not safely clean up its temporary files
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1471
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1471
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Jim Finnessy
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When the FileOutputCommitter cleans up during its cleanupJob method, it potentially
> deletes the temporary files of other concurrent jobs.
> Since the temporary files for all concurrent jobs are written to working_path/_temporary/,
> any concurrent job that shares the same working_path will remove the in-flight output of
> all currently executing jobs when it removes working_path/_temporary during job cleanup.
> If the output file name is guaranteed by the client application to be unique, the temporary
> files/directories should also be guaranteed to be unique to avoid this problem. Suggest
> modifying cleanupJob to only remove files that it created itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

