incubator-crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-132) Repeated runs result in duplicated output data
Date Thu, 13 Dec 2012 19:22:15 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531356#comment-13531356 ]

Gabriel Reid commented on CRUNCH-132:
-------------------------------------

I've actually gotten so used to this in Crunch that it annoys me that default MapReduce throws
an exception if the output directory exists. 

I'd be more in favor of making this behavior configurable somehow, maybe with three options:
refusing to overwrite, writing new files next to existing ones, or totally wiping out
the output directory first if it exists. What do you think?
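
To make the three options concrete, here is a minimal sketch of what such a setting could look like. It is an illustration only: the `ExistingOutputPolicy` enum, the `prepare` method, and the class name are hypothetical and not part of the Crunch or Hadoop API, and it works on the local filesystem through `java.nio.file` rather than on HDFS. The `part-r-NNNNN` naming follows the example in the issue description.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OutputModeSketch {

    /** Hypothetical policy names; not taken from Crunch or Hadoop. */
    enum ExistingOutputPolicy { FAIL, APPEND, OVERWRITE }

    /**
     * Prepares outputDir according to the policy and returns the index
     * that the next part file should use (0 for a fresh or wiped directory).
     */
    static int prepare(Path outputDir, ExistingOutputPolicy policy) throws IOException {
        if (!Files.exists(outputDir)) {
            Files.createDirectories(outputDir);
            return 0;
        }
        switch (policy) {
            case FAIL:
                // Mirrors plain MapReduce: refuse to start if the output exists.
                throw new IOException("Output directory already exists: " + outputDir);
            case OVERWRITE:
                // Wipe the directory, then recreate it empty.
                deleteRecursively(outputDir);
                Files.createDirectories(outputDir);
                return 0;
            case APPEND:
                // Write next to existing files; assumes existing part files
                // are numbered contiguously from 0.
                return countPartFiles(outputDir);
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    /** Formats a reducer output file name such as part-r-00001. */
    static String partFileName(int index) {
        return String.format("part-r-%05d", index);
    }

    private static int countPartFiles(Path dir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-r-*")) {
            for (Path ignored : parts) {
                count++;
            }
        }
        return count;
    }

    private static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

With `APPEND`, a second run against a directory that already holds `part-r-00000` would get index 1 and write `part-r-00001`, which is the behavior reported in this issue; `FAIL` corresponds to what a standard MapReduce job does.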
                
> Repeated runs result in duplicated output data
> ----------------------------------------------
>
>                 Key: CRUNCH-132
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-132
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>
> Usually when you run a mapreduce job and the output directory already exists, the job
> fails (won't start). A Crunch job does run, but results in the output data being duplicated
> in the output directory with numbered files that follow on from the previous run.
> Example
> Run 1, single reducer /output -> /output/part-r-00000
> Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001
> I didn't realise I'd run my job twice, so when I looked in the directory it seemed that
> there had been 2 reducers and somehow the output had been generated twice, which was confusing.
>
> I realise this may be by design, but it feels wrong to me. I'd prefer if the behaviour
> of a standard mapreduce job was preserved.

