incubator-crunch-dev mailing list archives

From "Dave Beech (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-132) Repeated runs result in duplicated output data
Date Thu, 13 Dec 2012 15:24:14 GMT
Dave Beech created CRUNCH-132:
---------------------------------

             Summary: Repeated runs result in duplicated output data
                 Key: CRUNCH-132
                 URL: https://issues.apache.org/jira/browse/CRUNCH-132
             Project: Crunch
          Issue Type: Bug
    Affects Versions: 0.4.0
            Reporter: Dave Beech


Usually, when you run a MapReduce job and the output directory already exists, the job fails
(it won't start). A Crunch job does run, but the output data ends up duplicated in the
output directory, in numbered files that follow on from those of the previous run.

Example
Run 1, single reducer /output -> /output/part-r-00000
Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001

I didn't realise I'd run my job twice, so when I looked in the directory it seemed that there
had been 2 reducers and somehow the output had been generated twice, which was confusing.


I realise this may be by design, but it feels wrong to me. I'd prefer it if the behaviour of
a standard MapReduce job were preserved.
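
To make the requested behaviour concrete, here is a minimal sketch of a fail-fast guard that refuses to proceed when the output directory already exists. It is an illustration only, not Crunch or Hadoop code: the class name OutputGuard and the method requireAbsent are hypothetical, and it uses java.nio.file on the local filesystem as a stand-in for the HDFS check a plain MapReduce job performs.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Crunch or Hadoop.
public class OutputGuard {

    /**
     * Throws if the output directory already exists, mimicking the
     * behaviour the report describes for a standard MapReduce job.
     * Assumption: a local path stands in for an HDFS path here.
     */
    public static void requireAbsent(Path outputDir) throws FileAlreadyExistsException {
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(
                    outputDir.toString(), null, "output directory already exists");
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("crunch-132");
        Path output = base.resolve("output");

        // Run 1: the directory does not exist yet, so the guard passes.
        requireAbsent(output);
        Files.createDirectory(output);

        // Run 2: the directory is now present, so the guard refuses to continue
        // instead of letting a second set of part files be written.
        try {
            requireAbsent(output);
            System.out.println("run 2 allowed");
        } catch (FileAlreadyExistsException e) {
            System.out.println("run 2 refused: " + e.getMessage());
        }
    }
}
```

Calling a check like this before starting the pipeline would turn an accidental second run into an immediate error rather than a directory containing both part-r-00000 and part-r-00001.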

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
