incubator-crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-132) Add configurable behavior for when a pipeline output directory already exists
Date Sun, 10 Feb 2013 21:09:13 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575523#comment-13575523 ]

Gabriel Reid commented on CRUNCH-132:
-------------------------------------

I think the behavior you described (needing to use the APPEND strategy on the second
call to Pipeline#write) actually makes a lot of sense, although I think it would be better
to apply it consistently: the second call to Pipeline#write in your last example should
fail unless APPEND is used, even though there's no call to Pipeline#run in between.

Of course, this means that the second call to Pipeline#write could also use the OVERWRITE
strategy. That doesn't make sense there, and it makes it difficult to decide what the
correct behavior is, because the situation isn't easy to detect with the eager evaluation
approach. I'm not sure how to get around that at the moment, but I do think the behavior
should be consistent regardless of whether Pipeline#run is called between calls to
Pipeline#write.

As for the API itself, what do you think of naming the mode enumeration "WriteMode"
(with entries DEFAULT, APPEND, and OVERWRITE) instead of ExistingOutputStrategy?
My gut feeling is that WriteMode is clearer for the public API, while ExistingOutputStrategy
is better suited as an internal name.
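
To make the suggestion concrete, here is a rough, self-contained sketch of how the
proposed WriteMode entries could behave if the check is made at write time, so that it
does not matter whether Pipeline#run is called between writes. Everything except the
names WriteMode, DEFAULT, APPEND and OVERWRITE is hypothetical: WriteModeSketch, Action
and the in-memory path sets are stand-ins for illustration, not Crunch classes and not
the attached patch. Rejecting OVERWRITE on a target that an earlier write already
planned is just one possible answer to the open question above, chosen arbitrarily here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not part of Crunch and not the attached patch. */
public class WriteModeSketch {

    /** Enum names as proposed in this comment. */
    public enum WriteMode { DEFAULT, APPEND, OVERWRITE }

    /** Hypothetical result of a write request, for illustration only. */
    public enum Action { CREATE, APPEND, REPLACE }

    // Simulated filesystem: paths that exist before the pipeline starts.
    private final Set<String> existingOnDisk = new HashSet<>();
    // Paths already targeted by an earlier write call on this pipeline,
    // tracked at write time, so a call to run() in between changes nothing.
    private final Set<String> plannedTargets = new HashSet<>();

    public WriteModeSketch(Set<String> existingOnDisk) {
        this.existingOnDisk.addAll(existingOnDisk);
    }

    /** Decides what a write to the given path would do, or fails fast. */
    public Action write(String path, WriteMode mode) {
        boolean planned = plannedTargets.contains(path);
        boolean onDisk = existingOnDisk.contains(path);
        Action action = decide(planned, onDisk, mode, path);
        plannedTargets.add(path);
        return action;
    }

    private static Action decide(boolean planned, boolean onDisk,
                                 WriteMode mode, String path) {
        if (!planned && !onDisk) {
            return Action.CREATE; // nothing there yet: every mode just writes
        }
        switch (mode) {
            case APPEND:
                return Action.APPEND;
            case OVERWRITE:
                if (planned) {
                    // Assumption: overwriting output planned by the same
                    // pipeline is treated as an error in this sketch.
                    throw new IllegalStateException(
                        "OVERWRITE would discard an earlier write to " + path);
                }
                return Action.REPLACE;
            default:
                // DEFAULT keeps the standard MapReduce behavior: fail.
                throw new IllegalStateException(
                    "Output already exists or is already planned: " + path);
        }
    }

    public static void main(String[] args) {
        WriteModeSketch pipeline = new WriteModeSketch(new HashSet<>());
        System.out.println(pipeline.write("/output", WriteMode.DEFAULT));
        System.out.println(pipeline.write("/output", WriteMode.APPEND));
    }
}
```

With this shape, a second DEFAULT write to the same path fails whether or not the
pipeline has run, which is the consistency argued for above.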
                
> Add configurable behavior for when a pipeline output directory already exists
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-132
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-132
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-132.patch, CRUNCH-132-proto.patch
>
>
> Usually when you run a mapreduce job and the output directory already exists, the job
> fails (won't start). A Crunch job does run, but results in the output data being duplicated
> in the output directory with numbered files that follow on from the previous run.
> Example
> Run 1, single reducer /output -> /output/part-r-00000
> Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001
> I didn't realise I'd run my job twice, so when I looked in the directory it seemed that
> there had been 2 reducers and somehow the output had been generated twice, which was confusing.
>
> I realise this may be by design, but it feels wrong to me. I'd prefer if the behaviour
> of a standard mapreduce job was preserved.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
