Date: Sun, 10 Feb 2013 02:15:12 +0000 (UTC)
From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Subject: [jira] [Updated] (CRUNCH-132) Add configurable behavior for when a pipeline output directory already exists

     [ https://issues.apache.org/jira/browse/CRUNCH-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-132:
------------------------------

    Attachment: CRUNCH-132.patch

Took another crack at this one today and got something that's at least worth considering.

It seems to me that there are two ways to handle the existing output checking: eagerly (as this patch does) or lazily (which seemed to be more difficult to do correctly.) The eager strategy checks to see whether or not the output path exists immediately upon Pipeline.write(collection, target) being called.
The default action in the case that the output exists is to throw a CrunchRuntimeException. This can be overridden to either OVERWRITE or APPEND to the existing output path by calling Pipeline.write(collection, target, strategy) or PCollection.write(target, strategy).

My feeling was that the eager strategy was more in line with users' expectations of how Crunch would work, although it could potentially cause some unexpected failures, e.g.,

    pipeline.write(pcollect1, target);
    pipeline.run();
    pipeline.write(pcollect2, target); // exception thrown

that the user would have to compensate for with, e.g.,

    pipeline.write(pcollect1, target);
    pipeline.run();
    pipeline.write(pcollect2, target, ExistingOutputStrategy.APPEND);

Note that the similar:

    pipeline.write(pcollect1, target);
    pipeline.write(pcollect2, target);
    pipeline.run();

would not throw an exception, as the assumption would be that the user was intentionally writing both collections to the same output target.

If folks have strong feelings about the design here, I am more than happy to hear them.

> Add configurable behavior for when a pipeline output directory already exists
> -----------------------------------------------------------------------------
>
>                 Key: CRUNCH-132
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-132
>             Project: Crunch
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>         Attachments: CRUNCH-132.patch, CRUNCH-132-proto.patch
>
>
> Usually when you run a mapreduce job and the output directory already exists, the job fails (won't start). A Crunch job does run, but results in the output data being duplicated in the output directory with numbered files that follow on from the previous run.
> Example
> Run 1, single reducer /output -> /output/part-r-00000
> Run 2, single reducer /output -> /output/part-r-00000, /output/part-r-00001
> I didn't realise I'd run my job twice, so when I looked in the directory it seemed that there had been 2 reducers and somehow the output had been generated twice, which was confusing.
> I realise this may be by design, but it feels wrong to me. I'd prefer if the behaviour of a standard mapreduce job was preserved.
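
For readers who want to see the eager check from the comment above in isolation, here is a minimal sketch. It is a toy model built on java.nio.file, not the code in CRUNCH-132.patch and not Crunch's API. Assumptions: the class name EagerOutputCheck, the FAIL constant (the comment does not name the default), the use of IllegalStateException as a stand-in for CrunchRuntimeException, and the "pending" set that models targets written since the last run().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

/** Toy model of the eager existing-output check; hypothetical, not Crunch's classes. */
class EagerOutputCheck {

    /** OVERWRITE and APPEND are named in the comment; FAIL is an assumed name for the default. */
    enum ExistingOutputStrategy { FAIL, OVERWRITE, APPEND }

    // Targets registered since the last run(); a repeated write to one of
    // these is treated as intentional and is not checked again.
    private final Set<Path> pending = new HashSet<>();

    void write(Path target) {
        write(target, ExistingOutputStrategy.FAIL);
    }

    void write(Path target, ExistingOutputStrategy strategy) {
        Path key = target.toAbsolutePath().normalize();
        // Eager: the existence check happens here, at write() time, not in run().
        if (!pending.contains(key) && Files.exists(key)) {
            switch (strategy) {
                case FAIL:
                    // Stand-in for the CrunchRuntimeException mentioned in the comment.
                    throw new IllegalStateException("Output path already exists: " + key);
                case OVERWRITE:
                    deleteRecursively(key);
                    break;
                case APPEND:
                    break; // leave the existing files in place
            }
        }
        pending.add(key);
    }

    /** Pretends to run the pipeline: materializes each pending target as a directory. */
    void run() {
        try {
            for (Path target : pending) {
                Files.createDirectories(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pending.clear();
    }

    private static void deleteRecursively(Path root) {
        // Reverse order visits children before their parent directories.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempDirectory("eager-check").resolve("output");
        EagerOutputCheck pipeline = new EagerOutputCheck();
        pipeline.write(target);
        pipeline.run();
        try {
            pipeline.write(target); // target now exists, default strategy
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        pipeline.write(target, ExistingOutputStrategy.APPEND); // accepted
    }
}
```

In this model, two write() calls to the same target before run() pass because the second call finds the target in the pending set. That is one possible way to get the "intentionally writing both collections" behavior; the patch may do it differently.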