From: Harsh J
Date: Fri, 12 Oct 2012 21:05:14 +0530
Subject: Re: concurrency
To: common-user@hadoop.apache.org

Hey Koert,

Yes, the existence of the _SUCCESS file (created at the successful
commit-end of a job) may be checked before firing the new job with the
chosen input directory. This is consistent with what Oozie does as well.
Since the listing of files happens after the submit() call, doing this
will "just work" :)

On Fri, Oct 12, 2012 at 8:00 PM, Koert Kuipers wrote:
> We have a dataset that is heavily partitioned, like this
> /data
>   partition1/
>     _SUCCESS
>     part-00000
>     part-00001
>     ...
>   partition2/
>     _SUCCESS
>     part-00000
>     part-00001
>     ...
>   ...
>
> We have loaders that use map-red jobs to add new partitions to this data
> set at a regular interval (so they write to new sub-directories).
>
> We also have map-red queries that read from the entire dataset (/data/*).
> My worry here is concurrency. It will happen that a query job runs while a
> loader job is adding a new partition at the same time. Is there a risk that
> the query could read incomplete or corrupt files? Is there a way to use the
> _SUCCESS files to prevent this from happening?
> Thanks for your time!
> Best,
> Koert

--
Harsh J
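
A minimal sketch of the check described in the reply: select only the
partition directories that already contain a _SUCCESS marker, and pass just
those to the query job as input paths. It assumes a local directory tree
standing in for HDFS; on a real cluster the same test would go through the
HDFS client instead (for example Hadoop's FileSystem.exists). The function
name committed_partitions is hypothetical and is not part of Hadoop or Oozie.

```python
import tempfile
from pathlib import Path

# Marker file written when a job commits successfully.
SUCCESS_MARKER = "_SUCCESS"


def committed_partitions(data_root):
    """Return the partition directories under data_root that hold a _SUCCESS file.

    Hypothetical helper: it uses the local filesystem as a stand-in for HDFS.
    """
    root = Path(data_root)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and (p / SUCCESS_MARKER).is_file()
    )


if __name__ == "__main__":
    # Build a tiny example layout: one finished partition, one still loading.
    with tempfile.TemporaryDirectory() as tmp:
        done = Path(tmp) / "partition1"
        loading = Path(tmp) / "partition2"
        done.mkdir()
        loading.mkdir()
        (done / "part-00000").write_text("complete data\n")
        (done / SUCCESS_MARKER).touch()
        (loading / "part-00000").write_text("partial data\n")

        # Only the committed partition is selected as job input.
        for path in committed_partitions(tmp):
            print(path.name)
```

The resulting list would be used as the job's explicit input paths in place
of the /data/* glob, so a partition that a loader is still writing is simply
left out of that run.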