From: Hemanth Yamijala <hemanty@thoughtworks.com>
To: user@hadoop.apache.org
Date: Wed, 26 Sep 2012 14:22:48 +0530
Subject: Re: Detect when file is not being written by another process

Agree with Bejoy. The problem you've mentioned sounds like building
something like a workflow, which is what Oozie is supposed to do.

Thanks
hemanth

On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks wrote:
> Hi Peter
>
> AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as
> soon as the files are written to a certain hdfs directory.
>
> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <
> psheridan@millennialmedia.com> wrote:
>
>> These are log files being deposited by other processes, which we may
>> not have control over.
>>
>> We don't want multiple processes to write to the same files — we just
>> don't want to start our jobs until they have been completely written.
>>
>> Sorry for lack of clarity & thanks for the response.
>>
>> --Pete
>>
>> From: Bertrand Dechoux
>> Reply-To: "user@hadoop.apache.org"
>> Date: Tuesday, September 25, 2012 12:33 PM
>> To: "user@hadoop.apache.org"
>> Subject: Re: Detect when file is not being written by another process
>>
>> Hi,
>>
>> Multiple files and aggregation or something like hbase?
>>
>> Could you tell us more about your context? What are the volumes? Why do
>> you want multiple processes to write to the same file?
>>
>> Regards
>>
>> Bertrand
>>
>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
>> psheridan@millennialmedia.com> wrote:
>>
>>> Hi all.
>>>
>>> We're using Hadoop 1.0.3. We need to pick up a set of large (4+ GB)
>>> files when they've finished being written to HDFS by a different process.
>>> There doesn't appear to be an API specifically for this. We had
>>> discovered through experimentation that the FileSystem.append() method can
>>> be used for this purpose — it will fail if another process is writing to
>>> the file.
>>>
>>> However: when running this on a multi-node cluster, using that API
>>> actually corrupts the file. Perhaps this is a known issue? Looking at the
>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>>> bunch of similar-sounding things.
>>>
>>> What's the right way to solve this problem? Thanks.
>>>
>>> --Pete
>>
>> --
>> Bertrand Dechoux
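Since the append() probe corrupted files on the multi-node cluster, one pragmatic heuristic (not proposed in this thread, but a common ingest workaround) is a size-stability poll: re-read the file's length and treat the file as complete once it has stopped growing for several consecutive checks. The sketch below is hypothetical and uses Python's local-file os.path.getsize for brevity; on HDFS the equivalent length call would be FileSystem.getFileStatus(path).getLen(). The function name and the interval/check parameters are illustrative assumptions, not an API.

```python
import os
import time

def wait_until_stable(path, interval=5.0, checks=3):
    """Block until the size of `path` has stayed constant for `checks`
    consecutive polls spaced `interval` seconds apart, then return the
    final size. Purely a heuristic: a writer that pauses for longer than
    roughly checks * interval seconds will be misreported as finished."""
    last_size = -1
    stable = 0
    while stable < checks:
        size = os.path.getsize(path)  # on HDFS: fs.getFileStatus(p).getLen()
        if size == last_size:
            stable += 1      # size unchanged since last poll
        else:
            stable = 0       # still growing; restart the stability count
            last_size = size
        time.sleep(interval)
    return last_size
```

A more robust convention, where the producer can be changed, is to write to a temporary name and rename atomically on completion, or to drop a marker file when done; Oozie coordinator datasets build on exactly that idea with their "done flag" (by default a _SUCCESS file in the directory).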