From: Hemanth Yamijala <hemanty@thoughtworks.com>
To: user@hadoop.apache.org
Date: Wed, 26 Sep 2012 14:22:48 +0530
Subject: Re: Detect when file is not being written by another process

Agree with Bejoy. The problem you've mentioned sounds like building
something like a workflow, which is what Oozie is supposed to do.

Thanks
hemanth

On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks wrote:
> Hi Peter
>
> AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as
> soon as the files are written to a certain hdfs directory.
>
> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <
> psheridan@millennialmedia.com> wrote:
>
>> These are log files being deposited by other processes, which we may
>> not have control over.
>>
>> We don't want multiple processes to write to the same files — we just
>> don't want to start our jobs until they have been completely written.
>>
>> Sorry for lack of clarity & thanks for the response.
>>
>> --Pete
>>
>> From: Bertrand Dechoux
>> Reply-To: "user@hadoop.apache.org"
>> Date: Tuesday, September 25, 2012 12:33 PM
>> To: "user@hadoop.apache.org"
>> Subject: Re: Detect when file is not being written by another process
>>
>> Hi,
>>
>> Multiple files and aggregation or something like hbase?
>>
>> Could you tell us more about your context? What are the volumes? Why do
>> you want multiple processes to write to the same file?
>>
>> Regards
>>
>> Bertrand
>>
>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
>> psheridan@millennialmedia.com> wrote:
>>
>>> Hi all.
>>>
>>> We're using Hadoop 1.0.3. We need to pick up a set of large (4+ GB)
>>> files when they've finished being written to HDFS by a different process.
>>> There doesn't appear to be an API specifically for this. We had
>>> discovered through experimentation that the FileSystem.append() method can
>>> be used for this purpose — it will fail if another process is writing to
>>> the file.
>>>
>>> However: when running this on a multi-node cluster, using that API
>>> actually corrupts the file. Perhaps this is a known issue? Looking at the
>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>>> bunch of similar-sounding things.
>>>
>>> What's the right way to solve this problem? Thanks.
>>>
>>> --Pete
>>
>> --
>> Bertrand Dechoux
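Since the append() probe corrupted files on the multi-node cluster, one pragmatic heuristic (not proposed in this thread, but a common ingest workaround) is a size-stability poll: re-read the file's length and treat the file as complete once it has stopped growing for several consecutive checks. The sketch below is hypothetical and uses Python's local-file os.path.getsize for brevity; on HDFS the equivalent length call would be FileSystem.getFileStatus(path).getLen(). The function name and the interval/check parameters are illustrative assumptions, not an API.

```python
import os
import time

def wait_until_stable(path, interval=5.0, checks=3):
    """Block until the size of `path` has stayed constant for `checks`
    consecutive polls spaced `interval` seconds apart, then return the
    final size. Purely a heuristic: a writer that pauses for longer than
    roughly checks * interval seconds will be misreported as finished."""
    last_size = -1
    stable = 0
    while stable < checks:
        size = os.path.getsize(path)  # on HDFS: fs.getFileStatus(p).getLen()
        if size == last_size:
            stable += 1      # size unchanged since last poll
        else:
            stable = 0       # still growing; restart the stability count
            last_size = size
        time.sleep(interval)
    return last_size
```

A more robust convention, where the producer can be changed, is to write to a temporary name and rename atomically on completion, or to drop a marker file when done; Oozie coordinator datasets build on exactly that idea with their "done flag" (by default a _SUCCESS file in the directory).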