From: Peter Sheridan <psheridan@millennialmedia.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Detect when file is not being written by another process
Date: Tue, 25 Sep 2012 16:53:32 +0000
These are log files being deposited by other processes, which we may not have control over.

We don't want multiple processes to write to the same files; we just don't want to start our jobs until they have been completely written.

Sorry for lack of clarity & thanks for the response.


--Pete

From: Bertrand Dechoux <dechouxb@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, September 25, 2012 12:33 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Detect when file is not being written by another process

Hi,

Multiple files and aggregation or something like hbase?

Could you tell us more about your context? What are the volumes? Why do you want multiple processes to write to the same file?

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <psheridan@millennialmedia.com> wrote:
Hi all.

We're using Hadoop 1.0.3. We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this. We discovered through experimentation that the FileSystem.append() method can be used for this purpose; it will fail if another process is writing to the file.

However: when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things.

What's the right way to solve this problem?  Thanks.


--Pete




--
Bertrand Dechoux