Subject: Re: read a changing hdfs file
From: Shahab Yunus <shahab.yunus@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 20 Aug 2013 20:57:17 -0400

As far as I understand (and experts can correct me), data in a file that is being written becomes visible to readers once a full HDFS block's worth of data has been written, and the same holds for each subsequent block. Essentially, a block's worth of data is the level of coherency: the unit of data for which visibility and durability are guaranteed. You can force the issue by calling the sync methods (*hsync/hflush) to flush your writes to the file system so they become visible as you write them, but that comes at the cost of reduced performance. So it really depends on your application and requirements, i.e. the trade-off between performance and data visibility/durability.
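For illustration, here is a minimal writer-side sketch of that pattern. The class name, path, and loop are made up for the example, but hflush()/hsync() are the Syncable methods exposed by FSDataOutputStream:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushingWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/test.txt"));
            try {
                for (int i = 0; i < 100; i++) {
                    out.writeBytes("line " + i + "\n");
                    // hflush(): push buffered data to the datanodes so that new
                    // readers can see it without waiting for a full block.
                    out.hflush();
                    // hsync() additionally asks the datanodes to persist the data
                    // to disk: stronger durability, higher cost.
                    // out.hsync();
                }
            } finally {
                out.close();
            }
        }
    }

Each hflush() costs a round trip through the write pipeline, which is the performance penalty mentioned above.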
*Read more about the definition, differences, and use of the appropriate method here:
http://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html

Regards,
Shahab

On Tue, Aug 20, 2013 at 5:36 PM, Wu, Jiang2 <jiang2.wu@citi.com> wrote:
> Hi,
>
> I did some experiments reading a changing HDFS file. It seems that the
> read takes a snapshot at the moment the file is opened, and does not see
> any data appended to the file afterwards. This is different from what
> happens when reading a changing local file. My code is as follows:
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.URI;
> import java.util.Scanner;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
>
> Configuration conf = new Configuration();
> InputStream in = null;
> try {
>     FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
>     in = fs.open(new Path("/tmp/test.txt"));
>     Scanner scanner = new Scanner(in);
>     while (scanner.hasNextLine()) {
>         System.out.println("+++++++++++++++++++++++++++++++ read " + scanner.nextLine());
>     }
>     System.out.println("+++++++++++++++++++++++++++++++ reader finished ");
> } catch (IOException e) {
>     e.printStackTrace();
> } finally {
>     IOUtils.closeStream(in);
> }
>
> I'm wondering whether this is the designed HDFS reading behavior, or
> whether it can be changed by using a different API or configuration. What
> I expect is the same behavior as local file reading: when a reader reads a
> file while another writer is writing to it, the reader receives all the
> data written by the writer.
>
> Thanks,
> Jiang
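For completeness, a minimal reader-side sketch under the same assumptions (the path, polling interval, and class name are illustrative): since, as observed in the question above, an already-open stream does not see data appended after it was opened, a tailer can poll the file length and reopen the file, seeking past the bytes it has already consumed:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TailingReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
            Path path = new Path("/tmp/test.txt");
            long offset = 0;
            while (true) {
                // The visible length grows as blocks complete (or the writer
                // hflushes); data still in the writer's buffers is not reflected.
                long len = fs.getFileStatus(path).getLen();
                if (len > offset) {
                    FSDataInputStream in = fs.open(path);
                    try {
                        in.seek(offset);                   // skip already-consumed bytes
                        byte[] buf = new byte[(int) (len - offset)];
                        in.readFully(buf);                 // read only newly visible bytes
                        System.out.print(new String(buf, "UTF-8"));
                        offset = len;
                    } finally {
                        in.close();
                    }
                }
                Thread.sleep(1000);                        // illustrative polling interval
            }
        }
    }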