From: Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomireddy@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 20 Jun 2014 11:24:41 +0530
Subject: Re: HDFS File Writes & Reads

Yong,

Thanks for the clarification. It was more of an academic query. We do not have any performance requirements at this stage.

Regards
Vijay


On 19 June 2014 19:05, java8964 <java8964@hotmail.com> wrote:

Your understanding is almost correct, except for the part you highlighted.

HDFS is not designed for write performance, but the client does not have to wait for the acknowledgment of previous packets before sending the next ones.

This webpage describes it clearly, and I hope it is helpful for you:

http://aosabook.org/en/hdfs.html

Quoting from it: "The next packet can be pushed to the pipeline before receiving the acknowledgment for the previous packets. The number of outstanding packets is limited by the outstanding packets window size of the client."
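As a concrete illustration, here is a minimal client-side write sketch (the path, buffer size, and loop count are made up, not taken from the thread). All of the packet chopping, pipelining, and the outstanding-packet window live inside the stream that create() returns; user code just writes bytes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // create() sets up block allocation and the DataNode pipeline;
            // replication defaults to 3 unless overridden in the config.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
                byte[] buf = new byte[64 * 1024];
                for (int i = 0; i < 1024; i++) {
                    // Each write() is chopped into packets by the client;
                    // packets stream down the pipeline without waiting for
                    // the previous packet's ack, up to the window size.
                    out.write(buf);
                }
            } // close() flushes the remaining packets and waits for the final acks.
        }
    }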
Do you have any performance requirements for ingesting data into HDFS?

Yong

------------------------------
Date: Thu, 19 Jun 2014 11:51:43 +0530
Subject: Re: HDFS File Writes & Reads
From: vijay.bhoomireddy@gmail.com
To: user@hadoop.apache.org

@Zesheng Wu, thanks for the response.

I still don't understand how HDFS reduces the time to write and read a file compared to a traditional file read/write mechanism.

For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 DataNodes. My understanding is that for each block, the client first writes the block to the first DataNode in the pipeline, which then informs the second, and so on. Once the third DataNode successfully receives the block, it sends an acknowledgement back to DataNode 2 and finally to the client through DataNode 1. *Only after receiving the acknowledgement for the block is the write considered successful, and the client proceeds to write the next block.*

If this is the case, then the time taken to write each block is three times that of a normal write, due to the replication factor, and the write process happens sequentially, block after block.

Please correct me if I am wrong in my understanding. Also, two further questions:

1. My understanding is that file reads/writes in Hadoop don't have any parallelism, and the best they can perform is the same as a traditional file read or write, plus some overhead from the distributed communication mechanism.
2. Parallelism is provided only during the data processing phase via MapReduce, not during file reads/writes by a client.

Regards
Vijay


On 17 June 2014 19:37, Zesheng Wu <wuzesheng86@gmail.com> wrote:

1. HDFS doesn't allow parallel writes to the same file.
2. HDFS uses a pipeline to write the replicas, so it doesn't take three times longer than a traditional file write (a rough illustration follows below).
3. HDFS does allow parallel reads (see the read sketch below).
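To see why point 2 holds, a rough back-of-envelope illustration (the numbers are made up): suppose pushing one block to a single DataNode takes time T. Because each DataNode forwards every packet downstream as soon as it receives it, the second and third replicas are written almost concurrently with the first, so one block costs roughly T plus a couple of packet-forwarding delays, not 3T. A file of N blocks therefore takes on the order of N*T plus a small pipeline-fill overhead, rather than 3*N*T.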
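For point 3, here is a minimal sketch of parallel reads against one file (the file name, thread count, and buffer size are made up). The positioned read(position, buffer, offset, length) call does not move the stream's file pointer, so separate threads can read different block ranges at the same time, each from whichever DataNode holds a replica of that range:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsParallelReadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.dat");
            long blockSize = fs.getFileStatus(file).getBlockSize();
            try (FSDataInputStream in = fs.open(file)) {
                Thread[] readers = new Thread[3];
                for (int i = 0; i < readers.length; i++) {
                    final long offset = i * blockSize; // one reader per block
                    readers[i] = new Thread(() -> {
                        byte[] buf = new byte[8 * 1024];
                        try {
                            // Positioned read: independent of the other
                            // threads' positions, served by a DataNode that
                            // stores the block containing this offset.
                            in.read(offset, buf, 0, buf.length);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    });
                    readers[i].start();
                }
                for (Thread t : readers) {
                    t.join();
                }
            }
        }
    }

This is also what MapReduce builds on: each map task opens the file and reads only its own split, so the parallelism during processing that you mention in your question 2 rests on parallel HDFS reads.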

2014-06-17 19:17 GMT+08:00 Vijaya Narayana Reddy Bhoomi Reddy <vijay.bhoomireddy@gmail.com>:

Hi,

I have a basic question regarding file writes and reads in HDFS. Are the file write and read processes sequential activities, or executed in parallel?

For example, let's assume that there is a file File1 which consists of three blocks B1, B2 and B3.

1. Will the write process write B2 only after B1 is complete, and B3 only after B2 is complete, or, for a large file with many blocks, can this happen in parallel? In all the Hadoop documentation, I read this to be a sequential operation. Does that mean that for a file of 1 TB, it takes three times longer than a traditional file write (due to the default replication factor of 3)?
2. Is it similar in the case of reads as well?

Could someone please provide some clarity on this?

Regards
Vijay


--
Best Wishes!

Yours, Zesheng