Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of reena2485@outlook.com
 designates 65.55.90.98 as permitted sender)
Message-ID: <SNT405-EAS653E5E026844180580D343AA600@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_f4067af5-3a18-4033-b1ff-de2937ce3774_"
MIME-Version: 1.0
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
From: reena upadhyay <reena2485@outlook.com>
Subject: RE: How check sum are generated for blocks in data node
Date: Mon, 31 Mar 2014 00:19:28 +0530

--_f4067af5-3a18-4033-b1ff-de2937ce3774_
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Thank you so much for helping me understand the concept of checksums.

________________________________
From: Wellington Chevreuil<mailto:wellington.chevreuil@gmail.com>
Sent: 29-03-2014 00:12
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: How check sum are generated for blocks in data node

Hi Reena,

The pipeline is per block. If half of your file is on data node A only,
that means the pipeline for those blocks had only one node (node A in this
case, probably because the replication factor is set to 1), so data node A
has the checksums for its blocks. The same applies to data node B.
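
To make "the pipeline is per block" concrete, here is a toy sketch in
Python. Everything in it is an assumption for illustration: the function
name plan_pipelines, the tiny sizes, and the round-robin choice of nodes
are made up. In a real cluster the NameNode picks the nodes for each
block, so treat this only as a picture of the idea.

```python
# Toy model only: sizes, node names and round-robin placement are made up.
def plan_pipelines(file_size, block_size, nodes, replication):
    """Return one pipeline (a list of node names) per block of the file."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    pipelines = []
    for i in range(num_blocks):
        # Each block gets its own pipeline with `replication` nodes in it.
        pipeline = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        pipelines.append(pipeline)
    return pipelines


# Two blocks, replication factor 1: each pipeline has exactly one node.
print(plan_pipelines(file_size=200, block_size=100, nodes=["A", "B"], replication=1))
# Same file, replication factor 2: each pipeline now has two nodes.
print(plan_pipelines(file_size=200, block_size=100, nodes=["A", "B"], replication=2))
```

With replication 1 every node is both the first and the last node of its
own pipeline, which is the situation you describe.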

All nodes will have checksums for the blocks they own. The checksums are
passed together with the block as it goes through the pipeline. Because the
last node in the pipeline receives the original checksums along with the
block from the previous nodes, the validation only needs to happen on this
last node: if it passes there, the block was not corrupted on any of the
previous nodes either.
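
A rough sketch of that flow, again in Python and again only a model.
The names chunk_checksums and write_block, the 512-byte chunk size, and
the use of zlib.crc32 as a stand-in checksum are my assumptions for the
example, not the actual DataNode code.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size assumed for this illustration


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC per fixed-size chunk (CRC32 used as a stand-in)."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def write_block(data, pipeline, corrupt_at=None):
    """Pass a block and its checksums through the pipeline.

    Every node stores the data and the checksums it received. Only the
    last node recomputes the checksums and compares them.
    Returns (stores, ok).
    """
    checksums = chunk_checksums(data)  # computed once, by the writer
    stores = {}
    for node in pipeline:
        if node == corrupt_at:
            # Simulate corruption on this hop by flipping the first byte.
            data = bytes([data[0] ^ 0xFF]) + data[1:]
        stores[node] = (data, checksums)
    # Verification as done by the last node in the pipeline.
    ok = chunk_checksums(data) == checksums
    return stores, ok


block = b"x" * 1300
stores, ok = write_block(block, ["A", "B"])
print(sorted(stores), ok)

_, ok_after_corruption = write_block(block, ["A", "B"], corrupt_at="A")
print(ok_after_corruption)
```

Both A and B end up holding the same checksums, and corruption introduced
at A is still caught at B because B compares what it received against the
checksums computed by the writer.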

Cheers.

On 28 Mar 2014, at 10:28, reena upadhyay <reena2485@outlook.com> wrote:

> I was going through this link:
> http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
> It is written there that in recent versions of Hadoop only the last data
> node verifies the checksum, as the write happens in a pipeline fashion.
> Now I have a question:
> Assuming my cluster has two data nodes, A and B, I have a file where half
> of the content is written to the first data node A and the remaining half
> is written to the second data node B, to take advantage of parallelism.
> My question is: will data node A not store the checksums for the blocks
> stored on it?
>
> As per the line "only the last data node verifies the checksum", it looks
> like only the last data node, which in my case is data node B, will
> generate the checksum. But if only data node B generates checksums, then
> it will generate them only for the blocks stored on data node B. What
> about the checksums for the data blocks on data node A?


--_f4067af5-3a18-4033-b1ff-de2937ce3774_--