From: reena upadhyay
To: "user@hadoop.apache.org"
Subject: RE: How check sum are generated for blocks in data node
Date: Mon, 31 Mar 2014 00:19:28 +0530

Thank you so much for helping me understand the concept of checksums.

Sent from my Windows Phone

From: Wellington Chevreuil
Sent: 29-03-2014 00:12
To: user@hadoop.apache.org
Subject: Re: How check sum are generated for blocks in data node

Hi Reena,

The pipeline is per block. If half of your file is on data node A only, that means the pipeline had only one node (node A, in this case, probably because the replication factor is set to 1), and data node A has the checksums for its blocks. The same applies to data node B.

All nodes will have checksums for the blocks they own. The checksums are passed along together with the block as it goes through the pipeline. Because the last node in the pipeline receives the original checksums along with the block from the previous nodes, the validation only needs to be done on this last one: if it passes there, it means the block was not corrupted on any of the previous nodes either.
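
The behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not HDFS code: the names DataNode, write_block and chunk_checksums are hypothetical, and the use of CRC32 over 512-byte chunks is an assumption made for the example. It shows every node in a per-block pipeline storing the checksums it receives, while only the last node recomputes and compares them.

```python
import zlib

# Assumption for this sketch: one CRC32 per 512-byte chunk of a block.
BYTES_PER_CHECKSUM = 512


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return the CRC32 of each fixed-size chunk of `data`."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


class DataNode:
    """Hypothetical stand-in for a data node; not an HDFS class."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> (data, checksums)

    def receive(self, block_id, data, checksums, downstream):
        # Every node keeps the block together with the checksums it was sent.
        self.blocks[block_id] = (data, checksums)
        if downstream:
            # Not the last node: forward block and checksums unchanged.
            return downstream[0].receive(block_id, data, checksums,
                                         downstream[1:])
        # Last node in the pipeline: recompute and compare.
        return chunk_checksums(data) == checksums


def write_block(block_id, data, pipeline):
    """The writer computes the checksums once and sends them with the block."""
    checksums = chunk_checksums(data)
    return pipeline[0].receive(block_id, data, checksums, pipeline[1:])


# Two blocks of one file, each written through its own single-node pipeline,
# as in the replication-factor-1 case discussed in this thread.
node_a, node_b = DataNode("A"), DataNode("B")
file_data = bytes(range(256)) * 8  # 2048 bytes of sample data
half = len(file_data) // 2
ok_a = write_block("blk_1", file_data[:half], [node_a])
ok_b = write_block("blk_2", file_data[half:], [node_b])
print(ok_a, ok_b, sorted(node_a.blocks), sorted(node_b.blocks))
```

With a pipeline of one node, that node is also the last node, so both A and B end up holding (and verifying) checksums for their own blocks. Passing a longer list such as [node_a, node_b] to write_block models a replicated block, where both nodes store the checksums but only node_b performs the comparison.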

Cheers.

On 28 Mar 2014, at 10:28, reena upadhyay <reena2485@outlook.com> wrote:

> I was going through this link: http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . It says that in recent versions of Hadoop only the last data node verifies the checksum, as the write happens in a pipeline fashion.
> Now I have a question:
> Assume my cluster has two data nodes, A and B. I have a file; half of its content is written to the first data node, A, and the remaining half is written to the second data node, B, to take advantage of parallelism. My question is: will data node A not store the checksum for the blocks stored on it?
>
> As per the line "only the last data node verifies the checksum", it looks like only the last data node (in my case, data node B) will generate the checksum. But if only data node B generates checksums, then it will generate them only for the blocks stored on data node B. What about the checksums for the data blocks on data node A?
