hadoop-hdfs-user mailing list archives

From reena upadhyay <reena2...@outlook.com>
Subject RE: How check sum are generated for blocks in data node
Date Sun, 30 Mar 2014 18:49:28 GMT
Thank you so much for helping me understand the concept of checksums.

Sent from my Windows Phone
From: Wellington Chevreuil<mailto:wellington.chevreuil@gmail.com>
Sent: 29-03-2014 00:12
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: How check sum are generated for blocks in data node

Hi Reena,

The pipeline is per block. If half of your file is on data node A only, that means the
pipeline for those blocks had only one node (node A in this case, probably because the
replication factor is set to 1), so data node A has the checksums for the blocks it holds.
The same applies to data node B.
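To make the "pipeline is per block" point concrete, here is a rough Python sketch, not HDFS code. The names `chunk_checksums` and `place_blocks` and the tiny `BLOCK_SIZE` are hypothetical, `zlib.crc32` is only a stand-in for whatever checksum the cluster is configured to use, and one checksum per 512-byte chunk is an assumption about the configuration.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumption: one checksum per 512-byte chunk
BLOCK_SIZE = 2048         # assumption: tiny block size, for illustration only


def chunk_checksums(block: bytes) -> list[int]:
    """Return one CRC32 per BYTES_PER_CHECKSUM-sized chunk of the block."""
    return [
        zlib.crc32(block[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(block), BYTES_PER_CHECKSUM)
    ]


def place_blocks(data: bytes, pipelines: list[list[str]]) -> dict[str, dict[int, list[int]]]:
    """Send block i through pipelines[i]; every node in that pipeline keeps
    the checksums for that block. Returns {node: {block_index: checksums}}."""
    stored: dict[str, dict[int, list[int]]] = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for index, (block, pipeline) in enumerate(zip(blocks, pipelines)):
        sums = chunk_checksums(block)
        for node in pipeline:
            stored.setdefault(node, {})[index] = sums
    return stored


# A 4096-byte "file" becomes two blocks; with replication factor 1 each
# block has a one-node pipeline, so A and B each hold checksums for their own block.
data = bytes(range(256)) * 16
stored = place_blocks(data, [["A"], ["B"]])
print({node: sorted(blocks) for node, blocks in stored.items()})
```

In this sketch node A ends up with checksums for block 0 and node B with checksums for block 1, which matches the two-node scenario in the question.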

All nodes will have checksums for the blocks they own. Checksums are passed along with the
block as it goes through the pipeline. Because the last node in the pipeline receives the
original checksums along with the block from the previous nodes, validation is only needed
on this last node: if it passes there, the block was not corrupted on any of
the previous nodes either.
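The reasoning above can be sketched roughly as follows. This is an illustrative Python model, not the actual DataNode implementation: `send_through_pipeline` and its `corrupt_at` parameter are hypothetical names, and `zlib.crc32` with 512-byte chunks is an assumed stand-in for the real checksum settings.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumption: one checksum per 512-byte chunk


def checksums(data: bytes) -> list[int]:
    """One CRC32 per chunk, computed once on the writing client's side."""
    return [
        zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]


def send_through_pipeline(block, sums, nodes, corrupt_at=None):
    """Forward the block and its original checksums from node to node.

    Every node stores what it received; only the last node recomputes the
    checksums and compares them. corrupt_at (hypothetical test hook) is the
    index of a node that flips a byte before forwarding.
    Returns (verified, {node: (block, checksums)}).
    """
    if not nodes:
        raise ValueError("pipeline needs at least one node")
    stored = {}
    for position, node in enumerate(nodes):
        stored[node] = (block, sums)          # each node keeps block + checksums
        if position == len(nodes) - 1:
            return checksums(block) == sums, stored  # verify on the last node only
        if position == corrupt_at:
            block = bytes([block[0] ^ 0xFF]) + block[1:]  # simulate corruption in transit


block = b"x" * 1500
sums = checksums(block)

ok, stored = send_through_pipeline(block, sums, ["A", "B", "C"])
bad, _ = send_through_pipeline(block, sums, ["A", "B", "C"], corrupt_at=1)
print(ok, bad)
```

Because the checksums travel unchanged while the data passes through every node, a mismatch at the last node exposes corruption introduced anywhere upstream, and a match there vouches for the earlier hops as well.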


On 28 Mar 2014, at 10:28, reena upadhyay <reena2485@outlook.com> wrote:

> I was going through this link: http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
> It says that in recent versions of Hadoop only the last data node verifies the checksum,
> as the write happens in a pipeline fashion.
> Now I have a question:
> Assume my cluster has two data nodes, A and B. I have a file: half of its
> content is written to data node A and the remaining half is written to
> data node B to take advantage of parallelism. My question is: will data node A not
> store the checksums for the blocks stored on it?
> As per the line "only the last data node verifies the checksum", it looks like only the
> last data node, which in my case is data node B, will generate the checksum. But if only
> data node B generates checksums, it will generate them only for the blocks stored
> on data node B. What about the checksums for the data blocks on data node A?
