hadoop-hdfs-user mailing list archives

From Uma Maheswara Rao G <mahesw...@huawei.com>
Subject RE: Generation Stamp
Date Tue, 29 Nov 2011 17:18:39 GMT
Yes. :-)
From: kartheek muthyala [kartheek0274@gmail.com]
Sent: Tuesday, November 29, 2011 10:20 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Generation Stamp

Uma, first of all thanks for the detailed exemplified explanation.

So to confirm, the primary use of this generation stamp is to ensure consistency
of the block? So, when the pipeline fails at DN3 and the client invokes recovery, the
NN will choose DN1 to complete the pipeline. DN1 first updates its metafile with the
new generation stamp, and then passes this information to the other replica at DN2. Later,
when the NN sees that this particular block is under-replicated, it assigns some other
DNa and asks either DN1 or DN2 to replicate the block to DNa.


On Tue, Nov 29, 2011 at 8:10 PM, Uma Maheswara Rao G <maheswara@huawei.com> wrote:

The generation stamp is basically used to keep track of replica states.

 Consider one scenario where the generation stamp is used:

  Create a file which has one block. The client starts writing that block to DN1, DN2, DN3
(the pipeline).

After writing some data, DN3 fails, and the client gets an exception about the pipeline failure.
The client then handles that exception (you can see it in processDatanodeError in the DataStreamer
thread). It removes DN3 and calls recovery for that block with a new generation
stamp. The NN then chooses one primary DN and assigns it the block synchronization work. The
primary DN ensures that all the remaining replica lengths are the same (if required, it
truncates them to a consistent length) and invokes commitBlockSynchronization. Then the remaining
data transfer resumes.
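The recovery steps above can be sketched roughly as follows. This is a simplified model of the idea, not the actual HDFS code; names like `Replica` and `recover_pipeline` are illustrative assumptions:

```python
# Illustrative sketch of pipeline recovery: drop the failed DN, truncate
# the surviving replicas to a consistent length, and record the new
# generation stamp on each survivor. Not the real HDFS implementation.

class Replica:
    def __init__(self, dn, length, gen_stamp):
        self.dn = dn                # datanode holding this replica
        self.length = length        # bytes written so far
        self.gen_stamp = gen_stamp  # current generation stamp

def recover_pipeline(replicas, failed_dn, new_gen_stamp):
    """Remove the failed DN from the pipeline, truncate the remaining
    replicas to the minimum common length, and bump their genstamp."""
    survivors = [r for r in replicas if r.dn != failed_dn]
    min_len = min(r.length for r in survivors)
    for r in survivors:
        r.length = min_len            # truncate to a consistent length
        r.gen_stamp = new_gen_stamp   # replicas now carry the new stamp
    return survivors

# DN3 fell behind, then failed; recovery assigns genstamp 1234.
pipeline = [Replica("DN1", 4096, 1233),
            Replica("DN2", 3000, 1233),
            Replica("DN3", 2048, 1233)]
pipeline = recover_pipeline(pipeline, "DN3", 1234)
print([(r.dn, r.length, r.gen_stamp) for r in pipeline])
```

After this, the write resumes on DN1 and DN2 only, both now at the same length and generation stamp.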

 Now the block will have a new generation stamp. You can observe this in the metadata file for
that block on the DN.

Now the block files will look like blk_12345634444 and blk_12345634444_1234.meta.

here 1234 is the generation timestamp.
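As a quick illustration, the generation stamp can be read straight out of the meta file name, assuming the blk_<blockId>_<genStamp>.meta naming shown above (the helper name here is made up for this example):

```python
import re

def parse_meta_name(name):
    """Extract (block_id, generation_stamp) from a DN meta file name
    of the form blk_<blockId>_<genStamp>.meta."""
    m = re.match(r"blk_(\d+)_(\d+)\.meta$", name)
    if m is None:
        raise ValueError("not a block meta file: %s" % name)
    return int(m.group(1)), int(m.group(2))

print(parse_meta_name("blk_12345634444_1234.meta"))  # (12345634444, 1234)
```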

Assume a case where, after resuming the write, DN2 also fails. Recovery starts again
and gets a new generation stamp. Now only DN1 is in the pipeline, and the block files are
blk_12345634444 and blk_12345634444_1235.meta. The remaining data writes resume and the
last packet is completed. With the last packet the block is finalized. DN1 finalizes the block
successfully, sends a blockReceived command, and the block info is updated in the blocks
map. Now assume DN2 comes back and reports that old block to the NN. The NN can find
that the generation stamp of that replica is less than the genstamp of the block DN1 reported, so it
can take the decision: it can reject the replica with the lower generation stamp.

You can see this code in FSNamesystem#addStoredBlock. Of course there are many more conditions
there, like length mismatch, etc.
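The stale-replica decision can be sketched like this. It is a minimal model of the genstamp comparison described above, not the actual addStoredBlock code; the blocks-map shape and function name are assumptions:

```python
# Illustrative NN-side check: a replica reported with an older generation
# stamp than the one recorded in the blocks map is stale and rejected.
# Simplified model; the real addStoredBlock handles many more cases.

blocks_map = {12345634444: {"gen_stamp": 1235, "locations": {"DN1"}}}

def add_stored_block(blocks_map, block_id, gen_stamp, dn):
    """Return True if the reported replica is accepted into the blocks map."""
    entry = blocks_map.get(block_id)
    if entry is None:
        return False  # unknown block; the real NN handles this separately
    if gen_stamp < entry["gen_stamp"]:
        return False  # stale replica: reject the older generation stamp
    entry["locations"].add(dn)
    return True

# DN2 comes back reporting the old genstamp 1234: rejected.
print(add_stored_block(blocks_map, 12345634444, 1234, "DN2"))  # False
# A replica with the current genstamp 1235 is accepted.
print(add_stored_block(blocks_map, 12345634444, 1235, "DN2"))  # True
```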

Hope it will help you....



From: kartheek muthyala [kartheek0274@gmail.com]
Sent: Tuesday, November 29, 2011 7:44 PM
To: hdfs-user
Subject: Generation Stamp

Hi all,
Why is there the concept of a Generation Stamp that gets tagged to the metadata of the
block? How is it useful? I have seen that in the HDFS current directory, the metafiles are
tagged with this generation stamp. Does this keep track of versioning?
