hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-898) Sequential generation of block ids
Date Thu, 14 Jan 2010 02:24:54 GMT

https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800083#action_12800083

Konstantin Shvachko commented on HDFS-898:

h3. Problem
Currently HDFS generates the id for a new block by randomly selecting a 64-bit number and verifying
that this id is not yet in the system. If the id is already assigned to one of the existing blocks,
the procedure repeats.
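The random scheme described above amounts to a pick-and-retry loop. A minimal illustrative sketch (class and field names are hypothetical, not the actual NameNode code):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of the current random id scheme: pick a random
// 64-bit number, retry on collision with an existing block id.
public class RandomBlockIdGenerator {
    private final Set<Long> usedIds = new HashSet<>();
    private final Random rand = new Random();

    /** Picks random 64-bit ids until one not yet in the system is found. */
    public long nextBlockId() {
        long id;
        do {
            id = rand.nextLong();
        } while (usedIds.contains(id));  // id already assigned: repeat
        usedIds.add(id);
        return id;
    }
}
```

Note that once an id leaves `usedIds` (the block is released), nothing prevents the same number from being handed out again later, which is exactly the reuse problem discussed below.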
This was discussed in several issues before: HADOOP-2656, HADOOP-146, HADOOP-158.
The problem with this approach is that data-nodes that have been offline for a long time may,
upon coming back online, bring in blocks whose ids have already been released by the system and
then regenerated and reused for different blocks in different files. These are the so-called
prehistoric blocks, see [here|https://issues.apache.org/jira/browse/HADOOP-2656?focusedCommentId=12590551&comment-tabpanel#action_12590551].
Bringing up a data-node with a prehistoric replica can potentially corrupt the block.

h3. Motivation
# The prehistoric block problem is still relevant, since the name-node's block-map keys blocks
by their ids, see HDFS-512. There is now less chance of corrupting a block with a stale replica,
though, because stale replicas will have older generation stamps.
# Non-reusable block ids may let us shrink generation stamps to 32-bit numbers. Currently the
generation stamp is a 64-bit global file-system variable. If we are sure block ids are never
reused, then we can implement per-file generation stamps, and 32 bits will be enough
for that.
# A forward-looking reason for switching to sequential ids is that if we ever introduce multiple
or distributed name-nodes, it will not be feasible to verify the existence of a particular
block id across all of them.

h3. Solution
This simple solution was born some time ago in a discussion with Nicholas and Rob.
Suppose you have a cluster with 64 million blocks (N = 2^26^). Block ids are 64-bit numbers, so
there are 2^64^ such numbers. Order the existing block ids b~1~, ..., b~N~, where b~i~ < b~i+1~.
Each i then defines a contiguous segment (b~i~, b~i+1~) of numbers that contains no other block
ids inside it. By the pigeonhole argument, at least one segment will be of size at least
2^38^ = 2^64^ / 2^26^.
So the proposal is to find such a segment and use it for generating block ids starting from
b~i~ + 1, incrementing the previously assigned id until it reaches b~i+1~.
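Finding such a segment reduces to sorting the existing ids and scanning for the widest gap between neighbors. An illustrative sketch (not NameNode code; assumes at least two existing ids):

```java
import java.util.Arrays;

// Illustrative sketch: find the widest open interval (b_i, b_{i+1})
// between consecutive sorted block ids. New ids would then be allocated
// sequentially starting from the interval's lower bound plus one.
public class LargestGapFinder {
    /** Returns {b_i, b_{i+1}} bounding the widest gap; requires length >= 2. */
    public static long[] findLargestGap(long[] blockIds) {
        long[] ids = blockIds.clone();
        Arrays.sort(ids);
        long bestStart = ids[0], bestEnd = ids[1];
        for (int i = 0; i + 1 < ids.length; i++) {
            if (ids[i + 1] - ids[i] > bestEnd - bestStart) {
                bestStart = ids[i];
                bestEnd = ids[i + 1];
            }
        }
        return new long[] { bestStart, bestEnd };
    }
}
```

With N existing ids this is an O(N log N) scan, which the name-node would need to run only once per segment, i.e. once every few decades under the rate estimate below.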
Currently we increment the generation stamp every time a file is created or a block write
fails. Suppose we issue 200 new generation stamps per second. This is rather optimistic: I think
in practice the number is lower. At that pace we can keep generating new ids for 43 years.
Then we will find another large gap among whatever remains of those 43-year-old blocks and
start using that new segment. I don't think anybody in this community will care what may happen
after the third segment is consumed, but HDFS will keep looking for new segments, and the expectation
is that very few blocks will outlive their creators.
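A back-of-the-envelope check of the 43-year figure (a 2^38^-id segment consumed at 200 ids per second; method name is illustrative):

```java
// Sanity-check the segment-lifetime estimate: 2^38 ids at 200 ids/second.
public class SegmentLifetime {
    /** Years until a segment of the given size is exhausted at the given rate. */
    public static double yearsToExhaust(long segmentSize, long idsPerSecond) {
        double seconds = (double) segmentSize / idsPerSecond;
        return seconds / (365.25 * 24 * 3600);  // seconds per Julian year
    }
}
```

2^38^ / 200 per second is about 1.37 billion seconds, i.e. roughly 43.6 years, matching the estimate above.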

As the next step I'll look at some real HDFS images and verify that practice confirms the

> Sequential generation of block ids
> ----------------------------------
>                 Key: HDFS-898
>                 URL: https://issues.apache.org/jira/browse/HDFS-898
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.22.0
> This is a proposal to replace random generation of block ids with a sequential generator
> in order to avoid block id reuse in the future.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
