hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2832) Enable support for heterogeneous storages in HDFS
Date Tue, 05 Nov 2013 09:02:29 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813784#comment-13813784 ]

Konstantin Shvachko commented on HDFS-2832:
-------------------------------------------

> UUID#randomUUID generates RFC-4122 compliant UUIDs which are unique *for all practical
purposes*

RFC-4122 has a special note about "distributed applications". But let's just think about it
in general.
randomUUID is based on a pseudo-random sequence of numbers, which is like a Mobius strip or
just a loop. It works well if you generate IDs on a single node, because the sequence runs
for a long time without repeating. In our case we start thousands of pseudo-random sequences
(one per node), each from a random starting number. Let's mark those starting numbers on the
loop. We have then actually decreased the probability of uniqueness, because to get a
collision one of the nodes now only needs to reach the starting point of another node, rather
than go all the way around the loop. So in a distributed environment we increase the
probability of collision with each new node added. And when you add more storage types per
node you further increase the collision probability.
"For all practical purposes", as I understand it in this case, means that the probability of
non-unique IDs is low. But it does not mean impossible. The consequences of a storageID
collision are pretty bad, and hard to detect and recover from. At the same time
{{DataNode.createNewStorageId()}} generates unique IDs as of today. Why change it to a
problematic approach?
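
To put a number on "low but not impossible", here is a rough birthday-bound estimate. It is
only a sketch under one big assumption: every ID draws 122 independent, uniformly random bits
(the random part of a version-4 UUID), which is exactly the idealization questioned above.
The class name, method name and cluster size are made up for illustration.

```java
public class UuidCollisionEstimate {
    // Number of random bits in a version-4 (random) UUID per RFC 4122.
    static final int RANDOM_BITS = 122;

    /**
     * Birthday-bound approximation p ~= n*(n-1)/2 / 2^bits.
     * Only meaningful while the result is much smaller than 1, and only
     * if the IDs really are independent and uniformly random (assumption).
     */
    static double collisionProbability(double n, int bits) {
        return n * (n - 1) / 2.0 / Math.pow(2.0, bits);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 10,000 nodes with 12 storages each.
        double ids = 10_000.0 * 12;
        System.out.printf("n=%.0f p~=%.3e%n", ids,
                collisionProbability(ids, RANDOM_BITS));
    }
}
```

The estimate grows roughly with the square of the number of IDs, so more nodes and more
storages per node do raise it. It says nothing about correlated or poorly seeded generators.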

> Part of the rationale is in HDFS-5115. Making them UUIDs simplifies the generation logic.

Looks like HDFS-5115 was based on an incomplete assumption:
bq. The Storage ID is currently generated from the DataNode's IP+Port+Random components
while in fact it also includes currentTime, which guarantees the uniqueness of IDs generated
on the same node, unless somebody sets the machine clock back to the past.
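
For illustration, an ID built from the four components named above (Random, IP, Port,
currentTime) could look roughly like this. This is a sketch, not the actual
{{DataNode.createNewStorageId()}} code: the class name, method signature and the exact string
layout are assumptions.

```java
import java.security.SecureRandom;

public class StorageIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /**
     * Combines a random number, the node's IP, its port and a timestamp.
     * The "DS-..." layout is illustrative only (assumption).
     */
    static String newStorageId(String ip, int port, long nowMillis) {
        int rand = RANDOM.nextInt(Integer.MAX_VALUE);
        return "DS-" + rand + "-" + ip + "-" + port + "-" + nowMillis;
    }

    public static void main(String[] args) {
        // Hypothetical address and port, chosen only for the example.
        System.out.println(
                newStorageId("10.0.0.1", 50010, System.currentTimeMillis()));
    }
}
```

With such a layout, two IDs generated on the same node (same IP and port) at different
millisecond timestamps differ regardless of the random part, as long as the clock does not
go backwards.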

> Enable support for heterogeneous storages in HDFS
> -------------------------------------------------
>
>                 Key: HDFS-2832
>                 URL: https://issues.apache.org/jira/browse/HDFS-2832
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>    Affects Versions: 0.24.0
>            Reporter: Suresh Srinivas
>            Assignee: Suresh Srinivas
>         Attachments: 20130813-HeterogeneousStorage.pdf, h2832_20131023.patch, h2832_20131023b.patch,
h2832_20131025.patch, h2832_20131028.patch, h2832_20131028b.patch, h2832_20131029.patch, h2832_20131103.patch,
h2832_20131104.patch
>
>
> HDFS currently supports a configuration where storages are a list of directories. Typically
each of these directories corresponds to a volume with its own file system. All these directories
are homogeneous and are therefore identified as a single storage at the namenode. I propose changing
the current model, where a Datanode *is a* storage, to one where a Datanode *is a collection* of storages.




--
This message was sent by Atlassian JIRA
(v6.1#6144)
