hadoop-hdfs-issues mailing list archives

From "Eric Sirianni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5318) Pluggable interface for replica counting
Date Fri, 20 Dec 2013 16:34:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854130#comment-13854130 ]

Eric Sirianni commented on HDFS-5318:
-------------------------------------

bq. Eric, your patch is implementing existing NN functionality via SPIs and not adding the
proposed duplicate replica detection correct?
Correct - using a plugin interface to support this was requested by Suresh.  I will attach
my current implementation of the SPI for reference.

bq. 1. Is the Jira description still accurate i.e. will the storages have the same StorageID
regardless of which DN exposes it? If this is true then there will be multiple DatanodeStorageInfo
objects with the same StorageID on the NameNode.
Yes, this is our current design and implementation.  The combination of ({{DatanodeUUID}},
{{StorageID}}) is a unique key.
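To make those uniqueness semantics concrete, here is a minimal sketch (the class and field names are illustrative, not the actual NameNode types): two storages may report the same {{StorageID}}, but the ({{DatanodeUUID}}, {{StorageID}}) pair remains a unique key.

{code:java}
import java.util.Objects;

// Illustrative composite key: a storage is identified per-DataNode,
// so the same StorageID reported by two DataNodes yields two distinct keys.
final class StorageKey {
    final String datanodeUuid;
    final String storageId;

    StorageKey(String datanodeUuid, String storageId) {
        this.datanodeUuid = datanodeUuid;
        this.storageId = storageId;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof StorageKey)) return false;
        StorageKey k = (StorageKey) o;
        return datanodeUuid.equals(k.datanodeUuid) && storageId.equals(k.storageId);
    }

    @Override public int hashCode() {
        return Objects.hash(datanodeUuid, storageId);
    }

    public static void main(String[] args) {
        StorageKey a = new StorageKey("DN_A", "S_A");
        StorageKey b = new StorageKey("DN_B", "S_A");  // same storage, different DataNode
        System.out.println(a.equals(b));  // prints false: two access paths, two keys
    }
}
{code}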

bq. 2. Are you assuming that at most one DataNode will present the same physical storage as
being read-write? (It looks like you are)
Short term - yes.  The initial implementation of our plugin will only report the same physical
storage as read-write from one DataNode.
Longer term - no.  In order to support a resilient pipeline for append (i.e. multi-node pipeline
with repcount=1), we propose exposing the same storage as read-write via multiple DataNodes.
 As mentioned in HDFS-5434, this will require out-of-band coordination within our {{FsDataset}}
plugin to persist replicas to the shared storage in a consistent manner.

bq. 3. Which DataNode will be responsible for block reports/incremental block reports for
a given storage? I am not sure how block report processing will be affected.
All the DataNodes that share the storage send block reports for that storage.  This makes
the NameNode aware of the alternate read paths for the block, which it can then provide to
clients via {{getBlockLocations()}} to spread I/O load.

bq. 4. Will BlockInfo have one triplet for each logical replica? This means that the BlockInfo
will have multiple DatanodeStorage entries which have the same StorageID. I am not sure this
will work.
Yes.  This seems to work fairly robustly in our test environment.  Taking a cursory look at
the code, it appears that triplet lookup is done by {{DatanodeDescriptor}} and not {{StorageID}}:
{code:title=BlockInfo.java}
boolean addStorage(DatanodeStorageInfo storage) {
    int idx = findDatanode(storage.getDatanodeDescriptor());
    // ...
}
{code}
I did note, though, that the initial sizing of the triplets array as {{3*replication}} will
be off in our case (that's OK, since it will be resized later when necessary).

bq. 5. What is your intended behavior of getLocatedBlocks? Should one located block be exposed
for each logical replica?
Yes, the goal of replica sharing is to expose multiple DataNode read paths for a shared replica
to clients.

bq. I am concerned that we have not caught all the locations in the NameNode code where you
will need to plug in, and catching them all will increase code complexity for the common case.
This is a fair concern.  Refactoring the {{BlockManager}} data structures to track physical
replicas _directly_ as "first class objects" instead of via the DataNode that exposes them
is a somewhat invasive change.  I think it could be done in a manner that does not complicate
the common case.  The pluggable replica "normalizer" is a first step toward that.  I guess the
questions are:
# Is having shared replicas something HDFS wants to eventually support _directly_ (example
use cases being non-local-disk blob storage backends like OpenStack Swift or Amazon S3)
# If not directly supported by HDFS, is having a pluggable interface that allows for shared
replica semantics worthwhile?

Deferring this decision/work is reasonable, but I would like to revisit the broader change
in the longer term (with a more comprehensive design doc and analysis).  Thoughts?

> Pluggable interface for replica counting
> ----------------------------------------
>
>                 Key: HDFS-5318
>                 URL: https://issues.apache.org/jira/browse/HDFS-5318
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.0
>            Reporter: Eric Sirianni
>         Attachments: HDFS-5318.patch, hdfs-5318.pdf
>
>
> There are several use cases for using shared-storage for datanode block storage in an
HDFS environment (storing cold blocks on a NAS device, Amazon S3, etc.).
> With shared-storage, there is a distinction between:
> # a distinct physical copy of a block
> # an access-path to that block via a datanode.  
> A single 'replication count' metric cannot accurately capture both aspects.  However,
for most of the current uses of 'replication count' in the Namenode, the "number of physical
copies" aspect seems to be the appropriate semantic.
> I propose altering the replication counting algorithm in the Namenode to accurately infer
distinct physical copies in a shared storage environment.  With HDFS-5115, a {{StorageID}}
is a UUID.  I propose associating some minor additional semantics to the {{StorageID}} - namely
that multiple datanodes attaching to the same physical shared storage pool should report the
same {{StorageID}} for that pool.  A minor modification would be required in the DataNode
to enable the generation of {{StorageID}} s to be pluggable behind the {{FsDatasetSpi}} interface.
 
> With those semantics in place, the number of physical copies of a block in a shared storage
environment can be calculated as the number of _distinct_ {{StorageID}} s associated with
that block.
> Consider the following combinations for two {{(DataNode ID, Storage ID)}} pairs {{(DN_A,
S_A) (DN_B, S_B)}} for a given block B:
> * {{DN_A != DN_B && S_A != S_B}} - *different* access paths to *different* physical
replicas (i.e. the traditional HDFS case with local disks)
> ** &rarr; Block B has {{ReplicationCount == 2}}
> * {{DN_A != DN_B && S_A == S_B}} - *different* access paths to the *same* physical
replica (e.g. HDFS datanodes mounting the same NAS share)
> ** &rarr; Block B has {{ReplicationCount == 1}}
> For example, if block B has the following location tuples:
> * {{DN_1, STORAGE_A}}
> * {{DN_2, STORAGE_A}}
> * {{DN_3, STORAGE_B}}
> * {{DN_4, STORAGE_B}},
> the effect of this proposed change would be to calculate the replication factor in the
namenode as *2* instead of *4*.
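The proposed counting can be sketched as follows (illustrative code only, not the actual {{BlockManager}} implementation; the class and method names are hypothetical): the physical replica count reduces to the number of _distinct_ {{StorageID}} s among a block's location tuples.

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: count physical replicas as distinct StorageIDs.
final class ReplicaCounter {
    // Each location is a (DataNode ID, Storage ID) pair.
    static int physicalReplicaCount(List<String[]> locations) {
        Set<String> storageIds = new HashSet<>();
        for (String[] loc : locations) {
            storageIds.add(loc[1]);  // index 1 = StorageID
        }
        return storageIds.size();
    }

    public static void main(String[] args) {
        // The example tuples from the description above.
        List<String[]> locations = Arrays.asList(
            new String[]{"DN_1", "STORAGE_A"},
            new String[]{"DN_2", "STORAGE_A"},
            new String[]{"DN_3", "STORAGE_B"},
            new String[]{"DN_4", "STORAGE_B"});
        // Four access paths, but only two distinct physical copies.
        System.out.println(physicalReplicaCount(locations));  // prints 2
    }
}
{code}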



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
