hadoop-hdfs-issues mailing list archives

From "Eric Sirianni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5318) Support read-only and read-write paths to shared replicas
Date Wed, 22 Jan 2014 21:52:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879232#comment-13879232 ]

Eric Sirianni commented on HDFS-5318:

Arpit - thanks for the quick feedback.

Unless otherwise specified, I have made the changes as you suggested.

bq. I think it would be a good idea to rename READ_ONLY to READ_ONLY_SHARED everywhere (including
comments) and document the meaning. What do you think?
I agree, and have made the change in my new patch.

bq. Remove lines with whitespace-only changes.
I only saw one such instance (in {{SimulatedFsDataset}}) and have corrected it.

bq. I am not sure if this change affects lease recovery. It would be good if someone familiar
with how lease recovery works can take a quick look at the approach. We don't want read-only
replicas to participate in recovery. What if all read-write replicas are unavailable?
In our {{FsDatasetSpi}} plugin implementation, replicas are not exposed from {{READ_ONLY_SHARED}}
storages until they are _finalized_ (by "exposed" I mean returned from incremental and full
block reports).  This skirts many issues around lease recovery.  However:
* The NameNode should not rely on this implementation detail of our {{FsDatasetSpi}} plugin
* Finalized replicas can still participate in certain lease recovery scenarios

I looked a bit into how the recovery locations are chosen.  The locations used in lease recovery
come from {{BlockInfoUnderConstruction.getExpectedStorageLocations()}}.  That list is potentially
updated with {{READ_ONLY_SHARED}} locations when a {{BLOCK_RECEIVED}} is processed from a
{{READ_ONLY_SHARED}} storage.  

It is fairly simple to exclude read-only replicas when {{DatanodeManager}} builds the {{BlockRecoveryCommand}}
(much like stale locations are handled).  However, as you point out, it's unclear what to
do here if there are *zero* read-write replicas available.  I believe the desired behavior
is for recovery to fail and the block to be discarded (we can't really make any stronger guarantees,
given the lack of fsync()'d data on read-only nodes).  I think we would need to short-circuit
recovery earlier in this case (perhaps in {{FSNamesystem.internalReleaseLease()}}), since it's
probably too late by the time {{DatanodeManager}} builds the {{BlockRecoveryCommand}}.  Am
I wrong here?
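To make the filtering concrete, here is a rough Java sketch (stand-in types and names, not the actual {{DatanodeManager}} code) of excluding read-only storages when choosing recovery targets; an empty result corresponds to the zero-read-write case that would need the earlier short-circuit:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: filter out READ_ONLY_SHARED storages when building the
// list of lease recovery targets, much like stale locations are excluded.
// StorageInfo is a stand-in, not the real HDFS class.
public class RecoveryTargetFilter {
    static class StorageInfo {
        final String storageId;
        final boolean readOnlyShared;
        StorageInfo(String storageId, boolean readOnlyShared) {
            this.storageId = storageId;
            this.readOnlyShared = readOnlyShared;
        }
    }

    // Keep only read-write storages as candidate recovery targets.
    static List<StorageInfo> chooseRecoveryTargets(List<StorageInfo> expected) {
        List<StorageInfo> targets = new ArrayList<>();
        for (StorageInfo s : expected) {
            if (!s.readOnlyShared) {
                targets.add(s);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        List<StorageInfo> expected = new ArrayList<>();
        expected.add(new StorageInfo("S_RW", false));
        expected.add(new StorageInfo("S_RO", true));
        List<StorageInfo> targets = chooseRecoveryTargets(expected);
        // An empty targets list would mean recovery must be aborted
        // earlier (e.g. in internalReleaseLease) rather than attempted.
        System.out.println(targets.size());
    }
}
```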

Another alternative is to skip read-only storages in {{BlockInfoUnderConstruction.addReplicaIfNotPresent}}
(though this may have the undesired effect of no-opping block-received messages from read-only
storages).  Thoughts?
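As a rough sketch of that alternative (a stand-in class, not the real {{BlockInfoUnderConstruction}} API), the skip and its no-op side effect would look something like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of the alternative: refuse to record READ_ONLY_SHARED
// storages as expected replica locations for an under-construction block.
// ExpectedLocations is a stand-in, not the actual HDFS class.
public class ExpectedLocations {
    private final Map<String, Boolean> locations = new LinkedHashMap<>();

    // Returns true if the location was recorded, false if skipped.
    boolean addReplicaIfNotPresent(String storageId, boolean readOnlyShared) {
        if (readOnlyShared) {
            // Skipping keeps read-only replicas out of recovery, but it
            // effectively no-ops block-received from read-only storages.
            return false;
        }
        locations.putIfAbsent(storageId, Boolean.FALSE);
        return true;
    }

    int size() {
        return locations.size();
    }

    public static void main(String[] args) {
        ExpectedLocations el = new ExpectedLocations();
        System.out.println(el.addReplicaIfNotPresent("S_RW", false)); // true
        System.out.println(el.addReplicaIfNotPresent("S_RO", true));  // false
        System.out.println(el.size());
    }
}
```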

> Support read-only and read-write paths to shared replicas
> ---------------------------------------------------------
>                 Key: HDFS-5318
>                 URL: https://issues.apache.org/jira/browse/HDFS-5318
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.0
>            Reporter: Eric Sirianni
>         Attachments: HDFS-5318.patch, HDFS-5318a-branch-2.patch, HDFS-5318b-branch-2.patch,
> There are several use cases for using shared-storage for datanode block storage in an
HDFS environment (storing cold blocks on a NAS device, Amazon S3, etc.).
> With shared-storage, there is a distinction between:
> # a distinct physical copy of a block
> # an access-path to that block via a datanode.  
> A single 'replication count' metric cannot accurately capture both aspects.  However,
for most of the current uses of 'replication count' in the Namenode, the "number of physical
copies" aspect seems to be the appropriate semantic.
> I propose altering the replication counting algorithm in the Namenode to accurately infer
distinct physical copies in a shared storage environment.  With HDFS-5115, a {{StorageID}}
is a UUID.  I propose associating some minor additional semantics to the {{StorageID}} - namely
that multiple datanodes attaching to the same physical shared storage pool should report the
same {{StorageID}} for that pool.  A minor modification would be required in the DataNode
to enable the generation of {{StorageID}} s to be pluggable behind the {{FsDatasetSpi}} interface.
> With those semantics in place, the number of physical copies of a block in a shared storage
environment can be calculated as the number of _distinct_ {{StorageID}} s associated with
that block.
> Consider the following combinations for two {{(DataNode ID, Storage ID)}} pairs {{(DN_A,
S_A) (DN_B, S_B)}} for a given block B:
> * {{DN_A != DN_B && S_A != S_B}} - *different* access paths to *different* physical
replicas (i.e. the traditional HDFS case with local disks)
> ** &rarr; Block B has {{ReplicationCount == 2}}
> * {{DN_A != DN_B && S_A == S_B}} - *different* access paths to the *same* physical
replica (e.g. HDFS datanodes mounting the same NAS share)
> ** &rarr; Block B has {{ReplicationCount == 1}}
> For example, if block B has the following location tuples:
> * {{DN_1, STORAGE_A}}
> * {{DN_2, STORAGE_A}}
> * {{DN_3, STORAGE_B}}
> * {{DN_4, STORAGE_B}},
> the effect of this proposed change would be to calculate the replication factor in the
namenode as *2* instead of *4*.
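The counting rule in the quoted description can be sketched in Java (stand-in types; {{physicalCopies}} is a hypothetical helper, not NameNode code): the replication count is the number of _distinct_ {{StorageID}} s across the block's {{(DataNode ID, Storage ID)}} pairs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of the proposed counting rule: count distinct StorageIDs
// rather than (DataNode, Storage) pairs. Names mirror the example above.
public class DistinctStorageCount {
    static int physicalCopies(List<String[]> locations) {
        Set<String> storageIds = new HashSet<>();
        for (String[] pair : locations) {
            storageIds.add(pair[1]); // pair = {datanodeId, storageId}
        }
        return storageIds.size();
    }

    public static void main(String[] args) {
        List<String[]> locations = Arrays.asList(
            new String[]{"DN_1", "STORAGE_A"},
            new String[]{"DN_2", "STORAGE_A"},
            new String[]{"DN_3", "STORAGE_B"},
            new String[]{"DN_4", "STORAGE_B"});
        // Four access paths, but only two distinct physical copies.
        System.out.println(physicalCopies(locations));
    }
}
```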
