hadoop-hdfs-issues mailing list archives

From "Eric Sirianni (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-5380) NameNode returns stale block locations to clients during excess replica pruning
Date Thu, 17 Oct 2013 16:20:41 GMT

     [ https://issues.apache.org/jira/browse/HDFS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Sirianni updated HDFS-5380:

    Attachment: ExcessReplicaPruningTest.java

JUnit test that demonstrates this issue using {{MiniDFSCluster}}

> NameNode returns stale block locations to clients during excess replica pruning
> -------------------------------------------------------------------------------
>                 Key: HDFS-5380
>                 URL: https://issues.apache.org/jira/browse/HDFS-5380
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha, 1.2.1
>            Reporter: Eric Sirianni
>            Priority: Minor
>         Attachments: ExcessReplicaPruningTest.java
> Consider the following contrived example:
> {code}
> // Step 1: Create file with replication factor = 2
> Path path = ...;
> short replication = 2;
> OutputStream os = fs.create(path, ..., replication, ...);
> // Step 2: Write to file
> os.write(...);
> // Step 3: Reduce replication factor to 1
> fs.setReplication(path, (short) 1);
> // Wait for namenode to prune excess replicas
> // Step 4: Read from file
> InputStream is = fs.open(path);
> is.read(...);
> {code}
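The behavior behind the failure in Step 4 can be illustrated with a small standalone model (plain Java, no Hadoop dependencies; the class, method, and node names are illustrative, not the real {{BlockManager}}/{{BlocksMap}} API): the replica chosen for excess pruning is deleted on the DataNode, but the NameNode's map still lists it, so location lookups return a stale node.

```java
import java.util.*;

// Standalone model of the current behavior described above.
public class StaleLocationSketch {
    // NameNode's view: blockId -> DataNodes listed in the (simulated) BlocksMap.
    static Map<String, Set<String>> blocksMap = new HashMap<>();
    // DataNode reality: which nodes actually still hold the replica.
    static Map<String, Set<String>> actualReplicas = new HashMap<>();

    static void pruneExcessReplica(String block, String dn) {
        // The DataNode deletes its replica...
        actualReplicas.get(block).remove(dn);
        // ...but the BlocksMap is NOT updated until the next block report,
        // so the map still lists dn as a valid location.
    }

    static Set<String> getBlockLocations(String block) {
        return blocksMap.get(block); // may include stale locations
    }

    public static void main(String[] args) {
        blocksMap.put("blk_1", new HashSet<>(List.of("dn1", "dn2")));
        actualReplicas.put("blk_1", new HashSet<>(List.of("dn1", "dn2")));
        pruneExcessReplica("blk_1", "dn1");
        // The client's read still gets dn1 as a location, but the replica
        // there is already gone -- the connect to dn1 will fail.
        System.out.println(getBlockLocations("blk_1").contains("dn1"));   // true
        System.out.println(actualReplicas.get("blk_1").contains("dn1"));  // false
    }
}
```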
> During the read in _Step 4_, the {{DFSInputStream}} client receives "stale" block locations
from the NameNode.  Specifically, it receives block locations that the NameNode has already
pruned/invalidated (and the DataNodes have already deleted).
> The net effect of this is unnecessary churn in the {{DFSClient}} (timeouts, retries,
extra RPCs, etc.).  In particular:
> {noformat}
> WARN  hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.
> {noformat}
> The blacklisting of DataNodes that are, in fact, functioning properly can lead to poor
read locality.  Since the blacklist is _cumulative_ across all blocks in the file, this
can have a noticeable impact for large files.
> A pathological case can occur when *all* block locations are in the blacklist.  In this
case, the {{DFSInputStream}} will sleep and refetch locations from the NameNode, causing unnecessary
RPCs and a client-side sleep:  
> {noformat}
> INFO  hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException:
No live nodes contain current block. Will get new block locations from namenode and retry...
> {noformat}
> This pathological case can occur in the following example (for a read of file {{foo}}):
> # {{DFSInputStream}} attempts to read block 1 of {{foo}}.
> # Gets locations: {{( dn1(stale), dn2 )}}
> # Attempts read from {{dn1}}.  Fails.  Adds {{dn1}} to blacklist.
> # {{DFSInputStream}} attempts to read block 2 of {{foo}}.
> # Gets locations: {{( dn1, dn2(stale) )}}
> # Attempts read from {{dn2}} ({{dn1}} already blacklisted).  Fails.  Adds {{dn2}} to blacklist.
> # All locations for block 2 are now in blacklist.
> # Clears blacklists
> # Sleeps up to 3 seconds
> # Refetches locations from the NameNode
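The cumulative-blacklist behavior in the steps above can be reproduced with a small standalone sketch (plain Java, no Hadoop dependencies; the block names, node names, and staleness flags are illustrative). Block 2's read exhausts all locations because the node blacklisted while reading block 1 carries over, forcing a blacklist clear and a refetch from the NameNode:

```java
import java.util.*;

public class BlacklistSketch {
    // Locations per block, as returned by the (simulated) NameNode.
    static Map<String, List<String>> locations = Map.of(
        "block1", List.of("dn1", "dn2"),
        "block2", List.of("dn1", "dn2"));
    // Stale (block, node) pairs: replicas already deleted on the DataNode.
    static Set<String> stale = Set.of("block1:dn1", "block2:dn2");
    static Set<String> deadNodes = new HashSet<>(); // cumulative across blocks
    static int refetches = 0;

    static String readBlock(String block) {
        while (true) {
            for (String dn : locations.get(block)) {
                if (deadNodes.contains(dn)) continue;   // already blacklisted
                if (stale.contains(block + ":" + dn)) {
                    deadNodes.add(dn);                  // "add to deadNodes"
                    continue;
                }
                return dn;                              // successful read
            }
            // All locations blacklisted: clear, sleep (elided), refetch.
            deadNodes.clear();
            refetches++;
        }
    }

    public static void main(String[] args) {
        System.out.println(readBlock("block1")); // dn2 (dn1 blacklisted)
        System.out.println(readBlock("block2")); // dn1, but only after a refetch
        System.out.println(refetches);           // 1
    }
}
```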
> A solution would be to change the NameNode to not return stale block locations to clients
for replicas that it knows it has asked DataNodes to invalidate.
> A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems to indicate
that the NameNode does not actually remove the pruned replica from the {{BlocksMap}} until the
subsequent blockReport is received.  This can leave a substantial window during which the NameNode
can return stale replica locations to clients.
> If the NameNode were to proactively update the {{BlocksMap}} upon excess replica pruning,
this situation could be avoided.  If the DataNode did not in fact invalidate the replica as
asked, the NameNode would simply re-add the replica to the {{BlocksMap}} upon next blockReport
and go through the pruning exercise again.
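A minimal standalone model of the proposed behavior (again illustrative names, not the actual {{BlockManager}} API): drop the replica from the map as soon as the invalidation is issued, and rely on the next block report to re-add it if the DataNode never actually deleted it.

```java
import java.util.*;

// Illustrative model of the proposed fix: prune proactively, self-heal
// via the next block report if the invalidation did not happen.
public class BlocksMapSketch {
    // blockId -> DataNodes believed to hold a replica
    static Map<String, Set<String>> blocksMap = new HashMap<>();

    static void chooseExcessReplica(String block, String dn) {
        // Ask dn to invalidate, and proactively drop it from the map so
        // clients stop receiving it as a location.
        blocksMap.get(block).remove(dn);
    }

    static void onBlockReport(String block, Set<String> reported) {
        // If the DataNode did not invalidate as asked, the replica simply
        // reappears here and can be pruned again on the next pass.
        blocksMap.computeIfAbsent(block, k -> new HashSet<>()).addAll(reported);
    }

    public static void main(String[] args) {
        blocksMap.put("blk_1", new HashSet<>(List.of("dn1", "dn2")));
        chooseExcessReplica("blk_1", "dn1");
        System.out.println(blocksMap.get("blk_1")); // [dn2] -- no stale location
        // DataNode ignored the invalidation; block report re-adds it:
        onBlockReport("blk_1", Set.of("dn1", "dn2"));
        System.out.println(blocksMap.get("blk_1")); // both nodes; pruning repeats
    }
}
```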

This message was sent by Atlassian JIRA
