hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-378) DFSClient should track failures by block rather than globally
Date Fri, 15 Jan 2010 06:50:54 GMT

    [ https://issues.apache.org/jira/browse/HDFS-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800588#action_12800588 ]

Todd Lipcon commented on HDFS-378:

Been thinking about this a bit tonight. It seems to me we have the following classes of errors
to deal with:

# A DN has died but the NN does not yet know about it. Thus, the client fails entirely to
connect to the DN. Ideally, the client shouldn't retry this DN for quite some time.
# A DN is heavily loaded (above its max.xcievers value) and thus the client is rejected. Here,
we'd like to retry the DN reasonably often, and ideally we don't want to fail a read completely,
even if all replicas are in this state for a short period of time.
# A particular replica is corrupt or missing on a DN. Here, we just want to avoid reading
this particular block from this DN until the block has been rereplicated from a healthy copy.

Case #3 above is actually handled implicitly with no long-term/inter-operation tracking on
the client, since the client will report the bad block to the NN immediately upon discovering
it. Then, on the next getBlockLocations call for the same block, it will automatically be
filtered out of the LocatedBlocks result by the NN. Once the block has been fixed up, the new valid
location will end up in the LocatedBlocks result (whether on the same DN or a different one).

Given this, I disagree with the Description of this issue - I don't think the client needs
to track failures by block, just by datanode, as long as checksum failures are handled differently
than connection or timeout failures.
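
To make that distinction concrete, here is a minimal Java sketch. Everything in it is hypothetical and for illustration only: the class ReadFailureRouter, the Kind enum, and the string keys are not existing DFSClient code, and the reportedReplicas set merely stands in for the bad-block report the client sends to the NN.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not an existing DFSClient class.
class ReadFailureRouter {
    enum Kind { CONNECT, TIMEOUT, CHECKSUM }

    // Per-datanode state: node -> timestamp (ms) of the last connect/timeout failure.
    final Map<String, Long> nodeFailures = new HashMap<>();
    // Stand-in for bad-replica reports sent to the NN ("block@node" strings).
    final Set<String> reportedReplicas = new HashSet<>();

    void onFailure(Kind kind, String node, String blockId, long nowMs) {
        if (kind == Kind.CHECKSUM) {
            // Bad replica: report it and let the NN filter it out of later
            // getBlockLocations results. The node itself is not penalized.
            reportedReplicas.add(blockId + "@" + node);
        } else {
            // Connection or timeout problem: remember it against the node.
            nodeFailures.put(node, nowMs);
        }
    }
}
```

The point of the split is that a checksum failure says something about one replica, while a connect or timeout failure says something about the whole node.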

The remaining question is how to handle both case 1 and case 2 above in a convenient manner.
Here's one idea:

Whenever the client fails to read from a datanode, the timestamp of the failure is recorded
in a map keyed by node. When a block is to be read, the list of locations is sorted by
ascending timestamp of last failure - thus the nodes that have had problems least recently
are retried first. Any node whose last failure is older than some threshold (e.g. 5 minutes)
is considered to have never failed and is removed from the map. Any node that has no recorded
failure info should be prioritized above any node that does have failure info.
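
As a rough illustration of the idea above, here is a minimal Java sketch. The class name DatanodeFailureTracker, its methods, and the use of plain strings as node keys are assumptions made for the example, not existing DFSClient code. The expiry threshold is passed in by the caller (5 minutes being the example value mentioned above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not an existing DFSClient class.
class DatanodeFailureTracker {
    private final long expiryMs;
    // node -> timestamp (ms) of the most recent failed read from that node
    private final Map<String, Long> lastFailure = new ConcurrentHashMap<>();

    DatanodeFailureTracker(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    void recordFailure(String node, long nowMs) {
        lastFailure.put(node, nowMs);
    }

    List<String> sortLocations(List<String> locations, long nowMs) {
        // Failures older than the threshold are forgotten: the node counts as never failed.
        lastFailure.values().removeIf(ts -> nowMs - ts > expiryMs);
        List<String> sorted = new ArrayList<>(locations);
        // Nodes with no failure info map to Long.MIN_VALUE and sort first; the
        // rest follow in ascending order of last failure. List.sort is stable,
        // so the order given by the NN is kept among nodes that tie.
        sorted.sort(Comparator.comparingLong(
                (String n) -> lastFailure.getOrDefault(n, Long.MIN_VALUE)));
        return sorted;
    }
}
```

A caller would invoke recordFailure when a connect or read attempt fails, and pass the locations for a block through sortLocations before choosing a node to read from.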

This should be fairly simple to implement without any protocol changes, and also easy to reason
about. The map would ideally be DFSClient-wide so applications that use a lot of separate
InputStreams won't use a lot of extra memory, and can share their view of the DN health.

One possible improvement on the above is to use datanode heartbeat times to distinguish between
case 1 and case 2. Specifically, a "relativeLastHeartbeat" field could be added to LocatedBlocks
for each datanode. The client can then use this information to remove failure info for any DN
whose failures were recorded before the last heartbeat. Thus, it will retry heavily loaded
nodes once per heartbeat interval, but won't retry nodes that have actually failed. The downside
is that this would require a protocol change, and be harder to reason about for cases like
network partitions where a DN is heartbeating fine but some set of clients can't connect to it.
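
A sketch of how the client-side pruning might look in Java, under stated assumptions: relativeLastHeartbeat is the hypothetical field proposed above (it does not exist in the protocol today), assumed here to carry the number of milliseconds since the NN last heard from each DN. The class and method names are likewise made up for the example.

```java
import java.util.Map;

// Hypothetical illustration of the heartbeat-based refinement.
class HeartbeatAwarePruner {
    // Drops failure records that predate the node's last heartbeat.
    // lastFailure:             node -> timestamp (ms) of last failed read
    // relativeLastHeartbeatMs: node -> ms since the NN last heard from the node (assumed field)
    static void prune(Map<String, Long> lastFailure,
                      Map<String, Long> relativeLastHeartbeatMs,
                      long nowMs) {
        lastFailure.entrySet().removeIf(e -> {
            Long rel = relativeLastHeartbeatMs.get(e.getKey());
            // The node heartbeated after the failure was seen, so it is
            // probably alive and merely loaded: allow a retry.
            return rel != null && e.getValue() < nowMs - rel;
        });
    }
}
```

Nodes for which no heartbeat info is available keep their failure record, so the plain timestamp threshold would still apply to them.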

Looking forward to hearing people's thoughts.

> DFSClient should track failures by block rather than globally
> -------------------------------------------------------------
>                 Key: HDFS-378
>                 URL: https://issues.apache.org/jira/browse/HDFS-378
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chris Douglas
> Rather than tracking the total number of times DFSInputStream failed to locate a datanode
> for a particular block, such failures and the list of datanodes involved should be scoped
> to individual blocks. In particular, the "deadnode" list should be a map of blocks to a list
> of failed nodes, the latter reset and the nodes retried per the existing semantics.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
