hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-6231) DFSClient hangs infinitely if using hedged reads and all eligible datanodes die.
Date Thu, 10 Apr 2014 20:56:16 GMT

     [ https://issues.apache.org/jira/browse/HDFS-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chris Nauroth updated HDFS-6231:

    Attachment: HDFS-6231.1.patch

I found this problem by observing runs of {{TestPread}} that were hanging.  It turns out
that on most fast machines, {{TestPread}} never actually triggers a hedged read.
The initial read completes before the hedged read threshold, so no hedged read is issued.  On one
of my slower VMs, I was seeing the test hang.  I was then able to repro even on my fast machines
by aggressively down-tuning the hedged read threshold.
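
As an illustration of why the threshold decides whether a hedged read happens at all, here is a minimal sketch using only {{java.util.concurrent}}.  It is not the {{DFSInputStream}} code: the class {{HedgedReadSketch}}, the method {{requestsIssued}}, and the simulated latencies are all hypothetical names made up for this example.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the hedged read idea; not Hadoop code.
public class HedgedReadSketch {

    /**
     * Simulates one read. Returns how many requests were issued:
     * 1 if the primary read beat the threshold, 2 if a hedged read was added.
     */
    static int requestsIssued(long primaryLatencyMs, long thresholdMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletionService<String> reads = new ExecutorCompletionService<>(pool);

            // Primary read, with a simulated latency.
            reads.submit(() -> {
                Thread.sleep(primaryLatencyMs);
                return "primary";
            });

            // Wait for the primary read, but only up to the threshold.
            Future<String> first = reads.poll(thresholdMs, TimeUnit.MILLISECONDS);
            if (first != null) {
                first.get();
                return 1; // fast path: no hedged read is needed
            }

            // Threshold exceeded: issue a second (hedged) read and take
            // whichever read completes first.
            reads.submit(() -> "hedged");
            reads.take().get();
            return 2;
        } finally {
            pool.shutdownNow(); // cancel the read that lost the race
        }
    }
}
```

With a generous threshold the second request is never issued, which matches what happens in {{TestPread}} on a fast machine.  With a very small threshold the hedged path runs every time.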

Here is a patch to fix the bug.
# {{DFSInputStream#getFromOneDataNode}}: This was the main problem.  The returned {{Callable}}
needs to release a {{CountDownLatch}}, but it was doing so only in the success case, not in
the failure case.  I changed it to release the latch inside a finally clause.
# {{DFSInputStream#hedgedFetchBlockByteRange}}: After I applied the first change, it exposed
another problem here.  If all datanodes die, then we need to refetch block locations from
the NameNode.  That wasn't happening, because this code used the condition {{futures == null}}
to decide whether or not to refetch block locations via a call to {{chooseDataNode}}.  After
a hedged read has been issued, {{futures}} is always non-null, so this wasn't sufficient.
I changed the code to check for empty {{futures}}.  The reason this works is that {{getFirstToComplete}}
removes failed futures from the list.  This means that if all datanodes die, then {{futures}}
drops back to an empty list, and then we go into {{chooseDataNode}} to refetch block locations.
# In {{TestPread}}, I downtuned the hedged read threshold a lot so that this test really does
issue hedged reads even on fast machines.  That ought to help us catch regressions in the
future.  Now that hedged reads are really happening during the test runs, I found that I needed
to reset the metrics counts in order to satisfy some assertions.  This is required because
the metrics instance is static/global.
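
The first two changes in the list above can be sketched in isolation.  The code below is a simplified model, not the patch itself: {{HedgedReadFixSketch}}, {{buggyRead}}, {{fixedRead}}, {{waiterReleased}}, and {{refetches}} are hypothetical names, the "datanode failure" is a thrown {{IOException}}, and a plain list of strings stands in for the list of futures.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the two fixes; not the actual DFSInputStream code.
public class HedgedReadFixSketch {

    // Change 1, before: the latch is released only when the read succeeds.
    static Callable<String> buggyRead(boolean fail, CountDownLatch latch) {
        return () -> {
            if (fail) {
                throw new IOException("simulated dead datanode");
            }
            latch.countDown();
            return "data";
        };
    }

    // Change 1, after: the finally clause releases the latch on both paths.
    static Callable<String> fixedRead(boolean fail, CountDownLatch latch) {
        return () -> {
            try {
                if (fail) {
                    throw new IOException("simulated dead datanode");
                }
                return "data";
            } finally {
                latch.countDown();
            }
        };
    }

    // Runs one failing read and reports whether a thread waiting on the
    // latch is released within 200 ms.
    static boolean waiterReleased(boolean useFix) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch latch = new CountDownLatch(1);
            Callable<String> task = useFix ? fixedRead(true, latch) : buggyRead(true, latch);
            pool.submit(task);
            return latch.await(200, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    // Change 2: counts how often block locations would be refetched over
    // `attempts` rounds in which every datanode fails.
    static int refetches(boolean checkEmpty, int attempts) {
        List<String> futures = null; // stand-in for the list of in-flight futures
        int refetchCount = 0;
        for (int i = 0; i < attempts; i++) {
            boolean chooseNewNode = checkEmpty
                ? (futures == null || futures.isEmpty())
                : futures == null;
            if (chooseNewNode) {
                refetchCount++; // stand-in for the chooseDataNode call
                if (futures == null) {
                    futures = new ArrayList<>();
                }
                futures.add("read-" + i);
            }
            futures.clear(); // failed futures are dropped from the list
        }
        return refetchCount;
    }
}
```

In this model, {{waiterReleased(false)}} times out because the failing read never counts the latch down, while {{waiterReleased(true)}} returns promptly.  Likewise {{refetches(false, 3)}} returns 1 because the null check fires only once, while {{refetches(true, 3)}} returns 3 because the empty list is detected on every round.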

I've had multiple successful test runs of {{TestPread}} with this patch on both my fast Mac
and my slow Windows VM.

> DFSClient hangs infinitely if using hedged reads and all eligible datanodes die.
> --------------------------------------------------------------------------------
>                 Key: HDFS-6231
>                 URL: https://issues.apache.org/jira/browse/HDFS-6231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-6231.1.patch
> When using hedged reads, and all eligible datanodes for the read get flagged as dead
> or ignored, then the client is supposed to refetch block locations from the NameNode to retry
> the read.  Instead, we've seen that the client can hang indefinitely.

This message was sent by Atlassian JIRA
