hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4633) TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly
Date Mon, 25 Mar 2013 19:53:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613026#comment-13613026 ]

Chris Nauroth commented on HDFS-4633:
-------------------------------------

Here are some additional details.  There is a bad interaction between the 1-second cache expiration
used by {{TestDFSClientExcludedNodes#testExcludedNodesForgiveness}} and the exclusion/retry
logic within {{DFSOutputStream#DataStreamer#nextBlockOutputStream}}.  Here is the sequence
of events I observed during a failed test run.  Assume 3 data nodes named dn1, dn2, and dn3.

# DFSOutputStream writes first block to [dn1, dn2, dn3].
# Test stops data nodes [dn1, dn2].
# DFSOutputStream attempts writing second block to [dn1, dn2, dn3].  The write to dn1 fails,
so it marks dn1 excluded.
# DFSOutputStream retries and attempts writing second block to [dn2, dn3].  The write to dn2
fails, so it marks dn2 excluded.
# DFSOutputStream retries, but by now, more than 1 second has elapsed since dn1 failed.  dn1
gets evicted from the cache, and the stream attempts writing second block to [dn1, dn3].  This
fails again, so it marks dn1 excluded again.
# DFSOutputStream retries, but by now, more than 1 second has elapsed since dn2 failed.  dn2
gets evicted from the cache, and the stream attempts writing second block to [dn2, dn3].  This
fails again, so it marks dn2 excluded again.
# At this point, {{DFSOutputStream#DataStreamer#nextBlockOutputStream}} has exceeded the maximum
number of block write retries (3).  It aborts and throws {{IOException}} with "Unable to create new block.".
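
The sequence above can be reproduced with a small simulation.  This is only a sketch under stated assumptions, not the actual {{DFSOutputStream}} code: the class and method names are hypothetical, the clock is simulated, each failed attempt is assumed to take a fixed 1100 ms, and a failed node is assumed to be marked excluded at the moment its attempt fails.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the exclusion/retry interaction; not HDFS code.
class ExcludedNodesSimulation {
    static final List<String> NODES = Arrays.asList("dn1", "dn2", "dn3");
    static final Set<String> STOPPED = new HashSet<>(Arrays.asList("dn1", "dn2"));

    // Returns the pipeline chosen on each attempt for the second block.
    static List<List<String>> simulate(long expiryMs, long attemptMs, int maxRetries) {
        Map<String, Long> excludedAt = new LinkedHashMap<>(); // node -> time it was excluded
        List<List<String>> attempts = new ArrayList<>();
        long now = 0; // simulated clock, in milliseconds

        // One initial attempt plus maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            final long t = now;
            // "Forgiveness": evict exclusions older than the expiry interval.
            excludedAt.values().removeIf(markedAt -> t - markedAt > expiryMs);

            // Build the pipeline from every node that is not currently excluded.
            List<String> pipeline = new ArrayList<>();
            for (String node : NODES) {
                if (!excludedAt.containsKey(node)) {
                    pipeline.add(node);
                }
            }
            attempts.add(pipeline);

            now += attemptMs; // assumed cost of one attempt
            String bad = null;
            for (String node : pipeline) {
                if (STOPPED.contains(node)) {
                    bad = node; // the first stopped node in the pipeline fails the write
                    break;
                }
            }
            if (bad == null) {
                return attempts; // every node in the pipeline is alive
            }
            excludedAt.put(bad, now);
        }
        return attempts; // retries exhausted
    }

    // True if the last attempted pipeline contained no stopped node.
    static boolean writeSucceeded(List<List<String>> attempts) {
        List<String> last = attempts.get(attempts.size() - 1);
        for (String node : last) {
            if (STOPPED.contains(node)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1000, 1100, 3));  // 1-second expiry, as in the test
        System.out.println(simulate(10000, 1100, 3)); // longer expiry, for comparison
    }
}
```

With the 1-second expiry, the model tries [dn1, dn2, dn3], [dn2, dn3], [dn1, dn3], and [dn2, dn3], then runs out of retries, matching the steps listed above.  With an expiry longer than the whole retry sequence, both stopped nodes stay excluded and the third attempt goes to [dn3] alone.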

                
> TestDFSClientExcludedNodes fails sporadically if excluded nodes cache expires too quickly
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-4633
>                 URL: https://issues.apache.org/jira/browse/HDFS-4633
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, test
>    Affects Versions: 3.0.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>
> {{TestDFSClientExcludedNodes}} simulates failures of individual data nodes in the client's
write pipeline and checks the client's ability to recover.  HDFS-4246 added support for periodic
"forgiveness" by caching the list of known bad data nodes with a periodic eviction.  The test
uses a 1-second cache expiration.  This sometimes causes failed nodes to be forgiven too quickly,
which violates the assumptions of the test.

