hadoop-hdfs-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-889) Possible race condition in BlocksMap.NodeIterator.
Date Fri, 14 May 2010 15:23:44 GMT

    [ https://issues.apache.org/jira/browse/HDFS-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867527#action_12867527
] 

Steve Loughran commented on HDFS-889:
-------------------------------------

>Is this just a test bug?
I don't know. We've only seen it in tests, but that doesn't mean it hasn't happened out
in the field.

> Possible race condition in BlocksMap.NodeIterator.
> --------------------------------------------------
>
>                 Key: HDFS-889
>                 URL: https://issues.apache.org/jira/browse/HDFS-889
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Steve Loughran
>
> Hudson's test run for HDFS-165 is showing an NPE in {{org.apache.hadoop.hdfs.server.namenode.TestNodeCount.testNodeCount()}}
> One problem could be in {{BlocksMap.NodeIterator}}. Its {{hasNext()}} method checks
that the next entry isn't null. But what if, between the {{hasNext()}} call and the {{next()}} operation,
the map changes and an entry goes away? In that situation, the node returned from {{next()}} will
be null. 
> This is potentially serious, as a quick look through the code shows that the iterator
gets retrieved a lot, and everywhere Hadoop does so it assumes the value is not null. It's
also one of those problems that doesn't have a simple "make it go away" fix.
> Options
> # Ignore it, hope it doesn't happen very often and that the test failure was a one-off that
will never happen in a production datacentre. This is the default. The iterator is only used
in the namenode, so while it does depend on the number of datanodes, it isn't running on 4000 machines
in a big cluster.
> # Leave the iterator as is, and have all the in-Hadoop code check for a null value and break
out of the loop.
> # Patch the {{NodeIterator}} to be consistent with the {{Iterator}} specification and
throw a {{NoSuchElementException}} if the next value is null. This does not make the problem
go away, but it is now handled by having every in-Hadoop use catch the exception at the
right point and exit the loop. 
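
As a rough illustration of the third option, here is a minimal sketch of an iterator that re-checks the entry inside {{next()}} and throws instead of returning null. {{StrictNodeIterator}} and its array-backed storage are hypothetical stand-ins invented for this example; they are not the real {{BlocksMap.NodeIterator}} or its data structure.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for BlocksMap.NodeIterator; not the real Hadoop class. */
class StrictNodeIterator<T> implements Iterator<T> {
    // Shared backing array (an assumption for this sketch); other code may null out entries.
    private final T[] slots;
    private int next = 0;

    StrictNodeIterator(T[] slots) {
        this.slots = slots;
    }

    @Override
    public boolean hasNext() {
        return next < slots.length && slots[next] != null;
    }

    @Override
    public T next() {
        // Re-check here: the entry may have gone away since hasNext() returned true.
        T value = next < slots.length ? slots[next] : null;
        if (value == null) {
            throw new NoSuchElementException("entry removed during iteration");
        }
        next++;
        return value;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

Callers would then wrap their loop in a try/catch for {{NoSuchElementException}} and exit cleanly. This only turns a silent null into an explicit failure; it does not remove the underlying race.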
> Testing. This should be possible.
> # Create a block map
> # Iterate over a block
> # While the iterator is in progress, remove the next block in the list. Expect the next
call to {{next()}} to fail in whatever way you choose. 
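
The test steps above could be sketched single-threaded, with the removal done by hand between {{hasNext()}} and {{next()}} so that the outcome is deterministic. Everything here is hypothetical: {{NaiveNodeIterator}} only mimics the behaviour described in this issue (returning whatever is in the slot), and a plain array stands in for the block map.

```java
import java.util.Iterator;

/** Hypothetical reproduction harness; it does not use any real Hadoop classes. */
class RaceReproduction {

    /** Mimics the reported behaviour: next() returns the slot contents without re-checking. */
    static class NaiveNodeIterator implements Iterator<String> {
        private final String[] slots;
        private int next = 0;

        NaiveNodeIterator(String[] slots) {
            this.slots = slots;
        }

        public boolean hasNext() {
            return next < slots.length && slots[next] != null;
        }

        public String next() {
            return slots[next++];
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    /** Runs the three steps and returns what next() yields after the removal. */
    static String nextAfterRemoval() {
        String[] nodes = {"dn1", "dn2"};                 // 1. create the "map"
        Iterator<String> it = new NaiveNodeIterator(nodes);
        it.next();                                       // 2. start iterating
        boolean sawNext = it.hasNext();                  // true: "dn2" is still present
        nodes[1] = null;                                 // 3. remove the next entry
        return sawNext ? it.next() : "unreachable";      // the naive iterator hands back null
    }
}
```

With the naive iterator, {{nextAfterRemoval()}} returns null, which is the value callers are not prepared for. A real test would assert whichever failure mode is chosen from the options above.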

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

