hadoop-hdfs-issues mailing list archives

From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1257) Race condition introduced by HADOOP-5124
Date Mon, 27 Jun 2011 15:43:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055606#comment-13055606 ]

Eric Payne commented on HDFS-1257:

What is the status of this Jira?

I believe that I am also running into this issue. I am using the yahoo_merge branch, but it
should be the same in all branches.

When running stress tests, the NameNode daemon receives a ConcurrentModificationException
and exits during certain race conditions.

This seems to be a fairly critical bug that could cause the NameNode to exit under stress.

The configuration I am using runs a single independent NameNode on one machine and
hundreds of simulated (by MiniDFSCluster) datanodes on each of 9 other machines, for a total
of up to 2000 simulated datanodes.

Then, in this environment, the DataNodeGenerator test is run, which does random reads, creates,
writes, and deletes. The goal is to stress the NameNode with hundreds of operations per second.

In some race conditions, when ReplicationMonitor() is calculating invalid blocks, the recentInvalidateSets
TreeMap within BlockManager is being modified by one thread while the ReplicationMonitor()
is iterating over it.

Here is the exception and stack traceback:

2011-06-08 15:33:41,551 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: ReplicationMonitor thread received Runtime exception.
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
        at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
        at java.util.AbstractCollection.toArray(AbstractCollection.java:124)
        at java.util.ArrayList.<init>(ArrayList.java:131)
        at org.apache.hadoop.hdfs.server.namenode.BlockManager.computeInvalidateWork(BlockManager.java:682)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeDatanodeWork(FSNamesystem.java:2978)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:2925)
        at java.lang.Thread.run(Thread.java:619)
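
The trace shows the fail-fast check in TreeMap's key iterator firing while an ArrayList
is being built from the key set. As a minimal sketch of that mechanism (not the HDFS code),
the example below reproduces it deterministically in a single thread. The class name, the
local map, and the "storage-N" keys are made up for illustration, and the put() between the
two next() calls stands in for the second thread that modifies the map.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it only mimics the shape of recentInvalidateSets.
final class TreeMapCmeDemo {

    /** Returns true if a structural change during iteration trips the fail-fast check. */
    static boolean modifyWhileIterating() {
        Map<String, Collection<Long>> invalidateSets = new TreeMap<>();
        invalidateSets.put("storage-1", new ArrayList<>());
        invalidateSets.put("storage-2", new ArrayList<>());

        // Start iterating over the keys, as the ArrayList copy constructor does internally.
        Iterator<String> it = invalidateSets.keySet().iterator();
        it.next();

        // Assumption: this put() plays the role of another thread adding an entry.
        invalidateSets.put("storage-3", new ArrayList<>());

        try {
            it.next(); // the iterator notices the map changed underneath it
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + modifyWhileIterating());
    }
}
```

With two real threads the same check fires only when the modification happens to land
between two iterator steps, which would be consistent with the exception showing up only
under stress.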

One thing I did try was to go into the BlockManager and put 'synchronized()' around all places
that iterate over, add to, or remove from the recentInvalidateSets TreeMap variable.

I'm not sure what performance (or other unforeseen) ramifications this may have.

However, I was able to eliminate the ConcurrentModificationException by using this fix, at
least in my test.
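
To make the approach concrete, here is a rough sketch of the idea of synchronizing every
place that iterates over, adds to, or removes from the map. It is not the attached patch
and not the real BlockManager: the class, its methods, and the stress() helper are
hypothetical, and locking on the map object itself is just one possible choice of monitor.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the part of BlockManager that owns recentInvalidateSets.
final class GuardedInvalidateSets {
    private final Map<String, Collection<Long>> recentInvalidateSets = new TreeMap<>();

    void add(String storageId, long blockId) {
        synchronized (recentInvalidateSets) {
            recentInvalidateSets.computeIfAbsent(storageId, k -> new HashSet<>()).add(blockId);
        }
    }

    void remove(String storageId) {
        synchronized (recentInvalidateSets) {
            recentInvalidateSets.remove(storageId);
        }
    }

    /** Copies the keys under the same lock, so no writer can interleave with the iteration. */
    List<String> snapshotKeys() {
        synchronized (recentInvalidateSets) {
            return new ArrayList<>(recentInvalidateSets.keySet());
        }
    }

    /** One writer thread mutates the map while the caller keeps copying the key set. */
    static int stress(int iterations) {
        GuardedInvalidateSets sets = new GuardedInvalidateSets();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                sets.add("storage-" + i, i);
                if (i % 3 == 0) {
                    sets.remove("storage-" + i);
                }
            }
        });
        writer.start();
        try {
            while (writer.isAlive()) {
                sets.snapshotKeys(); // would risk a ConcurrentModificationException if unguarded
            }
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sets.snapshotKeys().size();
    }

    public static void main(String[] args) {
        System.out.println("entries left: " + stress(3000));
    }
}
```

The cost is that every reader and writer now serializes on one monitor, which is where any
performance impact would come from. Copying the keys while holding the lock and doing the
rest of the work on the copy keeps the critical section short.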

> Race condition introduced by HADOOP-5124
> ----------------------------------------
>                 Key: HDFS-1257
>                 URL: https://issues.apache.org/jira/browse/HDFS-1257
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Ramkumar Vadali
>         Attachments: HDFS-1257.patch
> HADOOP-5124 provided some improvements to FSNamesystem#recentInvalidateSets. But it introduced
unprotected access to the data structure recentInvalidateSets. Specifically, FSNamesystem.computeInvalidateWork
accesses recentInvalidateSets without read-lock protection. If there is concurrent activity
(like reducing replication on a file) that adds to recentInvalidateSets, the name-node crashes
with a ConcurrentModificationException.
