hadoop-hdfs-issues mailing list archives

From "Rushabh S Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10816) TestComputeInvalidateWork#testDatanodeReRegistration fails due to race between test and replication monitor
Date Tue, 30 Aug 2016 20:18:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450035#comment-15450035 ]

Rushabh S Shah commented on HDFS-10816:
---------------------------------------

[~ebadger]: Thanks for reporting and analyzing the failure.
This test broke in our internal build recently.
Below are the relevant logs:
{noformat}
2016-08-29 01:54:49,332 INFO  impl.RamDiskAsyncLazyPersistService (RamDiskAsyncLazyPersistService.java:shutdown(169)) - All async lazy persist service threads have been shut down
2016-08-29 01:54:49,336 INFO  datanode.DataNode (DataNode.java:shutdown(1791)) - Shutdown complete.
2016-08-29 01:54:49,347 INFO  BlockStateChange (BlockManager.java:addToInvalidates(1228)) - BLOCK* addToInvalidates: blk_1073741825_1001 127.0.0.1:57662 127.0.0.1:43137 127.0.0.1:59637
2016-08-29 01:54:49,349 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditMessage(8476)) - allowed=true	ugi=tortuga (auth:SIMPLE)	ip=/127.0.0.1	cmd=delete	src=/testRR	dst=null	perm=null	proto=rpc
2016-08-29 01:54:49,350 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3582)) - BLOCK* BlockManager: ask 127.0.0.1:59637 to delete [blk_1073741825_1001]
2016-08-29 01:54:49,355 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1725)) - Shutting down the Mini HDFS Cluster
{noformat}

bq. 2016-08-29 01:54:49,336 INFO  datanode.DataNode (DataNode.java:shutdown(1791)) - Shutdown complete.
This line corresponds to the shutdown of the last datanode.
bq. 2016-08-29 01:54:49,347 INFO  BlockStateChange (BlockManager.java:addToInvalidates(1228)) - BLOCK* addToInvalidates: blk_1073741825_1001 127.0.0.1:57662 127.0.0.1:43137 127.0.0.1:59637

After the last datanode stopped, the block was added to the invalidate set for all three datanodes, so the invalidateBlocks size is 3.
bq. 2016-08-29 01:54:49,350 INFO  BlockStateChange (BlockManager.java:invalidateWorkForOneNode(3582)) - BLOCK* BlockManager: ask 127.0.0.1:59637 to delete \[blk_1073741825_1001\]
Then the replication monitor woke up and removed one block from the invalidateBlocks set, leaving 2.

I think the test checked the invalidateBlocks size just after the replication monitor computed invalidate work for one node, so the assertion saw 2 instead of the expected 3 and failed.
I think stopping the replication monitor is the correct fix.
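
The interleaving described above can be modelled with a small, self-contained Java sketch. It does not use any HDFS classes: `PendingInvalidates`, `computeWorkForOneNode`, and `pendingAfterDelete` are hypothetical names chosen for illustration, and the "monitor" is a direct method call rather than a real thread, so the result is deterministic. Only the block ID and datanode addresses are taken from the logs.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class InvalidateRaceSketch {

    // Hypothetical stand-in for the per-datanode set of blocks awaiting deletion.
    static final class PendingInvalidates {
        private final Map<String, Set<String>> byNode = new LinkedHashMap<>();

        // Queue one block for deletion on each of the given datanodes.
        void add(String block, String... nodes) {
            for (String node : nodes) {
                byNode.computeIfAbsent(node, n -> new LinkedHashSet<>()).add(block);
            }
        }

        // One monitor pass: hand a single node its pending blocks and drop them
        // from the set. Which node is picked does not matter for the count.
        void computeWorkForOneNode() {
            Iterator<Map.Entry<String, Set<String>>> it = byNode.entrySet().iterator();
            if (it.hasNext()) {
                it.next();
                it.remove();
            }
        }

        // Total number of (node, block) entries still pending.
        int size() {
            int total = 0;
            for (Set<String> blocks : byNode.values()) {
                total += blocks.size();
            }
            return total;
        }
    }

    // Models "delete the file, then check the size", with or without a
    // monitor pass slipping in between the two steps.
    static int pendingAfterDelete(boolean monitorRunning) {
        PendingInvalidates pending = new PendingInvalidates();
        pending.add("blk_1073741825_1001",
                "127.0.0.1:57662", "127.0.0.1:43137", "127.0.0.1:59637");
        if (monitorRunning) {
            pending.computeWorkForOneNode();
        }
        return pending.size();
    }

    public static void main(String[] args) {
        System.out.println("monitor running: " + pendingAfterDelete(true));
        System.out.println("monitor stopped: " + pendingAfterDelete(false));
    }
}
```

With the monitor pass in between, the count is 2; with the monitor stopped, it stays at 3, which is the value the test asserts.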

[~jojochuang], [~zhz]: Since you reviewed HDFS-9580, could you please help review this patch?

> TestComputeInvalidateWork#testDatanodeReRegistration fails due to race between test and replication monitor
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10816
>                 URL: https://issues.apache.org/jira/browse/HDFS-10816
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>         Attachments: HDFS-10816.001.patch
>
>
> {noformat}
> java.lang.AssertionError: Expected invalidate blocks to be the number of DNs expected:<3> but was:<2>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotEquals(Assert.java:743)
> 	at org.junit.Assert.assertEquals(Assert.java:118)
> 	at org.junit.Assert.assertEquals(Assert.java:555)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.TestComputeInvalidateWork.testDatanodeReRegistration(TestComputeInvalidateWork.java:160)
> {noformat}
> The test fails because of a race condition between the test and the replication monitor. The default replication monitor interval is 3 seconds, which is just about how long the test normally takes to run. The test deletes a file and then acquires the namesystem write lock. However, if the replication monitor fires between those two steps, the test will fail because the monitor will itself invalidate one of the blocks. This can be easily reproduced by removing the sleep() in the ReplicationMonitor's run() method in BlockManager.java, so that the replication monitor executes as quickly as possible and exacerbates the race.
> To fix the test, all that needs to be done is to turn off the replication monitor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

