hadoop-hdfs-issues mailing list archives

From "Virajith Jalaparti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9781) FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
Date Mon, 14 Mar 2016 18:12:33 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193797#comment-15193797
] 

Virajith Jalaparti commented on HDFS-9781:
------------------------------------------

Hi [~xiaochen] and [~jojochuang], 

I encountered this test failing, and while investigating the cause I realized
this might be a general problem (not particular to this test) arising from the inconsistent way
{{FsDatasetImpl#getBlockReports()}} and {{FsDatasetImpl#removeVolumes()}} handle {{volumeMap}}
and {{volumes}} under the {{FsDatasetImpl}} instance lock. {{FsDatasetImpl#removeVolumes()}}
accesses both objects while holding the lock, whereas {{FsDatasetImpl#getBlockReports()}} calls
{{volumes.getVolumes()}} without the lock but accesses {{volumeMap}} with the lock. I believe
this causes the NPE through the following sequence of events: 

# {{volumes.removeVolume()}} is called in {{FsDatasetImpl#removeVolumes()}}. Suppose this
removes Volume A.
# {{volumes.getVolumes()}} is called in {{FsDatasetImpl#getBlockReports()}}. This would return
all volumes except Volume A.
# {{volumes.waitVolumeRemoved()}} (added as part of HDFS-9701) is called in {{FsDatasetImpl#removeVolumes()}}
which releases the lock. 
# The for loop over {{volumeMap.replicas(bpid)}} in {{FsDatasetImpl#getBlockReports()}} starts
running. This returns {{ReplicaInfo}}s that are stored on Volume A, since {{FsDatasetImpl#removeVolumes()}}
has not yet deleted them. 
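The interleaving above can be replayed sequentially in a minimal sketch. The names here ({{StaleReplicaSketch}}, {{replicaToVolume}}, the string volumes) are all hypothetical stand-ins, not the actual {{FsDatasetImpl}} structures; the point is only that snapshotting {{volumes}} outside the lock while reading the replica map inside it lets the two reads straddle a concurrent removal:

```java
import java.util.*;

// Hypothetical model of the race: Volume A is removed from the volume list
// before the reader takes its snapshot, but the replica map still holds A's
// blocks because removeVolumes() has not yet purged them.
public class StaleReplicaSketch {
    public static void main(String[] args) {
        Object datasetLock = new Object();
        List<String> volumes = new ArrayList<>(List.of("A", "B"));
        Map<String, String> replicaToVolume = new LinkedHashMap<>();
        replicaToVolume.put("blk_0", "A"); // replica stored on Volume A
        replicaToVolume.put("blk_1", "B");

        // Step 1: removeVolumes() removes Volume A from the volume list.
        synchronized (datasetLock) {
            volumes.remove("A");
        }

        // Step 2: getBlockReports() snapshots the volumes WITHOUT the lock;
        // Volume A is already absent from the snapshot.
        List<String> snapshot = new ArrayList<>(volumes);

        // Steps 3-4: removeVolumes() releases the lock (waitVolumeRemoved())
        // before deleting A's replicas, so the map still references Volume A.
        synchronized (datasetLock) {
            for (Map.Entry<String, String> r : replicaToVolume.entrySet()) {
                if (!snapshot.contains(r.getValue())) {
                    // Analogue of the NPE: a replica whose volume is missing
                    // from the reader's view of the volume list.
                    System.out.println("stale replica: " + r.getKey());
                }
            }
        }
    }
}
```

Running this prints the one replica whose volume has vanished from the reader's snapshot, which is the situation the real code turns into a NullPointerException.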

If I understand correctly, {{volumes.waitVolumeRemoved()}} was added in HDFS-9701 to avoid
the deadlock issue addressed there. 

One way to fix this might be to call {{volumes.waitVolumeRemoved()}} before {{volumes.removeVolume()}}
in {{FsDatasetImpl#removeVolumes()}}, and to make {{FsDatasetImpl#getBlockReports()}} hold the
lock while referring to both {{volumes}} and {{volumeMap}}, i.e., move {{synchronized(this)}}
before the call to {{volumes.getVolumes()}}. 
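A sketch of the {{getBlockReports()}} side of that fix, with hypothetical simplified structures (a string list for {{volumes}}, a plain map for {{volumeMap}}; the real signatures differ), showing the lock taken before the volume snapshot so both reads happen in one critical section:

```java
import java.util.*;

// Hypothetical, simplified stand-in for FsDatasetImpl; only the locking
// order is the point, not the actual data types.
public class LockedBlockReportsSketch {
    private final List<String> volumes = new ArrayList<>(List.of("A", "B"));
    private final Map<String, String> replicaToVolume = new LinkedHashMap<>();

    LockedBlockReportsSketch() {
        replicaToVolume.put("blk_0", "A");
        replicaToVolume.put("blk_1", "B");
    }

    Map<String, List<String>> getBlockReports() {
        Map<String, List<String>> reports = new HashMap<>();
        // Lock taken BEFORE the volume snapshot (the proposed change:
        // synchronized moved up, ahead of volumes.getVolumes()).
        synchronized (this) {
            for (String v : volumes) {
                reports.put(v, new ArrayList<>());
            }
            // Both structures are read under the same lock, so every
            // replica's volume is guaranteed to be present in `reports`.
            for (Map.Entry<String, String> r : replicaToVolume.entrySet()) {
                reports.get(r.getValue()).add(r.getKey());
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        Map<String, List<String>> reports =
                new LockedBlockReportsSketch().getBlockReports();
        System.out.println(reports.get("A"));
    }
}
```

With both reads inside one critical section, a concurrent removal can no longer slip between the volume snapshot and the map iteration, at the cost of holding the instance lock for the duration of the snapshot.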

> FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9781
>                 URL: https://issues.apache.org/jira/browse/HDFS-9781
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Xiao Chen
>         Attachments: HDFS-9781.01.patch
>
>
> FsDatasetImpl#getBlockReports occasionally throws NPE. The NPE caused TestFsDatasetImpl#testRemoveVolumeBeingWritten
to time out, because the test waits for the call to FsDatasetImpl#getBlockReports to complete
without exceptions.
> Additionally, the test should be updated to identify an expected exception, using {{GenericTestUtils.assertExceptionContains()}}
> {noformat}
> Exception in thread "Thread-20" java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1709)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1BlockReportThread.run(TestFsDatasetImpl.java:587)
> 2016-02-08 15:47:30,379 [Thread-21] WARN  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:run(606))
- Exception caught. This should not affect the test
> java.io.IOException: Failed to move meta file for ReplicaBeingWritten, blk_0_0, RBW
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= 0
>   getVolume()       = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current
>   getBlockFile()    = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0
>   bytesAcked=0
>   bytesOnDisk=0 from /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta
to /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:857)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addFinalizedBlock(BlockPoolSlice.java:295)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addFinalizedBlock(FsVolumeImpl.java:819)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1620)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1601)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1ResponderThread.run(TestFsDatasetImpl.java:603)
> Caused by: java.io.IOException: renameTo(src=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta,
dst=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta)
failed.
>         at org.apache.hadoop.io.nativeio.NativeIO.renameTo(NativeIO.java:873)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:855)
>         ... 5 more
> 2016-02-08 15:47:34,381 [Thread-19] INFO  impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(287))
- Volume reference is released.
> 2016-02-08 15:47:34,384 [Thread-19] INFO  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:testRemoveVolumeBeingWritten(622))
- Volumes removed
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
