hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9781) FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
Date Thu, 01 Sep 2016 02:05:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454036#comment-15454036 ]

Manoj Govindassamy commented on HDFS-9781:
------------------------------------------

Ref:
{code:title=FsDatasetImpl.java|borderStyle=solid}
  @Override
  public void removeVolumes(Set<File> volumesToRemove, boolean clearFailure) {
    ..
    ..
    try (AutoCloseableLock lock = datasetLock.acquire()) {             // <== LOCK acquire datasetLock
      for (int idx = 0; idx < dataStorage.getNumStorageDirs(); idx++) {
          .. .. ..
          asyncDiskService.removeVolume(sd.getCurrentDir());           // <== remove volume SD1
          volumes.removeVolume(absRoot, clearFailure);
          volumes.waitVolumeRemoved(5000, this);                       // <== WAIT on "this" ?? But, we
                                                                       //     haven't locked it yet. This will
                                                                       //     cause IllegalMonitorStateException
                                                                       //     and crash getBlockReports()/FBR thread!
          for (String bpid : volumeMap.getBlockPoolList()) {
            List<ReplicaInfo> blocks = new ArrayList<>();
            for (Iterator<ReplicaInfo> it = volumeMap.replicas(bpid).iterator();
                 it.hasNext(); ) {
                .. .. ..
                it.remove();                                           // <== volumeMap removal
              }
            blkToInvalidate.put(bpid, blocks);
          }
         .. ..
    }                                                                  // <== LOCK release datasetLock

    // Call this outside the lock.
    for (Map.Entry<String, List<ReplicaInfo>> entry :
        blkToInvalidate.entrySet()) {
      ..
      for (ReplicaInfo block : blocks) {
        invalidate(bpid, block);                                       // <== Notify NN of block removal
      }
    }
{code}


* The NPE is because of the contending operations in {{FsDatasetImpl}} between {{getBlockReports()}} and {{removeVolumes()}}.

* Thread 1: {{removeVolumes()}}
** LOCK datasetLock
** remove the volume from {{FsVolumeList}} volumes
** wait for the volume's references to go to zero
*** But, volumes.waitVolumeRemoved() waits on the "this" monitor, which this thread has not grabbed at all. This will cause IllegalMonitorStateException and crash the whole thread!
*** I assume the intention here is to either wait on datasetLock OR lock and then wait on the "this" monitor. And for 5 seconds max!
** queue the blocks from the removed volume into an invalidation map
** remove the block pool mappings for the removed volume from {{ReplicaMap}} volumeMap
** UNLOCK datasetLock
* continue with block invalidation
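The IllegalMonitorStateException risk above can be demonstrated outside HDFS with a minimal, self-contained sketch (no HDFS classes involved; plain {{Object}} monitors stand in for the dataset state):

```java
// Minimal sketch: Object.wait() called without holding the object's monitor
// throws IllegalMonitorStateException -- the same hazard as calling
// volumes.waitVolumeRemoved(5000, this) when "this" was never synchronized on.
public class MonitorDemo {
    public static void main(String[] args) throws InterruptedException {
        Object monitor = new Object();

        // Wrong: wait() outside a synchronized block on the same object.
        try {
            monitor.wait(100);
        } catch (IllegalMonitorStateException e) {
            System.out.println("IllegalMonitorStateException");
        }

        // Right: hold the monitor before waiting, with a bounded timeout.
        synchronized (monitor) {
            monitor.wait(100);  // returns after ~100 ms since nobody notifies
        }
        System.out.println("bounded wait returned");
    }
}
```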


* Thread 2: {{getBlockReports()}}
** Get the current volumes from {{FsVolumeList}} and initialize the FBR map
** LOCK datasetLock
** Get all {{ReplicaInfo}} from {{ReplicaMap}} for the given block pool id
** Transfer ReplicaInfo to a respective volume entry in FBR map 
** UNLOCK datasetLock
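A hypothetical, simplified sketch of how the race produces the NPE (plain collections stand in for {{FsVolumeList}} and {{ReplicaMap}}; the names are illustrative, not the real HDFS APIs): the FBR map is seeded only from the live volumes, so a stale replica mapping for a removed volume makes the lookup return null.

```java
import java.util.*;

// Illustrative sketch only: the report map is seeded from the live volume
// list, but the replica map can still hold replicas of a volume removed in
// between -- so report.get(volume) comes back null, as in the stack trace.
public class BlockReportRace {
    public static void main(String[] args) {
        List<String> liveVolumes = new ArrayList<>(List.of("vol1"));   // "vol2" already removed
        Map<String, List<String>> replicasByVolume = new TreeMap<>();
        replicasByVolume.put("vol1", List.of("blk_1"));
        replicasByVolume.put("vol2", List.of("blk_2"));                // stale mapping, not yet cleaned

        Map<String, List<String>> report = new HashMap<>();
        for (String v : liveVolumes) {
            report.put(v, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> e : replicasByVolume.entrySet()) {
            List<String> bucket = report.get(e.getKey());
            if (bucket == null) {
                // Without this check, bucket.addAll(...) would NPE here.
                System.out.println("would NPE on stale volume " + e.getKey());
                continue;                                              // option (B): skip removed volumes
            }
            bucket.addAll(e.getValue());
        }
        System.out.println("report=" + report);
    }
}
```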

* Because of the wait() (assuming the monitor object is fixed) in {{removeVolumes()}}, the two operations cannot be made totally exclusive of each other!
* So, what is the expected behavior of FBR when a volume remove is in progress ??
** (A) Should it include replica info even for the volumes being removed ?
** (B) Or, should it include replica info only for the live volumes ?
I am inclined towards (B).

So, my proposal here is:
1. Fix {{getBlockReports()}} as in (B) -- ignore the unavailable volumes while iterating the {{ReplicaMap}}, since its mappings are not removed right away and a wait() could happen in between. // as part of this bug
2. Fix {{removeVolumes()}} so that volumes.waitVolumeRemoved() either carries the right monitor object or takes the added lock for the new monitor. // this bug or a new bug ??

[~eddyxu], [~xiaochen], [~virajith], [~daryn],  Please let me know your thoughts on above.

> FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9781
>                 URL: https://issues.apache.org/jira/browse/HDFS-9781
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0-alpha1
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-9781.01.patch
>
>
> FsDatasetImpl#getBlockReports occasionally throws NPE. The NPE caused TestFsDatasetImpl#testRemoveVolumeBeingWritten to time out, because the test waits for the call to FsDatasetImpl#getBlockReports to complete without exceptions.
> Additionally, the test should be updated to identify an expected exception, using {{GenericTestUtils.assertExceptionContains()}}
> {noformat}
> Exception in thread "Thread-20" java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1709)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1BlockReportThread.run(TestFsDatasetImpl.java:587)
> 2016-02-08 15:47:30,379 [Thread-21] WARN  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:run(606)) - Exception caught. This should not affect the test
> java.io.IOException: Failed to move meta file for ReplicaBeingWritten, blk_0_0, RBW
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= 0
>   getVolume()       = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current
>   getBlockFile()    = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0
>   bytesAcked=0
>   bytesOnDisk=0 from /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta to /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:857)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addFinalizedBlock(BlockPoolSlice.java:295)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addFinalizedBlock(FsVolumeImpl.java:819)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1620)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1601)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1ResponderThread.run(TestFsDatasetImpl.java:603)
> Caused by: java.io.IOException: renameTo(src=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta, dst=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta) failed.
>         at org.apache.hadoop.io.nativeio.NativeIO.renameTo(NativeIO.java:873)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:855)
>         ... 5 more
> 2016-02-08 15:47:34,381 [Thread-19] INFO  impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(287)) - Volume reference is released.
> 2016-02-08 15:47:34,384 [Thread-19] INFO  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:testRemoveVolumeBeingWritten(622)) - Volumes removed
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

