hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-9781) FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
Date Thu, 01 Sep 2016 22:49:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453821#comment-15453821 ]

Manoj Govindassamy edited comment on HDFS-9781 at 9/1/16 10:48 PM:
-------------------------------------------------------------------

Ref: {noformat}
  @Override
  public Map<DatanodeStorage, BlockListAsLongs> getBlockReports(String bpid) {
    Map<DatanodeStorage, BlockListAsLongs> blockReportsMap =
        new HashMap<DatanodeStorage, BlockListAsLongs>();

    Map<String, BlockListAsLongs.Builder> builders =
        new HashMap<String, BlockListAsLongs.Builder>();

    List<FsVolumeImpl> curVolumes = volumes.getVolumes();
    for (FsVolumeSpi v : curVolumes) {
        builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
    }

    try (AutoCloseableLock lock = datasetLock.acquire()) {
      for (ReplicaInfo b : volumeMap.replicas(bpid)) {
        switch(b.getState()) {
          case FINALIZED:
          case RBW:
          case RWR:
            builders.get(b.getVolume().getStorageID()).add(b);   <=== NPE
            break;
          case RUR:
{noformat}

[~virajith],

{quote}
One way to fix this might be to call volumes.waitVolumeRemoved() before volumes.removeVolume() in FsDatasetImpl#removeVolumes()
{quote}
* {noformat}
          LOG.info("Removing " + absRoot + " from FsDataset.");
          // Disable the volume from the service.
          asyncDiskService.removeVolume(sd.getCurrentDir());
          volumes.removeVolume(absRoot, clearFailure);
          volumes.waitVolumeRemoved(5000, this);
{noformat}

* I believe {{volumes.waitVolumeRemoved(5000, this)}} cannot be moved before {{volumes.removeVolume(absRoot, clearFailure)}}, because waitVolumeRemoved() depends on the volumesBeingRemoved queue, which is filled in by removeVolume().
* Also, the intention of waitVolumeRemoved() is to wait for the removed volume's reference count to drop to zero. (A sketch of this dependency follows the list.)
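
For reference, below is a minimal, self-contained model of that ordering dependency. The queue name volumesBeingRemoved comes from the discussion above; the class shape and reference counting are simplified stand-ins for illustration, not the real FsVolumeList code.
{noformat}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: removeVolume() queues the volume; waitVolumeRemoved()
// polls that same queue until each queued volume's reference count hits zero.
class VolumeListSketch {
  static class Volume {
    final AtomicInteger references = new AtomicInteger(1);
  }

  private final Queue<Volume> volumesBeingRemoved = new ConcurrentLinkedQueue<>();

  // Fills the queue that waitVolumeRemoved() depends on ...
  void removeVolume(Volume v) {
    volumesBeingRemoved.add(v);
  }

  // ... so calling this first would find an empty queue and return
  // immediately, having waited for nothing.
  void waitVolumeRemoved(int sleepMillis, Object monitor)
      throws InterruptedException {
    Volume v;
    while ((v = volumesBeingRemoved.peek()) != null) {
      if (v.references.get() == 0) {
        volumesBeingRemoved.poll();  // this volume is fully released
      } else {
        synchronized (monitor) {
          monitor.wait(sleepMillis); // re-check after a bounded sleep
        }
      }
    }
  }
}
{noformat}
Under this model, swapping the two calls would turn waitVolumeRemoved() into a no-op rather than an error, which silently defeats its purpose.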


[~daryn], [~virajith],

{quote}
make FsDatasetImpl#getBlockReports() hold the lock while it is referring to both volumes and volumeMap i.e., move synchronized(this) before volumes.getVolumes().
{quote}

* I guess your proposal here is something like below (in the latest code base, I am assuming datasetLock is the synchronized block you were referring to):
** {noformat}
    try (AutoCloseableLock lock = datasetLock.acquire()) {
      curVolumes = volumes.getVolumes();
      for (FsVolumeSpi v : curVolumes) {
        builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
      }
      ...
    }
{noformat}

* IMHO, moving volumes.getVolumes() alone inside the lock will not solve the problem, particularly the NPE we are seeing in TestDataNodeHotSwapVolumes, because the NPE happens on the add() call after builders.get(<storageID>). That means the builders map doesn't contain the storageID being passed in.
* That is,
** The {{curVolumes}} list is not the problem here. It has the right set of volumes after the previous volume removal operation.
** It's the {{ReplicaInfo}} from {{volumeMap.replicas(bpid)}} that still holds a reference to the removed volume.
** But the {{builders}} map has no entry for the removed volume.
** So {{builders.get(b.getVolume().getStorageID()).add(b);}} effectively becomes *null.add(b)*, hence the NPE. (A toy reproduction follows this list.)
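
To make that concrete, here is a toy, self-contained reproduction of the same shape using plain Java collections rather than the HDFS types; the storage IDs are made up.
{noformat}
import java.util.HashMap;
import java.util.Map;

public class NpeRepro {
  public static void main(String[] args) {
    // builders is populated only from the *current* volume list ...
    Map<String, StringBuilder> builders = new HashMap<>();
    builders.put("DS-volume-1", new StringBuilder());

    // ... but the replica still carries the storage ID of the removed volume.
    String staleStorageId = "DS-volume-2";

    // Same shape as builders.get(b.getVolume().getStorageID()).add(b):
    // get() returns null, so this line throws NullPointerException.
    builders.get(staleStorageId).append("blk_0");
  }
}
{noformat}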

Please let me know if my understanding is wrong.



> FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9781
>                 URL: https://issues.apache.org/jira/browse/HDFS-9781
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0-alpha1
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-9781.01.patch
>
>
> FsDatasetImpl#getBlockReports occasionally throws NPE. The NPE caused TestFsDatasetImpl#testRemoveVolumeBeingWritten to time out, because the test waits for the call to FsDatasetImpl#getBlockReports to complete without exceptions.
> Additionally, the test should be updated to identify an expected exception, using {{GenericTestUtils.assertExceptionContains()}} (a usage sketch follows the log below).
> {noformat}
> Exception in thread "Thread-20" java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1709)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1BlockReportThread.run(TestFsDatasetImpl.java:587)
> 2016-02-08 15:47:30,379 [Thread-21] WARN  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:run(606)) - Exception caught. This should not affect the test
> java.io.IOException: Failed to move meta file for ReplicaBeingWritten, blk_0_0, RBW
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= 0
>   getVolume()       = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current
>   getBlockFile()    = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0
>   bytesAcked=0
>   bytesOnDisk=0 from /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta to /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:857)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addFinalizedBlock(BlockPoolSlice.java:295)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addFinalizedBlock(FsVolumeImpl.java:819)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1620)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1601)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1ResponderThread.run(TestFsDatasetImpl.java:603)
> Caused by: java.io.IOException: renameTo(src=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta, dst=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta) failed.
>         at org.apache.hadoop.io.nativeio.NativeIO.renameTo(NativeIO.java:873)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:855)
>         ... 5 more
> 2016-02-08 15:47:34,381 [Thread-19] INFO  impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(287)) - Volume reference is released.
> 2016-02-08 15:47:34,384 [Thread-19] INFO  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:testRemoveVolumeBeingWritten(622)) - Volumes removed
> {noformat}
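
Regarding the {{GenericTestUtils.assertExceptionContains()}} suggestion in the description, a minimal usage sketch might look like the following. The call under test, the variable names, and the expected message text (taken from the log above) are assumptions for illustration, not the actual patch.
{noformat}
import static org.junit.Assert.fail;

import org.apache.hadoop.test.GenericTestUtils;
import org.junit.Test;

public class TestGetBlockReportsSketch {
  @Test
  public void testGetBlockReportsSurfacesExpectedException() throws Exception {
    // 'dataset' and 'bpid' would come from the existing test fixture.
    try {
      dataset.getBlockReports(bpid);  // hypothetical call under test
      fail("Expected getBlockReports to fail");
    } catch (Exception e) {
      // Passes only if the exception's message contains the expected text;
      // otherwise the assertion fails with diagnostics.
      GenericTestUtils.assertExceptionContains("Failed to move meta file", e);
    }
  }
}
{noformat}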


