hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11070) NPE in BlockSender due to race condition
Date Wed, 02 Nov 2016 19:50:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Wei-Chiu Chuang updated HDFS-11070:
-----------------------------------
    Attachment: HDFS-11070.001.patch

Uploaded a trivial fix to remove the possible race condition. The fix uses the block ID instead of the ExtendedBlock to look up the FsVolumeImpl object.
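A minimal sketch of the idea behind the fix, using invented map-based stand-ins rather than the real FsDatasetImpl API: a lookup keyed on the full ExtendedBlock (block ID plus genstamp) goes stale the moment a concurrent append bumps the genstamp, while a lookup keyed on the numeric block ID alone still succeeds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simulation (not the real FsDatasetImpl API): shows why keying
// the volume lookup on the block id, rather than on id + genstamp, is immune
// to a concurrent genstamp bump.
public class VolumeLookupSketch {
    // lookup keyed by block id only, as in the proposed fix
    static final Map<Long, String> volumeByBlockId = new ConcurrentHashMap<>();

    // lookup keyed by "id_genstamp"; goes stale after an append bumps the
    // genstamp, which mirrors getVolume(block) returning null in BlockSender
    static final Map<String, String> volumeByExtendedBlock = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        long blockId = 1073741825L;
        long genstamp = 1192L;
        volumeByBlockId.put(blockId, "/data/data1");
        volumeByExtendedBlock.put(blockId + "_" + genstamp, "/data/data1");

        // a concurrent append bumps the genstamp: 1192 -> 1193
        volumeByExtendedBlock.remove(blockId + "_" + genstamp);
        volumeByExtendedBlock.put(blockId + "_" + 1193L, "/data/data1");

        // the reader still holds the stale genstamp 1192
        String byExtended = volumeByExtendedBlock.get(blockId + "_" + genstamp);
        String byId = volumeByBlockId.get(blockId);

        System.out.println(byExtended); // null: obtainReference() would NPE
        System.out.println(byId);       // /data/data1: lookup still succeeds
    }
}
```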

> NPE in BlockSender due to race condition
> ----------------------------------------
>
>                 Key: HDFS-11070
>                 URL: https://issues.apache.org/jira/browse/HDFS-11070
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0-alpha2
>            Reporter: Wei-Chiu Chuang
>         Attachments: HDFS-11070.001.patch
>
>
> Saw the following NPE in a unit test:
> {quote}
> 2016-10-27 14:42:58,450 ERROR DataNode - 127.0.0.1:51987:DataXceiver error processing READ_BLOCK operation  src: /127.0.0.1:52429 dst: /127.0.0.1:51987
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:284)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:581)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:150)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:102)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> The NPE occurred here:
> {code:title=BlockSender.<init>}
>       // Obtain a reference before reading data
>       this.volumeRef = datanode.data.getVolume(block).obtainReference();
> {code}
> Right before the NPE were a few lines of debug messages indicating that the replica had been appended to and updated.
> {quote}
> 2016-10-27 14:42:58,442 DEBUG DataNode - block=BP-1071315328-172.16.1.88-1477604513635:blk_1073741825_1192, replica=FinalizedReplica, blk_1073741825_1192, FINALIZED
>   getNumBytes()     = 192
>   getBytesOnDisk()  = 192
>   getVisibleLength()= 192
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-27 14:42:58,442 INFO  FsDatasetImpl - Appending to FinalizedReplica, blk_1073741825_1192, FINALIZED
>   getNumBytes()     = 192
>   getBytesOnDisk()  = 192
>   getVisibleLength()= 192
>   getVolume()       = /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1
>   getBlockURI()     = file:/Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825
> 2016-10-27 14:42:58,442 DEBUG FsDatasetCache - Block with id 1073741825, pool BP-1071315328-172.16.1.88-1477604513635 does not need to be uncached, because it is not currently in the mappableBlockMap.
> 2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825_1192.meta to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta
> 2016-10-27 14:42:58,450 DEBUG LocalReplica - Renaming /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/finalized/subdir0/subdir0/blk_1073741825 to /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825, file length=192
> 2016-10-27 14:42:58,450 DEBUG DataNode - writeTo blockfile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825 of size 192
> 2016-10-27 14:42:58,450 DEBUG DataNode - writeTo metafile is /Users/weichiu/sandbox/hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/target/test-dir/dfs/data/data1/current/BP-1071315328-172.16.1.88-1477604513635/current/rbw/blk_1073741825_1193.meta of size 11
> {quote}
> The block object's genstamp should have matched that of the on-disk replica. However, the log suggests the replica's genstamp may have been updated after the following check:
> {code:title=BlockSender.<init>}
>       try(AutoCloseableLock lock = datanode.data.acquireDatasetLock()) {
>         replica = getReplica(block, datanode);
>         replicaVisibleLength = replica.getVisibleLength();
>       }
>       // if there is a write in progress
>       ChunkChecksum chunkChecksum = null;
>       if (replica.getState() == ReplicaState.RBW) {
>         final ReplicaInPipeline rbw = (ReplicaInPipeline) replica;
>         waitForMinLength(rbw, startOffset + length);
>         chunkChecksum = rbw.getLastChecksumAndDataLen();
>       }
> {code}
> In summary, I think the assumption here is not valid, because a write in progress may begin after that check.
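The race window described above can be sketched as a two-thread timeline (hypothetical names, not the real DataNode classes): the reader samples the replica state and genstamp, the dataset lock is released, and an append that starts in between bumps the genstamp before the reader obtains the volume reference.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the race window: a reader samples the genstamp,
// then a concurrent append bumps it before the reader uses its sampled value.
public class RaceWindowSketch {
    static final AtomicLong genstamp = new AtomicLong(1192);

    public static void main(String[] args) throws InterruptedException {
        // step 1: reader samples the genstamp (the ExtendedBlock it holds)
        long readerGenstamp = genstamp.get();
        CountDownLatch appended = new CountDownLatch(1);

        // step 2: an append bumps the genstamp: 1192 -> 1193
        Thread writer = new Thread(() -> {
            genstamp.incrementAndGet();
            appended.countDown();
        });
        writer.start();
        appended.await();

        // step 3: the reader's sampled genstamp is now stale, so a lookup
        // keyed on it (as with getVolume(block)) would come back null
        boolean stale = readerGenstamp != genstamp.get();
        System.out.println("reader genstamp stale: " + stale); // true
    }
}
```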



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
