hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7996) After swapping a volume, BlockReceiver reports ReplicaNotFoundException
Date Fri, 03 Apr 2015 21:17:53 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395095#comment-14395095 ]

Colin Patrick McCabe commented on HDFS-7996:
--------------------------------------------

bq. As shown in the existing code (above), the IOE is captured, so both streams and ReplicaHandler are closed if there is an IOE when closing the streams.

Thanks for the correction.  You're right that the streams will be closed (and the volume unreferenced) when an IOE is thrown.  One case that isn't handled is a RuntimeException being thrown.  Arguably, though, such exceptions should not occur in this code, so it's probably fine as-is.
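
For what it's worth, a try-with-resources form would cover the RuntimeException case as well.  A minimal, self-contained sketch of the behavior (illustration only; nothing here is the actual DataNode code):

{code}
// Minimal sketch: try-with-resources closes its resource on any throwable,
// including RuntimeException, whereas a catch limited to IOException would
// leak the resource on an unchecked exception.
public class CloseOnUncheckedDemo {
  static class Resource implements AutoCloseable {
    @Override
    public void close() {
      System.out.println("resource closed");
    }
  }

  public static void main(String[] args) {
    try (Resource r = new Resource()) {
      throw new RuntimeException("unchecked failure");
    } catch (RuntimeException e) {
      // "resource closed" has already been printed by the time we get here
      System.out.println("caught: " + e.getMessage());
    }
  }
}
{code}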
 
{code}
try (ReplicaHandler handler = BlockReceiver.this.claimReplicaHandler()) {
{code}

It's interesting that {{handler}} can be null here.  It looks like the Java 7 try-with-resources idiom handles this by silently ignoring null resources.  See https://blogs.oracle.com/darcy/entry/project_coin_null_try_with
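
A quick self-contained demo of that null-resource behavior (an illustration, not code from the patch):

{code}
// A resource that is null at runtime is simply skipped at close time,
// so exiting the block throws no NullPointerException.
public class NullResourceDemo {
  public static void main(String[] args) {
    try (AutoCloseable r = null) {
      System.out.println("body runs; the null resource is ignored on exit");
    } catch (Exception e) {
      // Required by the compiler because AutoCloseable.close() declares
      // throws Exception, but never reached for a null resource.
      System.out.println("unreachable in this example");
    }
  }
}
{code}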

Thanks, Eddy.  +1

> After swapping a volume, BlockReceiver reports ReplicaNotFoundException
> -----------------------------------------------------------------------
>
>                 Key: HDFS-7996
>                 URL: https://issues.apache.org/jira/browse/HDFS-7996
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>            Priority: Critical
>         Attachments: HDFS-7996.000.patch, HDFS-7996.001.patch, HDFS-7996.002.patch
>
>
> When removing a disk from an actively writing DataNode, the BlockReceiver working on the disk throws {{ReplicaNotFoundException}} because the replicas are removed from memory:
> {code}
> 2015-03-26 08:02:43,154 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removed volume: /data/2/dfs/dn/current
> 2015-03-26 08:02:43,163 INFO org.apache.hadoop.hdfs.server.common.Storage: Removing block level storage: /data/2/dfs/dn/current/BP-51301509-10.20.202.114-1427296597742
> 2015-03-26 08:02:43,163 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append to a non-existent replica BP-51301509-10.20.202.114-1427296597742:blk_1073742979_2160
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:615)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1362)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.finalizeBlock(BlockReceiver.java:1281)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1241)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> {{FsVolumeList#removeVolume}} waits for all threads to release the {{FsVolumeReference}} on the volume being removed; however, {{PacketResponder#finalizeBlock()}} calls
> {code}
> private void finalizeBlock(long startTime) throws IOException {
>   BlockReceiver.this.close();
>   final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime()
>       : 0;
>   block.setNumBytes(replicaInfo.getNumBytes());
>   datanode.data.finalizeBlock(block);
> {code}
> The {{FsVolumeReference}} was released in {{BlockReceiver.this.close()}} before calling {{datanode.data.finalizeBlock(block)}}.
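
Reading the excerpt above together with the {{claimReplicaHandler()}} call quoted in the comment, a minimal sketch of the ordering the patch appears to establish (pieced together from the quoted snippets as an illustration, not the literal patch):

{code}
// Illustrative sketch only: claim the ReplicaHandler (and with it the
// FsVolumeReference) before close() would release it, and hold it until
// after finalizeBlock(), so removeVolume() cannot retire the volume and
// drop the replica in between.
private void finalizeBlock(long startTime) throws IOException {
  try (ReplicaHandler handler = BlockReceiver.this.claimReplicaHandler()) {
    BlockReceiver.this.close();
    final long endTime = ClientTraceLog.isInfoEnabled() ? System.nanoTime()
        : 0;
    block.setNumBytes(replicaInfo.getNumBytes());
    // The volume is still referenced here, so the replica is still in memory.
    datanode.data.finalizeBlock(block);
  } // the handler (and the volume reference) is released here
}
{code}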



