hadoop-hdfs-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14459) ClosedChannelException silently ignored in FsVolumeList.addBlockPool()
Date Thu, 08 Aug 2019 20:54:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903322#comment-16903322
] 

Hudson commented on HDFS-14459:
-------------------------------

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17066 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17066/])
HDFS-14459. ClosedChannelException silently ignored in (weichiu: rev b0799148cf6e92be540f5665bb571418b916d789)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* (add) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestAddBlockPoolException.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/AddBlockPoolException.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeList.java


> ClosedChannelException silently ignored in FsVolumeList.addBlockPool()
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14459
>                 URL: https://issues.apache.org/jira/browse/HDFS-14459
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>             Fix For: 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: HDFS-14459.001.patch, HDFS-14459.002.patch, HDFS-14459.003.patch
>
>
> Following on from HDFS-14333, I encountered another scenario: when a volume has some sort of
disk-level error, FsVolumeList.addBlockPool() can silently fail to add the block pool to that volume.
> In the logs for a recent issue we see the following pattern:
> {code}
> 2019-04-24 04:21:27,690 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Added volume - /CDH/sdi1/dfs/dn/current, StorageType: DISK
> 2019-04-24 04:21:27,691 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Added new volume: DS-694ae931-8a4e-42d5-b2b3-d946e35c6b47
> ...
> 2019-04-24 04:21:27,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Scanning block pool BP-936404344-xxx-1426594942733 on volume /CDH/sdi1/dfs/dn/current...
> ...
> <HERE WE ARE MISSING THE LOG LIKE:
> 2019-04-24 04:21:27,722 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Time taken to scan block pool BP-936404344-xxx-1426594942733 on /CDH/sdi1/dfs/dn/current:
19ms
> >
> ...
> 2019-04-24 04:21:29,871 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Adding replicas to map for block pool BP-936404344-xxx-1426594942733 on volume /CDH/sdi1/dfs/dn/current...
> ...
> 2019-04-24 04:21:29,872 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Caught exception while adding replicas from /CDH/sdi1/dfs/dn/current. Will throw later.
> java.io.IOException: block pool BP-936404344-10.7.192.215-1426594942733 is not found
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getBlockPoolSlice(FsVolumeImpl.java:407)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:864)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
> {code}
> The notable point is that the 'scanning block pool' step must not have completed properly
for this volume, yet nothing was logged, and a somewhat confusing error then appears when the
replicas are added. That error occurs because the block pool was never added to the volume
by the addBlockPool step.
> The relevant part of the code in 'addBlockPool()' from current trunk looks like:
> {code}
>     for (final FsVolumeImpl v : volumes) {
>       Thread t = new Thread() {
>         public void run() {
>           try (FsVolumeReference ref = v.obtainReference()) {
>             FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>                 " on volume " + v + "...");
>             long startTime = Time.monotonicNow();
>             v.addBlockPool(bpid, conf);
>             long timeTaken = Time.monotonicNow() - startTime;
>             FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>                 " on " + v + ": " + timeTaken + "ms");
>           } catch (ClosedChannelException e) {
>             // ignore.
>           } catch (IOException ioe) {
>             FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>                 ". Will throw later.", ioe);
>             unhealthyDataDirs.put(v, ioe);
>           }
>         }
>       };
>       blockPoolAddingThreads.add(t);
>       t.start();
>     }
> {code}
> As we get the first log message (Scanning block pool ...) but not the second (Time
taken to scan block pool ...), and nothing is logged and no exception is thrown, the
operation must have hit a ClosedChannelException, which is silently ignored.
> I am also not sure we should ignore a ClosedChannelException at all, as it means the volume
failed to add fully. Since ClosedChannelException is a subclass of IOException, perhaps we
can remove that catch block entirely?
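> To illustrate the point (a minimal standalone sketch with hypothetical stand-in methods,
not the Hadoop source): because ClosedChannelException extends IOException, a single
IOException catch already receives it, so dropping the dedicated catch would route a
closed channel into the unhealthy-volume path like any other volume error:

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;

public class CatchOrderDemo {

    // Hypothetical stand-in for the per-volume scan in addBlockPool().
    static String scan(boolean channelClosed) {
        try {
            if (channelClosed) {
                throw new ClosedChannelException();
            }
            throw new IOException("block pool not found");
        } catch (IOException ioe) {
            // With the dedicated ClosedChannelException catch removed,
            // both failure modes land here and can be recorded as unhealthy.
            return ioe.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(scan(true));   // prints "ClosedChannelException"
        System.out.println(scan(false));  // prints "IOException"
    }
}
```

> Both failure modes reach the same handler, so neither would be silently dropped.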
> Finally, HDFS-14333 refactored the above code to let the DN better handle a disk
failure on DN startup. However, if addBlockPool() does throw an exception, getAllVolumesMap()
will never be called and the DN will end up only partly initialized.
> DataNode.initBlockPool() calls FsDatasetImpl.addBlockPool() which looks like the following,
calling addBlockPool() and then getAllVolumesMap():
> {code}
> public void addBlockPool(String bpid, Configuration conf)
>       throws IOException {
>     LOG.info("Adding block pool " + bpid);
>     try (AutoCloseableLock lock = datasetLock.acquire()) {
>       volumes.addBlockPool(bpid, conf);
>       volumeMap.initBlockPool(bpid);
>     }
>     volumes.getAllVolumesMap(bpid, volumeMap, ramDiskReplicaTracker);
>   }
> {code}
> This needs to be refactored to catch any AddBlockPoolException raised in addBlockPool(),
then still call getAllVolumesMap() before re-throwing the caught exceptions, allowing the
DN to handle the individual volume failures.
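> The refactor could be sketched like this (hypothetical stand-in methods, not the actual
FsDatasetImpl code): the exception from addBlockPool() is deferred, getAllVolumesMap()
still runs, and the failure is re-thrown afterwards:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AddBlockPoolFlow {

    // Stand-in for the aggregated per-volume failure exception.
    static class AddBlockPoolException extends IOException {
        AddBlockPoolException(String msg) { super(msg); }
    }

    static final List<String> steps = new ArrayList<>();

    // Stand-in for volumes.addBlockPool(bpid, conf), failing on one volume.
    static void volumesAddBlockPool() throws AddBlockPoolException {
        steps.add("addBlockPool");
        throw new AddBlockPoolException("one volume failed");
    }

    // Stand-in for volumes.getAllVolumesMap(bpid, volumeMap, tracker).
    static void getAllVolumesMap() {
        steps.add("getAllVolumesMap");
    }

    static void addBlockPool() throws IOException {
        AddBlockPoolException deferred = null;
        try {
            volumesAddBlockPool();
        } catch (AddBlockPoolException e) {
            deferred = e;        // remember the failure, keep initializing
        }
        getAllVolumesMap();      // always runs, so the DN is fully built up
        if (deferred != null) {
            throw deferred;      // now let the DN handle the failed volumes
        }
    }

    public static void main(String[] args) {
        try {
            addBlockPool();
        } catch (IOException e) {
            steps.add("handled");
        }
        System.out.println(String.join(",", steps));
    }
}
```

> Running main shows getAllVolumesMap() executing even though addBlockPool() failed, with
the exception still surfacing to the caller at the end.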



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


