hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14333) Datanode fails to start if any disk has errors during Namenode registration
Date Fri, 08 Mar 2019 13:19:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787876#comment-16787876 ]

Hadoop QA commented on HDFS-14333:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 28s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 48s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 53s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 53s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 1 new + 476 unchanged - 1 fixed = 477 total (was 477) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 30s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}100m 48s{color} | {color:green} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}157m 52s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | HDFS-14333 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12961703/HDFS-14333.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux bf780b000b0d 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / fb851c9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| javac | https://builds.apache.org/job/PreCommit-HDFS-Build/26436/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/26436/testReport/ |
| Max. process+thread count | 2999 (vs. ulimit of 10000) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/26436/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Datanode fails to start if any disk has errors during Namenode registration
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-14333
>                 URL: https://issues.apache.org/jira/browse/HDFS-14333
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: HADOOP-16119.poc.patch, HDFS-14333.001.patch, HDFS-14333.002.patch, HDFS-14333.003.patch, HDFS-14333.004.patch
>
>
> This is closely related to HDFS-9908, where it was reported that a datanode would fail
> to start if an IO error occurred on a single disk when running du during Datanode registration.
> That Jira was closed due to HADOOP-12973, which refactored how du is called and prevents any
> exception from being thrown. However, this problem can still occur if the volume has errors
> (e.g. permissions or filesystem corruption) when the disk is scanned to load all the replicas.
> The method chain is:
> DataNode.initBlockPool -> FsDatasetImpl.addBlockPool -> FsVolumeList.getAllVolumesMap
> -> throws an exception which goes unhandled.
> The DN logs will contain a stack trace for the problem volume, so the workaround is to
> remove the volume from the DN config and the DN will start. However, the logs are a little
> confusing, so it is not always obvious what the issue is.
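> For example, with the failing volume /data/18/dfs/dn from the logs below, the workaround is an edit to dfs.datanode.data.dir in hdfs-site.xml (the remaining paths here are only illustrative):
> {code}
> <!-- hdfs-site.xml: drop the failing volume (/data/18/dfs/dn here) from the
>      comma-separated list; the sibling paths below are illustrative examples. -->
> <property>
>   <name>dfs.datanode.data.dir</name>
>   <value>/data/17/dfs/dn,/data/19/dfs/dn</value>
> </property>
> {code}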
> These are cut-down logs from an occurrence of this issue.
> {code}
> 2019-03-01 08:58:24,830 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:27,029 WARN org.apache.hadoop.fs.CachingGetSpaceUsed: Could not get disk usage information
> ExitCodeException exitCode=1: du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir213': Permission denied
> du: cannot read directory `/data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir97/subdir25': Permission denied
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:504)
> 	at org.apache.hadoop.fs.DU$DUShell.startRefresh(DU.java:61)
> 	at org.apache.hadoop.fs.DU.refresh(DU.java:53)
> 	at org.apache.hadoop.fs.CachingGetSpaceUsed.init(CachingGetSpaceUsed.java:84)
> 	at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:166)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:145)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:881)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:412)
> ...
> 2019-03-01 08:58:27,043 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-240961797-x.x.x.x-1392827522027 on /data/18/dfs/dn/current: 2202ms
> {code}
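> For reference, the reason the scan still completed is the catch-and-log behaviour HADOOP-12973 introduced; a paraphrased sketch of DU.refresh() (not the exact Hadoop source):
> {code}
> // Paraphrased sketch: after HADOOP-12973 the du failure is logged as a
> // WARN and swallowed, so the block pool scan above completes despite the
> // unreadable subdirectories.
> @Override
> protected synchronized void refresh() {
>   try {
>     duShell.startRefresh();   // runs "du -sk <dir>"; a non-zero exit throws
>   } catch (IOException ioe) {
>     LOG.warn("Could not get disk usage information", ioe);
>   }
> }
> {code}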
> So we can see a du error occurred and was logged but not re-thrown (due to HADOOP-12973),
> and the block pool scan completed. However, in the 'add replicas to map' logic we then got
> another exception stemming from the same problem:
> {code}
> 2019-03-01 08:58:27,564 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume /data/18/dfs/dn/current...
> ...
> 2019-03-01 08:58:31,155 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while adding replicas from /data/18/dfs/dn/current. Will throw later.
> java.io.IOException: Invalid directory or I/O error occurred for dir: /data/18/dfs/dn/current/BP-240961797-x.x.x.x-1392827522027/current/finalized/subdir149/subdir215
> 	at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1167)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:445)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addToReplicasMap(BlockPoolSlice.java:448)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:342)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:861)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
> < The message "2019-03-01 08:59:00,989 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-240961797-x.x.x.x-1392827522027 on volume xxx" did not appear for this volume as it failed >
> {code}
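> The IOException above originates in FileUtil.listFiles, which treats the null returned by java.io.File.listFiles() on an unreadable directory as an error; a close paraphrase:
> {code}
> // Close paraphrase of org.apache.hadoop.fs.FileUtil.listFiles (FileUtil.java:1167
> // in the trace): File.listFiles() returns null when the directory cannot be
> // read (e.g. permission denied), and that null is surfaced as the IOException
> // seen in the log above.
> public static File[] listFiles(File dir) throws IOException {
>   File[] files = dir.listFiles();
>   if (files == null) {
>     throw new IOException("Invalid directory or I/O error occurred for dir: "
>         + dir.toString());
>   }
>   return files;
> }
> {code}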
> The exception is re-thrown, so the DN fails registration and then retries. It then finds
> all volumes already locked and exits with an 'all volumes failed' error.
> I believe we should handle the failing volume like a runtime volume failure and only
abort the DN if too many volumes have failed.
> I will post a patch for this.
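> A minimal sketch of that approach (hypothetical structure and names, simplified signatures; not the actual patch):
> {code}
> // Hypothetical sketch only: catch the per-volume IOException during the
> // replica scan, mark the volume failed, and abort the DN only when the
> // tolerated count is exceeded, mirroring dfs.datanode.failed.volumes.tolerated.
> List<FsVolumeImpl> failedVolumes = new ArrayList<>();
> for (FsVolumeImpl vol : volumes) {
>   try {
>     vol.getVolumeMap(bpid, volumeMap);     // the scan that threw above (signature simplified)
>   } catch (IOException ioe) {
>     LOG.warn("Failed to scan volume " + vol + "; treating it as failed", ioe);
>     failedVolumes.add(vol);                // same handling as a runtime volume failure
>   }
> }
> if (failedVolumes.size() > maxVolumeFailuresTolerated) {
>   throw new IOException("Too many failed volumes: " + failedVolumes.size());
> }
> {code}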



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


