hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: All region server died due to "Parent directory doesn't exist"
Date Wed, 19 Feb 2014 03:26:49 GMT
Was a region server running on 10.147.1.168?

If so, can you pastebin the region server log from around 2014-02-18 01:17:31?

Thanks


On Tue, Feb 18, 2014 at 9:17 PM, takeshi <takeshi.miao@gmail.com> wrote:

> Hi, all
>
> We also encountered what looks like the same issue on our POC cluster. We
> have not found the root cause yet either, but would like to share some
> findings here that may help others' investigation.
>
> Environment:
> We are using cdh-4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro,
> with some bugfixes ported from other Apache versions, running on 14
> machines with CentOS-5.3.
>
> The error messages we found for the issue:
>
> The HBase master reports a "Parent directory doesn't exist" error
> {code:title=hbase-master.log}
> [2014-02-18 01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot directory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
> [2014-02-18 01:20:36,597][IPC Server handler 47 on 8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 reported a fatal error:
> ABORTING region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE in log roller
> Cause:
> java.io.IOException: Exception in createWriter
>         at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: cannot get log writer
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
>         ... 4 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
>
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
>         ... 5 more
> {code}
>
> The namenode created "/<hbase-root>/.logs/<regionserver>" on a mkdirs
> request from principal "hbase/<fqdn>@LAB", but later removed
> "/<hbase-root>/.logs" on a delete request from "hdfs/<fqdn>@LAB"!
> {code:title=namenode.log (with hdfs audit enabled)}
> 2014-02-18 01:08:38,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS)  ip=/10.147.1.168  cmd=mkdirs  src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776  dst=null  perm=hbase:supergroup:rwxr-xr-x
> ...
> 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS)  ip=/10.147.1.168  cmd=delete  src=/user/SPN-hbase/.logs  dst=null  perm=null
> ...
> 2014-02-18 01:19:10,132 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> 2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.147.1.168:35703: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> {code}
>
> Way to reproduce:
> Currently we have a way to reproduce this issue: run ycsb workloads [a-f]
> against HBase while, on the other side, simply executing
> "service <hadoop,hbase-daemons> start". We run both sides 10 times in a
> loop per round, and the issue can usually be reproduced in the first or
> second round. But I am not sure whether it is related to the issue
> discussed at the beginning of this thread.
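>
> For reference, the loop looks roughly like the sketch below. It is a
> dry run (RUN=echo prints each command instead of executing it), and the
> ycsb invocation, workload paths, and service names are illustrative
> assumptions, not our exact scripts.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction loop: YCSB workloads a-f run
# against HBase while, on the other side, the hadoop/hbase daemons are
# (re)started; both sides loop 10 times per round.
# RUN=echo makes this a dry run -- remove it to execute on a cluster.
RUN=echo
for i in $(seq 1 10); do
  for w in a b c d e f; do
    # hypothetical YCSB invocation; real workload paths differ per install
    $RUN ycsb run hbase -P "workloads/workload${w}"
  done
  # restart the daemons while the workloads are running
  for svc in hadoop hbase-daemons; do
    $RUN service "$svc" start
  done
done
```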
>
>
> Follow-ups:
> Now I am digging into the odd audit message that appeared in namenode.log
> {code}
> 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true  ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS)  ip=/10.147.1.168  cmd=delete  src=/user/SPN-hbase/.logs  dst=null  perm=null
> {code}
> to see what more I can dig out.
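>
> One way to confirm which principal removed the WAL directory is to
> filter the audit log for delete events touching the .logs path. A
> minimal sketch, using abbreviated copies of the two audit lines quoted
> above as sample input; the sample file path is arbitrary, and on a real
> cluster the input would be the namenode's hdfs audit log.

```shell
# Sample audit lines (abbreviated from the namenode.log excerpt above),
# written to a temp file so the filter can be demonstrated standalone.
cat > /tmp/hdfs-audit-sample.log <<'EOF'
2014-02-18 01:08:38,203 INFO FSNamesystem.audit: allowed=true ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=mkdirs src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 dst=null perm=hbase:supergroup:rwxr-xr-x
2014-02-18 01:17:31,189 INFO FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
EOF

# Print timestamp, principal, command, and path for every delete event
# that touches the HBase WAL directory (.logs).
awk '/cmd=delete/ && /\/\.logs/ {
  out = $1 " " $2
  for (i = 3; i <= NF; i++)
    if ($i ~ /^(ugi=|cmd=|src=)/) out = out " " $i
  print out
}' /tmp/hdfs-audit-sample.log
```

> For the sample above this prints only the 01:17:31 delete of
> /user/SPN-hbase/.logs, attributed to the hdfs principal, which is
> exactly the attribution we want to verify against the full log.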
>
>
> I'll keep updating this thread with any findings, and any opinions are
> welcome. Thanks.
>
>
> Best regards
>
> takeshi
>
