hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From takeshi <takeshi.m...@gmail.com>
Subject Re: All region server died due to "Parent directory doesn't exist"
Date Wed, 19 Feb 2014 03:17:55 GMT
Hi, all

We also encounter the issue on our POC cluster which seems like the issue
yo were talking about, we also still not find the root cause yet, but would
like to share some findings here which may help others' study.

Environment:
We are using cdh-4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro.
with some bugfixes porting from other apache versions. running on 14
machines with OS CentOS-5.3.

The error msg we found for the issue

HBase master reports "Parent directory doesn't exist" error
{code:title=hbase-master.log}
334 [2014-02-18
01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN
][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot
di     rectory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
335 [2014-02-18 01:20:36,597][IPC Server handler 47 on
8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server
^@^@aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
reported a      fatal error:
336 ABORTING region server
aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE
in log roller
337 Cause:
338 java.io.IOException: Exception in createWriter
339         at
org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
340         at
org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
341         at
org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
342         at
org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
343         at java.lang.Thread.run(Thread.java:662)
344 Caused by: java.io.IOException: cannot get log writer
345         at
org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
346         at
org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
347         ... 4 more
348 Caused by: java.io.IOException: java.io.FileNotFoundException: Parent
directory doesn't exist:
/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
349         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
350         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
351         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
352         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
353         at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
354         at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
355         at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
356         at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
357         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
358         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
359         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
360         at java.security.AccessController.doPrivileged(Native Method)
361         at javax.security.auth.Subject.doAs(Subject.java:396)
362         at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
363         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
364
365         at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
366         at
org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
367         ... 5 more
{code}

namenode mkdirs for "/<hbase-root>/.logs/<regionserver>" requested from
principal "hbase/<fqdn>@LAB" , but remove the "/<hbase-root>/.logs" later
by requested from "hdfs/<fqdn>@LAB" !
{code:title=namenode.log (with hdfs audit enabled)}
624 2014-02-18 01:08:38,203 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true
ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS)
ip=/
10.147.1.168              cmd=mkdirs
src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
dst=null        perm=hbase:supergroup:rwxr-xr-x
...
1068 2014-02-18 01:17:31,189 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true
ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS)
 ip=/
10.147.1.168              cmd=delete      src=/user/SPN-hbase/.logs
dst=null        perm=null
...
1343 2014-02-18 01:19:10,132 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS)
cause:org.apache
.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.inter
nal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder
DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
1344 2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 83 on 8020, call
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from
10.147.1.168:35703: error: org.apache.hadoop.
hdfs.server.namenode.LeaseExpiredException: No lease on
/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C81
20%2C1392703714776.1392703718212 File does not exist. Holder
DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
1345 org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on
/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/
aws-scottm-tmh6-1.ap-southeast-1.com
pute.internal%2C8120%2C1392703714776.1392703718212 File does not exist.
Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
1346         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
1347         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
1348         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
1349         at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
1350         at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
1351         at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
1352         at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
1353         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
1354         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
1355         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
1356         at java.security.AccessController.doPrivileged(Native Method)
1357         at javax.security.auth.Subject.doAs(Subject.java:396)
1358         at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
1359         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
{code}

Way to reproduce:
Currently, we have a way to reproduce this issue, which is running ycsb
workload[a-f] to HBase, and the other hand, we simply to execute the
service <hadoop,hbase-daemons> start. Then we keep to running 10 times in a
loop on both side for one round, it could usually be reproduced on first or
second round. But I am not sure whether it is related to the issue in the
beginning of this discussion thread talked about.


Follow-ups:
Now I am following the weird msg appeared in the namenode.log
{code}
2014-02-18 01:17:31,189 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true
ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS)
 ip=/
10.147.1.168              cmd=delete      src=/user/SPN-hbase/.logs
dst=null        perm=null
{code}
To see what can I dig out more...


I'll keep to update this thread if any finding, and welcome any opinion, tks


Best regards

takeshi

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message