hbase-dev mailing list archives

From Zhoushuaifeng <zhoushuaif...@huawei.com>
Subject RE: TestRollingRestart fail occasionally
Date Tue, 16 Aug 2011 11:49:17 GMT
My Hadoop core is 0.20-append-r1056497.
Should I open an issue and attach more logs for analysis?

Zhou Shuaifeng(Frank)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, August 16, 2011 12:13 PM
To: dev@hbase.apache.org
Subject: Re: TestRollingRestart fail occasionally

If you update the hadoop that hbase ships with to the tip of
branch-0.20-append, does it still fail?  The tip has hdfs-1554,
whereas what hbase ships with does not.


On Tue, Aug 9, 2011 at 7:09 PM, Zhoushuaifeng <zhoushuaifeng@huawei.com> wrote:
> Hi,
> I ran TestRollingRestart (0.90.3), and it fails occasionally.  The failure log shows that
> log splitting goes into a loop: recoverFileLease fails, the while() loop never ends, and the
> test times out and fails.
> Here are some of the logs:
> After the primary master restarted, not all the RSs had connected before the master stopped waiting:
> TRR: Restarting primary master
> INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
> 2011-07-06 09:12:56,331 INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on regionserver(s) count to settle; currently=3
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] master.ServerManager(648): Finished waiting for regionserver count to settle; count=3, sleptFor=4500
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977] master.ServerManager(674): Exiting wait on regionserver(s) to checkin; count=3, stopped=false, count of regions out on
> 2011-07-06 09:12:57,834 INFO  [Master:0;linux1.site:35977] master.MasterFileSystem(180): Log folder hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108 doesn't belong to a known region server, splitting
> But after the master started splitting, another RS connected:
> 2011-07-06 09:13:54,243 INFO  [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1456): Attempting connect to Master server at linux1.site:35977
> 2011-07-06 09:13:54,243 INFO  [RegionServer:3;linux1.site,54949,1309914772108] regionserver.HRegionServer(1475): Connected to master at linux1.site:35977
> Then lease recovery during log splitting can hit AlreadyBeingCreatedException and log this:
> 2011-07-06 09:13:57,929 WARN  [Master:0;linux1.site:35977] util.FSUtils(715): Waited 60087ms for lease recovery on hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175 for DFSClient_hb_m_linux1.site:35977_1309914773252 on client, because this file is already being created by DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161
>                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
>                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
>                at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
>                at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
>                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>                at java.lang.reflect.Method.invoke(Method.java:597)
>                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>                at java.security.AccessController.doPrivileged(Native Method)
>                at javax.security.auth.Subject.doAs(Subject.java:396)
>                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
> This message repeats for about 14 minutes, and then the test fails.
> The test may fail occasionally because the master's default wait time is 4500ms.  Usually
> that is enough for all the RSs to check in, but sometimes it is not, and an RS that checks in
> late may disturb the lease recovery.
> This may be a bug, and it may be related to HBASE-4177.
> Zhou Shuaifeng(Frank)
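
To make the reported failure mode concrete: the quoted message describes a lease-recovery loop that never exits while another client still holds the file. The sketch below is a generic illustration in Python, not HBase's FSUtils code. All names in it (`LeaseStillHeld`, `recover_lease_bounded`, `try_recover`) are hypothetical. It shows only the general technique of bounding such a retry loop with a deadline so that the caller gets a failure back instead of spinning until the test times out.

```python
import time


class LeaseStillHeld(Exception):
    """Hypothetical stand-in for an 'already being created' style error."""


def recover_lease_bounded(try_recover, timeout_s, interval_s,
                          clock=time.monotonic, sleep=time.sleep):
    """Call try_recover() until it succeeds or timeout_s has elapsed.

    Returns (recovered, attempts). try_recover is assumed to raise
    LeaseStillHeld while another client still holds the lease.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            try_recover()
            return True, attempts
        except LeaseStillHeld:
            # Give up once the deadline has passed instead of looping forever.
            if clock() >= deadline:
                return False, attempts
            sleep(interval_s)
```

A caller that gets `(False, attempts)` back can then decide what to do (fail the split, escalate, or retry later) rather than blocking indefinitely. Whether this is the right fix for the behaviour reported here is not established in this thread.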
