hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: TestRollingRestart fail occasionally
Date Thu, 11 Aug 2011 01:52:31 GMT
0.90.4 was just released. Can you run TestRollingRestart using latest 0.90
branch ?

This test didn't fail on
https://builds.apache.org/view/G-L/view/HBase/job/hbase-0.90/

Cheers

On Wed, Aug 10, 2011 at 6:21 PM, Zhoushuaifeng <zhoushuaifeng@huawei.com>wrote:

> Is there anyone else do this test and encounter the failure? And what's
> your opinion?
>
> Zhou Shuaifeng(Frank)
>
> -----Original Message-----
> From: Zhoushuaifeng [mailto:zhoushuaifeng@huawei.com]
> Sent: Wednesday, August 10, 2011 10:09 AM
> To: dev@hbase.apache.org
> Subject: TestRollingRestart fail occasionally
>
> Hi,
> I run TestRollingRestart(0.90.3), it fails occasionally.  The failing log
> shows that split log runs in to a circle, the recoverFileLease fail and the
> while() never end and the test timeout and fail.
>
> Here are some of the logs:
> After restarting primary master, not all the RSs connected before the
> master stop waiting:
> TRR: Restarting primary master
> INFO  [Master:0;linux1.site:35977] master.ServerManager(660): Waiting on
> regionserver(s) count to settle; currently=3
> 2011-07-06 09:12:56,331 INFO  [Master:0;linux1.site:35977]
> master.ServerManager(660): Waiting on regionserver(s) count to settle;
> currently=3
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977]
> master.ServerManager(648): Finished waiting for regionserver count to
> settle; count=3, sleptFor=4500
> 2011-07-06 09:12:57,831 INFO  [Master:0;linux1.site:35977]
> master.ServerManager(674): Exiting wait on regionserver(s) to checkin;
> count=3, stopped=false, count of regions out on cluster=22
> 2011-07-06 09:12:57,834 INFO  [Master:0;linux1.site:35977]
> master.MasterFileSystem(180): Log folder
> hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108
> doesn't belong to a known region server, splitting
>
> But after master starting split, another RS connected:
> 2011-07-06 09:13:54,243 INFO
>  [RegionServer:3;linux1.site,54949,1309914772108]
> regionserver.HRegionServer(1456): Attempting connect to Master server at
> linux1.site:35977
> 2011-07-06 09:13:54,243 INFO
>  [RegionServer:3;linux1.site,54949,1309914772108]
> regionserver.HRegionServer(1475): Connected to master at linux1.site:35977
>
> Then, split log recover lease may  encounter AlreadyBeingCreatedException
> and show this log:
> 2011-07-06 09:13:57,929 WARN  [Master:0;linux1.site:35977]
> util.FSUtils(715): Waited 60087ms for lease recovery on
> hdfs://localhost:41078/user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file
> /user/root/.logs/linux1.site,54949,1309914772108/linux1.site%3A54949.1309914772175
> for DFSClient_hb_m_linux1.site:35977_1309914773252 on client 127.0.0.1,
> because this file is already being created by
> DFSClient_hb_rs_linux1.site,54949,1309914772108_1309914772161 on 127.0.0.1
>                at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
>                at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
>                at
> org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
>                at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown
> Source)
>                at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>                at java.lang.reflect.Method.invoke(Method.java:597)
>                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>                at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
>                at
> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
>                at java.security.AccessController.doPrivileged(Native
> Method)
>                at javax.security.auth.Subject.doAs(Subject.java:396)
>                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
>
> This log shows continuing about 14 minutes and test fail.
>
> This test fail occasionally may because the master default waiting time is
> 4500ms, usually it's enouth for all the RS to check in, but some times it's
> not, and the RS check in later may disturb the recover lease.
> This may be a bug, And may have some relation to HBASE-4177.
>
>
> Zhou Shuaifeng(Frank)
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message