hbase-user mailing list archives

From Alex Romanovsky <marginal.sum...@gmail.com>
Subject Re: [CDH3U0] Cluster not processing region server failover
Date Thu, 28 Apr 2011 10:24:01 GMT
Thanks a lot for your help, Jean-Daniel!

It was a reverse DNS lookup issue - we recently changed our default
domain suffix.
I noticed that by looking up the server name from the "No HServerInfo
found" message through the list returned by
admin.getClusterStatus().getServerInfo().
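
For reference, the check was roughly the following (a minimal sketch
against the 0.90 client API; only the getClusterStatus()/getServerInfo()
calls are from my actual code, the rest is boilerplate):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HServerInfo;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListLiveServers {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        ClusterStatus status = admin.getClusterStatus();
        // Each HServerInfo carries the host name the master resolved
        // for that region server -- comparing these names against the
        // "No HServerInfo found" message exposed the stale DNS entry.
        for (HServerInfo info : status.getServerInfo()) {
            System.out.println(info.getServerName());
        }
    }
}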

I'll drop the DNS cache on every cluster host now and restart the cluster.
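
To verify the fix on each host afterwards, a quick forward-then-reverse
lookup check along these lines should do (a sketch; "regionserver1"
stands in for each cluster node here):

import java.net.InetAddress;

public class CheckReverseDns {
    public static void main(String[] args) throws Exception {
        // Forward lookup of the host name, then reverse lookup of the
        // resulting address; the two should agree once the stale DNS
        // cache entries are gone.
        InetAddress addr = InetAddress.getByName("regionserver1");
        System.out.println("address: " + addr.getHostAddress());
        System.out.println("reverse: " + addr.getCanonicalHostName());
    }
}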

WBR,
Alex Romanovsky

P.S.
> It did, why?

My first message didn't appear on the list for a long time, and I
thought that might be because I had sent it before I actually
subscribed to the list. Really sorry for the inconvenience.

On 4/27/11, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> Hi Alex,
>
> Before answering I made sure it was working for me, and it does. In
> your master log after killing the -ROOT- region server you should see
> lines like this:
>
> INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
> RegionServer ephemeral node deleted, processing expiration
> [servername]
> DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=servername
> to dead servers, submitted shutdown handler to be executed, root=true,
> meta=false
> ...
> INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting
> ROOT region location in ZooKeeper
> ...
> DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
> region -ROOT-,,0.70236052 to servername
> ...
>
> Then when killing the .META. region server you would have some
> equivalent lines such as:
>
> DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=servername
> to dead servers, submitted shutdown handler to be executed,
> root=false, meta=true
> ...
> DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
> region .META.,,1.1028785192 to servername
>
> If it doesn't show, then there might be some other issue. Other comments
> inline.
>
> J-D
>
> On Wed, Apr 27, 2011 at 4:03 AM, Alex Romanovsky
> <marginal.summer@gmail.com> wrote:
>> Hi,
>>
>> I am testing failover scenarios on a small 3-node fully-distributed
>> cluster with the following topology:
>> - master node - NameNode, JobTracker, QuorumPeerMain, HMaster;
>> - slave nodes - DataNode, TaskTracker, QuorumPeerMain, HRegionServer.
>>
>> -ROOT- and .META. are initially served by two different nodes.
>>
>> I create table 'incr' with a single column family 'value', then put
>> 'incr', '00000000', 'value:main', '00000000' to get an 8-byte counter
>> cell whose content is still human-readable, and start calling
>>
>> $ incr 'incr', '00000000', 'value:main', 1
>>
>> once every second or two. Then I kill -9 one of my region servers, the
>> one that serves 'incr'.
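>>
>> The shell loop is equivalent to this Java client sketch (0.90 API,
>> same table and column family; illustrative only):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class IncrLoop {
>>     public static void main(String[] args) throws Exception {
>>         Configuration conf = HBaseConfiguration.create();
>>         HTable table = new HTable(conf, "incr");
>>         while (true) {
>>             // Atomic server-side increment of the 8-byte counter.
>>             long v = table.incrementColumnValue(
>>                 Bytes.toBytes("00000000"), Bytes.toBytes("value"),
>>                 Bytes.toBytes("main"), 1);
>>             System.out.println("counter = " + v);
>>             Thread.sleep(1000);
>>         }
>>     }
>> }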
>>
>> The subsequent shell incr times out. I terminate it with Ctrl-C,
>> launch the hbase shell again and repeat the command, getting the following
>> message repeated several times:
>>
>> 11/04/27 13:57:43 INFO ipc.HBaseRPC: Server at
>> regionserver1/10.50.3.68:60020 could not be reached after 1 tries,
>> giving up.
>
> That's somewhat expected; the shell is configured not to retry much,
> so the regions might not have been reassigned yet.
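>
> If you want a client that rides out the reassignment, raise the retry
> count in its configuration -- a minimal sketch using the standard
> client property hbase.client.retries.number:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
>
> public class PatientClient {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         // More retries keeps the client going while the master
>         // reassigns the dead server's regions.
>         conf.setInt("hbase.client.retries.number", 10);
>         HTable table = new HTable(conf, "incr");
>         System.out.println("regions: " + table.getRegionsInfo().size());
>     }
> }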
>
>>
>> Tailing the master log yields the following diagnostic:
>>
>> 2011-04-27 14:08:32,982 INFO
>> org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance
>> in 0ms. Moving 1 regions off of 1 overloaded servers onto 1 less
>> loaded servers
>> 2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.HMaster:
>> balance hri=incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.,
>> src=regionserver1,60020,1303895356068,
>> dest=regionserver2,60020,1303898049443
>> 2011-04-27 14:08:32,982 DEBUG
>> org.apache.hadoop.hbase.master.AssignmentManager: Starting
>> unassignment of region
>> incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. (offlining)
>> 2011-04-27 14:08:32,982 DEBUG
>> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to
>> unassign region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.
>> but it is not currently assigned anywhere
>
> That's 11 minutes after you killed the region server, right? Anything
> else after 13:57:43?
>
>>
>> hbase hbck finds 2 inconsistencies (regionserver1 down, region not
>> served). hbase hbck -fix reports 2 initial and 1 eventual
>> inconsistency, migrating the region to a live region server.
>
> How long after you killed the RS did you run this? Was anything shown
> in the master log (like repeating lines) before that? If so, what?
>
>> However,
>> when I repeat the test with regionserver2 and regionserver1 swapped
>> (i.e. kill -9 the region server process on regionserver2, the initial
>> evacuation target), hbase hbck -fix throws
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>> setting up proxy interface
>> org.apache.hadoop.hbase.ipc.HRegionInterface to
>> regionserver2/10.50.3.68:60020 after attempts=1
>>       at
>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1008)
>>       at
>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:172)
>>       at
>> org.apache.hadoop.hbase.util.HBaseFsck.getMetaEntries(HBaseFsck.java:746)
>>       at org.apache.hadoop.hbase.util.HBaseFsck.doWork(HBaseFsck.java:133)
>>       at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:989)
>>
>
> So it seems that when you ran hbck the region server wasn't detected
> as dead yet because hbck tried to connect to it.
>
>> zookeeper.session.timeout is set to 1000 ms (i.e. 1 second), and the
>> configuration is consistent across the cluster, so that is not the
>> cause.
>
> I won't rule that out until you prove to us that it's the case. In your
> logs you should have a line like this after starting a region server:
>
> INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
> on server zk_server:2181, sessionid = 0xsome_hex, negotiated timeout =
> 1000
>
> If not, then review that configuration.
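>
> A quick way to see what the client-side config will request (a
> sketch; the ZK server may still clamp the negotiated value to its
> tickTime bounds, which is why the ClientCnxn line is the real proof):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> public class ShowZkTimeout {
>     public static void main(String[] args) {
>         Configuration conf = HBaseConfiguration.create();
>         // Prints the requested session timeout (ms) as read from the
>         // hbase-site.xml on this host's classpath.
>         System.out.println(conf.get("zookeeper.session.timeout"));
>     }
> }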
>
>>
>> Manual region reassignment also helps the first time, and only the
>> first time. Subsequent retries leave the 'incr' regions not assigned
>> anywhere, and I cannot even query table regions on the client since
>> HTable instances fail to connect.
>>
>> As soon as I restart the killed region server, cluster operation resumes.
>> However, as far as I understand the HBase book, this is not the
>> intended behavior. The cluster should automatically evacuate regions
>> from dead region servers to known alive ones.
>
> It really seems like the region server was never considered dead. The
> log should tell.
>
>>
>> I run the cluster on RH 5, Sun JDK 1.6.0_24.
>> JAVA_HOME=/usr/java/jdk1.6.0_24 in hadoop-env.sh (I wonder whether I
>> should duplicate the assignment in hbase-env.sh).
>> Is this one of the issues known to be fixed in 0.90.2 or later
>> releases? I grepped Jira and found no matching issues; the failover
>> scenarios mentioned there are far more complex.
>> What other logs or config files shall I check and/or post here?
>
> AFAIK this is not a known issue, and it works well for us. Feel free
> to pastebin the full logs.
>
>>
>> Reg.,
>> Alex Romanovsky
>> (this message might appear as a duplicate; I apologize if it does)
>
> It did, why?
>
