hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: [CDH3U0] Cluster not processing region server failover
Date Thu, 28 Apr 2011 19:05:44 GMT
Happy that you could figure it out quickly, and even happier that you
wrote back to the list with the details.

Thanks!

J-D

On Thu, Apr 28, 2011 at 3:24 AM, Alex Romanovsky
<marginal.summer@gmail.com> wrote:
> Thank you very much for your help, Jean!
>
> It was a reverse DNS lookup issue - we recently changed our default
> domain suffix.
> I noticed that by looking up the server name from the "No HServerInfo
> found" message through the list returned by
> admin.getClusterStatus().getServerInfo().
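>
> For reference, that check boils down to something like the following
> Java sketch (the class name is made up, and it assumes the client picks
> up the cluster's hbase-site.xml from its classpath):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.ClusterStatus;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.HServerInfo;
> import org.apache.hadoop.hbase.client.HBaseAdmin;
>
> public class ListServers {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HBaseAdmin admin = new HBaseAdmin(conf);
>     ClusterStatus status = admin.getClusterStatus();
>     // Print every live region server as the master sees it; the host
>     // part is what has to resolve consistently through DNS.
>     for (HServerInfo info : status.getServerInfo()) {
>       System.out.println(info.getServerName());
>     }
>   }
> }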
>
> I'll drop the DNS cache on every cluster host now and restart the cluster.
>
> WBR,
> Alex Romanovsky
>
> P.S.
>> It did, why?
>
> My first message didn't appear on the list for a long time, and I
> thought that could be because I had sent it before I actually
> subscribed to the list. Really sorry for the inconvenience.
>
> On 4/27/11, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> Hi Alex,
>>
>> Before answering I made sure it was working for me and it does. In
>> your master log after killing the -ROOT- region server you should see
>> lines like this:
>>
>> INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
>> RegionServer ephemeral node deleted, processing expiration
>> [servername]
>> DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=servername
>> to dead servers, submitted shutdown handler to be executed, root=true,
>> meta=false
>> ...
>> INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting
>> ROOT region location in ZooKeeper
>> ...
>> DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
>> region -ROOT-,,0.70236052 to servername
>> ...
>>
>> Then when killing the .META. region server you would have some
>> equivalent lines such as:
>>
>> DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=servername
>> to dead servers, submitted shutdown handler to be executed,
>> root=false, meta=true
>> ...
>> DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
>> region .META.,,1.1028785192 to servername
>>
>> If it doesn't show, then there might be some other issue. Other comments
>> inline.
>>
>> J-D
>>
>> On Wed, Apr 27, 2011 at 4:03 AM, Alex Romanovsky
>> <marginal.summer@gmail.com> wrote:
>>> Hi,
>>>
>>> I am trying failover cases on a small 3-node fully-distributed cluster
>>> of the following topology:
>>> - master node - NameNode, JobTracker, QuorumPeerMain, HMaster;
>>> - slave nodes - DataNode, TaskTracker, QuorumPeerMain, HRegionServer.
>>>
>>> ROOT and META are initially served by two different nodes.
>>>
>>> I create table 'incr' with a single column family 'value', put 'incr',
>>> '00000000', 'value:main', '00000000' to get an 8-byte counter cell
>>> whose content is still human-readable, then start calling
>>>
>>> $ incr 'incr', '00000000', 'value:main', 1
>>>
>>> once every second or two. Then I kill -9 one of my region servers,
>>> the one that serves 'incr'.
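>>>
>>> (For completeness, the equivalent of that shell incr from a plain Java
>>> client is roughly the sketch below; the class name is made up and the
>>> once-a-second loop is left out:)
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.client.HTable;
>>> import org.apache.hadoop.hbase.util.Bytes;
>>>
>>> public class IncrOnce {
>>>   public static void main(String[] args) throws Exception {
>>>     Configuration conf = HBaseConfiguration.create();
>>>     HTable table = new HTable(conf, "incr");
>>>     // Same operation as the shell command: bump value:main on row
>>>     // 00000000 by 1 and return the new counter value.
>>>     long current = table.incrementColumnValue(
>>>         Bytes.toBytes("00000000"), Bytes.toBytes("value"),
>>>         Bytes.toBytes("main"), 1L);
>>>     System.out.println("counter is now " + current);
>>>     table.close();
>>>   }
>>> }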
>>>
>>> The subsequent shell incr times out. I terminate it with Ctrl-C,
>>> launch hbase-shell again and repeat the command, getting the following
>>> message repeated several times:
>>>
>>> 11/04/27 13:57:43 INFO ipc.HBaseRPC: Server at
>>> regionserver1/10.50.3.68:60020 could not be reached after 1 tries,
>>> giving up.
>>
>> That's somewhat expected; the shell is configured not to retry much,
>> so the regions might not have been reassigned yet.
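>>
>> If you want a client that rides out the reassignment instead of giving
>> up, you can give it more retries in its Configuration. A rough sketch
>> (the numbers are arbitrary and the class name is made up):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.HTable;
>>
>> public class PatientClient {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     // More retries and a longer pause between them so operations
>>     // survive the window where the region is being reassigned.
>>     conf.setInt("hbase.client.retries.number", 10);
>>     conf.setLong("hbase.client.pause", 2000);
>>     HTable table = new HTable(conf, "incr");
>>     // ... increments issued through this table will now retry longer.
>>     table.close();
>>   }
>> }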
>>
>>>
>>> tail master log yields the following diagnostic:
>>>
>>> 2011-04-27 14:08:32,982 INFO
>>> org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance
>>> in 0ms. Moving 1 regions off of 1 overloaded servers onto 1 less
>>> loaded servers
>>> 2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.HMaster:
>>> balance hri=incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.,
>>> src=regionserver1,60020,1303895356068,
>>> dest=regionserver2,60020,1303898049443
>>> 2011-04-27 14:08:32,982 DEBUG
>>> org.apache.hadoop.hbase.master.AssignmentManager: Starting
>>> unassignment of region
>>> incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. (offlining)
>>> 2011-04-27 14:08:32,982 DEBUG
>>> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to
>>> unassign region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7.
>>> but it is not currently assigned anywhere
>>
>> That's 11 minutes after you killed the region server, right? Anything
>> else after 13:57:43?
>>
>>>
>>> hbase hbck finds 2 inconsistencies (regionserver1 down, region not
>>> served). hbase hbck -fix reports 2 initial and 1 eventual
>>> inconsistency, migrating the region to a live region server.
>>
>> How long after you killed the RS did you run this? Was anything shown
>> in the master log (like repeating lines) before that? If so, what?
>>
>>> However,
>>> when I repeat the test with regionserver2 and regionserver1 swapped
>>> (i.e. kill -9 the region server process on regionserver2, the initial
>>> evacuation target), hbase hbck -fix throws
>>>
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> setting up proxy interface
>>> org.apache.hadoop.hbase.ipc.HRegionInterface to
>>> regionserver2/10.50.3.68:60020 after attempts=1
>>>       at
>>> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1008)
>>>       at
>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:172)
>>>       at
>>> org.apache.hadoop.hbase.util.HBaseFsck.getMetaEntries(HBaseFsck.java:746)
>>>       at org.apache.hadoop.hbase.util.HBaseFsck.doWork(HBaseFsck.java:133)
>>>       at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:989)
>>>
>>
>> So it seems that when you ran hbck, the region server hadn't been
>> detected as dead yet, since hbck still tried to connect to it.
>>
>>> zookeeper.session.timeout is set to 1000 ms (i.e. 1 second), and the
>>> configuration is consistent across the cluster, so these are not the
>>> causes.
>>
>> I wouldn't rule that out until you show us that it's the case. In your
>> logs you should have a line like this after a region server starts:
>>
>> INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
>> on server zk_server:2181, sessionid = 0xsome_hex, negotiated timeout =
>> 1000
>>
>> If not, then review that configuration.
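>>
>> One quick way to see what a node's configuration actually resolves to
>> is a small Java check like the sketch below (run it with the same conf
>> directory on the classpath as the region server; the class name is made
>> up, and 180000 ms is the shipped default). Keep in mind the ZooKeeper
>> server can still clamp the value to its own min/max session timeouts,
>> which is why the negotiated timeout in the log is what counts:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>
>> public class CheckZkTimeout {
>>   public static void main(String[] args) {
>>     Configuration conf = HBaseConfiguration.create();
>>     // Prints what hbase-site.xml/hbase-default.xml resolve to on this
>>     // node, not the value actually negotiated with ZooKeeper.
>>     System.out.println("zookeeper.session.timeout = "
>>         + conf.getInt("zookeeper.session.timeout", 180000));
>>   }
>> }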
>>
>>>
>>> Manual region reassignment also helps, but only the first time.
>>> Subsequent retries leave the 'incr' regions not assigned anywhere, and
>>> I cannot even query table regions from the client, since HTable
>>> instances fail to connect.
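>>>
>>> (By "query table regions" I mean something like this sketch, which
>>> lists each region of 'incr' and the server recorded for it in .META.;
>>> the class name is made up:)
>>>
>>> import java.util.Map;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.HRegionInfo;
>>> import org.apache.hadoop.hbase.HServerAddress;
>>> import org.apache.hadoop.hbase.client.HTable;
>>>
>>> public class ShowRegions {
>>>   public static void main(String[] args) throws Exception {
>>>     Configuration conf = HBaseConfiguration.create();
>>>     HTable table = new HTable(conf, "incr");
>>>     // Region name -> address of the region server believed to hold it.
>>>     Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
>>>     for (Map.Entry<HRegionInfo, HServerAddress> e : regions.entrySet()) {
>>>       System.out.println(e.getKey().getRegionNameAsString()
>>>           + " -> " + e.getValue());
>>>     }
>>>     table.close();
>>>   }
>>> }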
>>>
>>> As soon as I restart the killed region server, cluster operation
>>> resumes. However, as far as I understand the HBase book, this is not
>>> the intended behavior: the cluster should automatically evacuate
>>> regions from dead region servers to the servers that are still alive.
>>
>> It really seems like the region server was never considered dead. The
>> log should tell.
>>
>>>
>>> I run the cluster on RH 5 with Sun JDK 1.6.0_24.
>>> JAVA_HOME=/usr/java/jdk1.6.0_24 is set in hadoop-env.sh (I wonder
>>> whether I should duplicate the assignment in hbase-env.sh).
>>> Is this one of the issues known to be fixed in 0.90.2 or later
>>> releases? I grepped Jira and found no matching issues; the failover
>>> scenarios described there are far more complex.
>>> What other logs or config files shall I check and/or post here?
>>
>> AFAIK this is not a known issue, and it works well for us. Feel free
>> to pastebin the whole logs.
>>
>>>
>>> Reg.,
>>> Alex Romanovsky
>>> (this message might appear as a duplicate; I apologize if it does)
>>
>> It did, why?
>>
>
