hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Hbase Master Failover Issue
Date Sat, 14 May 2011 22:04:58 GMT
What did you do to solve it?
Thanks,
St.Ack

On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Ok i think the issue is largely solved. Thanks for your help, guys.
>
> -d
>
> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> ok, the problem seems to be multi-NIC hosting on the masters. The HBase
>> master starts up and listens on the canonical hostname, which points
>> to the wrong NIC. I am not sure why, so I am not changing that, but I am
>> struggling to override it at the moment as nothing seems to work
>> (master.dns.interface=eth2, master.dns.server=ip2 ... tried all
>> possible combinations). It probably has something to do with reverse
>> lookup, so I added an entry to the hosts file, to no avail so far. I will
>> have to talk to our admins to see why we can't switch the canonical host
>> name to the IP that all the nodes are supposed to use.
>>
>> thanks.
>> -d
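
For anyone hitting the same multi-NIC symptom, here is a minimal hbase-site.xml sketch for pinning the hostname the master resolves and advertises to a specific interface. The interface name eth2 and the nameserver address 10.0.0.2 are placeholders, and as far as I can tell the property names HBase reads are hbase.master.dns.interface / hbase.master.dns.nameserver (not master.dns.server), with hbase.regionserver.* equivalents:

  <property>
    <name>hbase.master.dns.interface</name>
    <value>eth2</value>       <!-- NIC whose address the master should resolve and advertise (placeholder) -->
  </property>
  <property>
    <name>hbase.master.dns.nameserver</name>
    <value>10.0.0.2</value>   <!-- DNS server used for that lookup (placeholder) -->
  </property>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>eth2</value>       <!-- same idea on the region servers, if they are multi-NIC too -->
  </property>

If reverse DNS for that interface's address still maps to the wrong canonical name, the /etc/hosts workaround likely has to be consistent on every node, not just the master.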
>>
>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> Thanks, Jean-Daniel.
>>>
>>> Logs don't show anything abnormal (not even warnings). How soon do you
>>> think the region servers should join?
>>>
>>> I am guessing the sequence should be something along these lines:
>>> ZooKeeper needs to time out the old master's session first (2 mins or so),
>>> then the hot spare should win the next master election (we probably should
>>> see that happening if we can tail its log, right?),
>>> and then the rest of the crowd should join in after something like the
>>> interval governed by the hbase.regionserver.msginterval property, if I
>>> read the code correctly?
>>>
>>> So all in all, something like 3 minutes should be enough to ensure
>>> everybody has found the new master one way or another, right? If not,
>>> we have a problem, right?
>>>
>>> Thanks.
>>> -Dmitriy
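
To make the timing above concrete, a sketch of the two knobs that dominate it, as hbase-site.xml entries; the values shown are illustrative rather than guaranteed defaults:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>   <!-- ms before ZooKeeper expires the dead master's session, letting the hot spare win the election -->
  </property>
  <property>
    <name>hbase.regionserver.msginterval</name>
    <value>3000</value>     <!-- ms between region server reports to the active master -->
  </property>

Roughly: session timeout + master election + a few msgintervals, which is where the ~3 minute estimate above comes from.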
>>>
>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>>> <jdcryans@apache.org> wrote:
>>>> Maybe there is something else in there; it would be useful to see logs
>>>> from the region servers when you are shutting down master1 and
>>>> bringing up master2.
>>>>
>>>> About "I have no failover for a critical component of my
>>>> infrastructure.", so is the Namenode, and for the moment you can't do
>>>> much about it. What's usually recommended is to put both the master
>>>> and the NN together on a more reliable machine. And the master ain't
>>>> that critical, almost everything works without it.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <sbarden@gmail.com> wrote:
>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in the
>>>>> upgrade).  HBase failed over to its "backup" master server just fine
>>>>> in the older version.  As u0 is 0.90.1+15.18, I had hoped the fix for
>>>>> the failover issue would be in it.  However, I'm having the same issue:
>>>>> master1 fails or I shut it down, and master2 waits for RSes to check in
>>>>> forever.  Restarting the services for master2 and all RSes does
>>>>> nothing until I start up master1.  So, essentially, I have no failover
>>>>> for a critical component of my infrastructure.  Needless to say I'm
>>>>> exceptionally frustrated.  Any ideas for a fix or workaround would be
>>>>> greatly appreciated.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Sean
>>>>>
>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>> Upgrade to CDH3u0 which as far as I can tell has it:
>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>>> Looks like my issue.  We're using 0.90.1-CDH3B4.  Looks like an
>>>>>>> upgrade is in order.  Can you suggest a workaround?
>>>>>>>
>>>>>>> thx,
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>>>>> which was fixed in 0.90.2. Which version are you testing?
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>>>> master1 (master2 is always running).  Master2's web i/f kicks in and I can
>>>>>>>>> zk_dump, but the region servers never show up.  Logs on master2 show
>>>>>>>>> repeated entries below:
>>>>>>>>>
>>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>>
>>>>>>>>> Obviously the RS are not checking in.  Not sure why.
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> thx,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sean Barden
>>>>>>>>> sbarden@gmail.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> sbarden@gmail.com
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sean Barden
>>>>> sbarden@gmail.com
>>>>>
>>>>
>>>
>>
>
