hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Hbase Master Failover Issue
Date Mon, 16 May 2011 18:59:10 GMT
Hey Dmitriy,

Awesome you could figure it out. I wonder if there's something that
could be done in HBase to help debugging such problems... Suggestions?

Also, just to make sure, this thread was started by Sean and it seems
you stepped up for him... you are working together right? At least
that's what Rapportive tells me, but still trying to make sure we
didn't forget someone else's problem.

Good on you,

J-D

On Sun, May 15, 2011 at 12:50 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> The problem was a multi-NIC configuration on the master nodes.
>
> I saw that the process starts listening on the wrong NIC.
>
> I read the source code and saw that with default settings it would use
> whatever IP is reported by the canonical hostname, i.e. whatever is
> returned by something like
>
> ping `hostname`
>
> Our canonical hostname was, of course, resolving to the wrong NIC.
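For reference, the default behaviour he describes amounts to something like the following sketch (this is not HBase's actual code, just the same lookup expressed in Python; the hostname and address printed will be whatever your box resolves to):

```python
import socket

# What HBase's default bind-address selection roughly amounts to:
# take the canonical hostname and resolve it to an IP. On a
# multi-NIC box this can easily be another interface's address.
hostname = socket.getfqdn()
bind_ip = socket.gethostbyname(hostname)
print("canonical hostname:", hostname)
print("would bind to:", bind_ip)
```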
>
> I kind of did not want to edit /etc/hosts (I guessed our admins
> had a reason to point the hostname at that NIC), so I forcefully set
> 'eth0' as hbase.master.dns.interface (if I remember that property name
> correctly).
>
> It started listening on the address pointed to by eth0:0 instead of eth0,
> which solved the problem anyway.
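In hbase-site.xml that override looks something like this (property names as shipped in 0.90-era HBase; `eth0` stands in for whichever interface the master should bind to):

```xml
<!-- Force the master to derive its address from a specific NIC
     rather than from the canonical hostname. -->
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <!-- Optional: which DNS server to use for the reverse lookup. -->
  <name>hbase.master.dns.nameserver</name>
  <value>default</value>
</property>
```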
>
> (Funny thing though: I still couldn't make it listen on the eth0 IP, only
> on eth0:0, although both had reverse DNS. Apparently whatever native code
> is used lists both IPs for that interface and then the first one that has
> reverse DNS wins, so there's no way to force it to listen on the other
> ones.)
>
> Bottom line: with multi-NIC configurations, your hostname in /etc/hosts had
> better point to the IP you want HBase to listen on. If it's different, you
> cannot use the default configuration.
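Concretely, the safe default setup is an /etc/hosts entry mapping the canonical hostname to the NIC you want (the address and names below are made up for illustration):

```
# /etc/hosts on the master: canonical hostname -> the IP HBase should bind to
10.0.0.15   master1.example.com  master1
```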
>
> -d
>
> On Sat, May 14, 2011 at 3:04 PM, Stack <stack@duboce.net> wrote:
>> What did you do to solve it?
>> Thanks,
>> St.Ack
>>
>> On Fri, May 13, 2011 at 6:17 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>> Ok i think the issue is largely solved. Thanks for your help, guys.
>>>
>>> -d
>>>
>>> On Fri, May 13, 2011 at 5:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>> OK, the problem seems to be multi-NIC hosting on the masters. The HBase
>>>> master starts up and listens on the canonical hostname, which points
>>>> to the wrong NIC. I am not sure why, so I am not changing that, but I am
>>>> struggling to override it at the moment as nothing seems to work
>>>> (master.dns.interface=eth2, master.dns.server=ip2 ... tried all
>>>> possible combinations). It probably has something to do with reverse
>>>> lookup, so I added an entry to the hosts file, to no avail so far. I will
>>>> have to talk to our admins to see why we can't switch the canonical
>>>> hostname to the IP that all the nodes are supposed to use it with.
>>>>
>>>> thanks.
>>>> -d
>>>>
>>>> On Fri, May 13, 2011 at 3:39 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>> Thanks, Jean-Daniel.
>>>>>
>>>>> Logs don't show anything abnormal (not even warnings). How soon do you
>>>>> think the region servers should join?
>>>>>
>>>>> I am guessing the sequence should be something along these lines:
>>>>> ZooKeeper needs to time out the old master's session first (2 minutes
>>>>> or so), then the hot spare should win the next master election (we
>>>>> should probably see that happening if we tail its log, right?),
>>>>> and then the rest of the crowd should join in, on something like the
>>>>> schedule governed by the hbase.regionserver.msginterval property, if I
>>>>> read the code correctly?
>>>>>
>>>>> So, all in all, something like 3 minutes should guarantee that everybody
>>>>> has found the new master one way or another, right? If not, we have a
>>>>> problem, right?
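The two knobs in that back-of-the-envelope estimate are configurable; in hbase-site.xml they look something like this (the values shown are illustrative, not necessarily the shipped defaults):

```xml
<property>
  <!-- How long ZooKeeper waits before declaring the old master's
       session dead; this dominates the failover time. -->
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <!-- How often a region server reports in to the master, in ms. -->
  <name>hbase.regionserver.msginterval</name>
  <value>3000</value>
</property>
```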
>>>>>
>>>>> Thanks.
>>>>> -Dmitriy
>>>>>
>>>>> On Fri, May 13, 2011 at 12:34 PM, Jean-Daniel Cryans
>>>>> <jdcryans@apache.org> wrote:
>>>>>> Maybe there is something else in there; it would be useful to see logs
>>>>>> from the region servers while you are shutting down master1 and
>>>>>> bringing up master2.
>>>>>>
>>>>>> About "I have no failover for a critical component of my
>>>>>> infrastructure": so is the Namenode, and for the moment you can't do
>>>>>> much about it. What's usually recommended is to put both the master
>>>>>> and the NN together on a more reliable machine. And the master ain't
>>>>>> that critical; almost everything works without it.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, May 13, 2011 at 12:08 PM, sean barden <sbarden@gmail.com> wrote:
>>>>>>> So I updated one of my clusters from CDHb1 to u0 with no issues (in
>>>>>>> the upgrade). HBase failed over to its "backup" master server just
>>>>>>> fine in the older version. As u0 is 0.90.1+15.18, I had hoped the fix
>>>>>>> for the failover issue would be in it. However, I'm having the same
>>>>>>> issue: master1 fails or I shut it down, and master2 waits forever for
>>>>>>> the RSes to check in. Restarting the services for master2 and all the
>>>>>>> RSes does nothing until I start up master1. So, essentially, I have no
>>>>>>> failover for a critical component of my infrastructure. Needless to
>>>>>>> say, I'm exceptionally frustrated. Any ideas for a fix or workaround
>>>>>>> would be greatly appreciated.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>> On Thu, May 5, 2011 at 11:59 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>> Upgrade to CDH3u0 which as far as I can tell has it:
>>>>>>>> http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, May 5, 2011 at 9:55 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>>>>> Looks like my issue. We're using 0.90.1-CDH3B4. Looks like an
>>>>>>>>> upgrade is in order. Can you suggest a workaround?
>>>>>>>>>
>>>>>>>>> thx,
>>>>>>>>>
>>>>>>>>> Sean
>>>>>>>>>
>>>>>>>>> On Thu, May 5, 2011 at 11:49 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>>>>>>>>>> This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
>>>>>>>>>> which was fixed in 0.90.2; which version are you testing?
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Thu, May 5, 2011 at 9:23 AM, sean barden <sbarden@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm testing failing over from one master to another by stopping
>>>>>>>>>>> master1 (master2 is always running). Master2's web interface kicks
>>>>>>>>>>> in and I can zk_dump, but the region servers never show up. Logs
>>>>>>>>>>> on master2 show the repeated entries below:
>>>>>>>>>>>
>>>>>>>>>>> 2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>>>> 2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>>>>> Waiting on regionserver(s) to checkin
>>>>>>>>>>>
>>>>>>>>>>> Obviously the RSes are not checking in. Not sure why.
>>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> thx,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Sean Barden
>>>>>>>>>>> sbarden@gmail.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sean Barden
>>>>>>>>> sbarden@gmail.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sean Barden
>>>>>>> sbarden@gmail.com
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
