hbase-user mailing list archives

From Tatsuya Kawano <tatsuya6...@gmail.com>
Subject Re: A kernel panic makes small HBase cluster to crush?
Date Thu, 10 Mar 2011 23:38:59 GMT

Hi Stack, 

Thanks for checking this issue and filing HBASE-3617. Well, that command was supposed to make
the node crash and shut down. I'll check the detailed procedure and try to reproduce this issue
during the weekend.


> This is odd.  Communication with the RegionServer was working fine up
> until it crashed?  On crash, the Master starts doing NRTHE?  

Yes. NRTHE occurred about two minutes after the RS crash. He ran the same test procedure
twice and got the same result both times.


> Master root filesystem is not full?

No, it shouldn't be full. I asked him to watch the disk space and network connection at a very
early stage of our conversation.


> Try to figure more on why the NRTHE above happened Tatsuya, if you can.

Sure. Let me work on it. I'll have some time on Saturday and Sunday morning to set up a test
cluster and play with the issue.

Thanks,

--
Tatsuya Kawano
Tokyo, Japan


On Mar 11, 2011, at 6:51 AM, Stack <stack@duboce.net> wrote:

> On Thu, Mar 10, 2011 at 3:41 AM, Tatsuya Kawano <tatsuya6502@gmail.com> wrote:
>> I suggested that he upgrade his environment to the latest version, so
>> this time he used CDH3b4 (HBase 0.90.1) and performed the same
>> test procedure. Now he has a new issue: HMaster aborted
>> because it couldn't reach the host that had the kernel panic.
>> 
>> Can anybody verify this issue for us?
>> You can just issue "echo c > /proc/sysrq-trigger" on a worker node
>> running region server, and check what would happen after a couple of
>> minutes.
>> 
> 
> I did the above Tatsuya and saw this in the RS messages log:
> 
> Mar 10 10:25:46 sv4borg228 kernel: [1189382.838243] SysRq : Trigger a crashdump
> 
> ... but all just kept chugging along.
> 
> (The RS stays up).
> 
> 
>> ---------------------------------------------------------------------------------------------------
>> 2011-03-10 07:48:39,192 FATAL org.apache.hadoop.hbase.master.HMaster:
>> Remote unexpected exception
>> java.net.NoRouteToHostException: No route to host
> 
> This is odd.  Communication with the RegionServer was working fine up
> until it crashed?  On crash, the Master starts doing NRTHE?  Master
> root filesystem is not full?
> 
> Checking the code, this exception will not be caught and it will trigger a
> Master abort.  That's a problem.  I opened
> https://issues.apache.org/jira/browse/HBASE-3617  Will fix for 0.90.2.
> 
> Try to figure more on why the NRTHE above happened Tatsuya, if you can.
> 
> St.Ack
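
The failure mode described above, where a NoRouteToHostException is not caught and ends up aborting the Master, can be illustrated with a minimal Java sketch. This is not the HBase code and not the HBASE-3617 patch; the class name UnreachableServerSketch, the Rpc interface, and the tryContact method are all hypothetical names used only for illustration. The sketch assumes the desired behavior is to treat an unreachable host as a dead server rather than as a fatal error.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

public class UnreachableServerSketch {
    /** Hypothetical stand-in for a remote call to a region server. */
    interface Rpc {
        void call() throws IOException;
    }

    /**
     * Returns true if the call succeeded and false if the host was
     * unreachable. Any other IOException still propagates to the caller.
     */
    static boolean tryContact(Rpc rpc) throws IOException {
        try {
            rpc.call();
            return true;
        } catch (NoRouteToHostException e) {
            // Assumption: an unreachable host means the server is gone,
            // so report that instead of letting the exception escape.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a call to a host that has just had a kernel panic.
        boolean ok = tryContact(() -> {
            throw new NoRouteToHostException("No route to host");
        });
        System.out.println("reachable=" + ok);
    }
}
```

NoRouteToHostException is a subclass of IOException, so a handler that only expects specific IOException subtypes can miss it; catching it explicitly, as here, is one possible way to keep the caller running.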

