hbase-user mailing list archives

From Tatsuya Kawano <tatsuya6...@gmail.com>
Subject Re: A kernel panic makes small HBase cluster to crush?
Date Thu, 10 Mar 2011 11:41:31 GMT
Hi,

I suggested that he upgrade his environment to the latest version, so
this time he used CDH3b4 (HBase 0.90.1) and ran the same test
procedure. Now he has hit a new issue: HMaster aborted because it
couldn't reach the host that had the kernel panic.

Can anybody verify this issue for us?
Just issue "echo c > /proc/sysrq-trigger" on a worker node running a
region server, and check what happens after a couple of minutes.
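In case it helps, here is a minimal sketch of that procedure as a script. The hostname "rs-node1" and the log path are assumptions for illustration only; the sysrq write needs root and really does panic the kernel, so the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the repro steps above. Assumptions (adjust for your cluster):
# a worker host "rs-node1" runs the region server, and the HMaster log
# lives under /var/log/hbase on the master host.
# DRY_RUN=1 (the default) prints each command instead of executing it,
# because the sysrq write below intentionally crashes the worker node.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Trigger a kernel panic on the region server host (destructive!).
run ssh root@rs-node1 "echo c > /proc/sysrq-trigger"

# 2. Give the master's balancer chore a couple of minutes to notice.
run sleep 120

# 3. Look for the abort in the HMaster log on the master host.
run grep -E "FATAL|Aborting" /var/log/hbase/hbase-hbase-master-*.log
```

Run it for real with DRY_RUN=0 only on a throwaway cluster.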

Thanks,

---------------------------------------------------------------------------------------------------
2011-03-10 07:48:39,192 FATAL org.apache.hadoop.hbase.master.HMaster: Remote unexpected exception
java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy6.closeRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1040)
        at org.apache.hadoop.hbase.master.AssignmentManager.balance(AssignmentManager.java:1831)
        at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:692)
        at org.apache.hadoop.hbase.master.HMaster$1.chore(HMaster.java:583)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
2011-03-10 07:48:39,192 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-03-10 07:48:39,192 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=SpecialObject_Speed_Test,,1299710751983.f0e5544339870a510c338b3029979d3e., src=ap13.secur2,60020,1299710609447, dest=ap12.secur2,60020,1299710609148
2011-03-10 07:48:39,192 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region SpecialObject_Speed_Test,,1299710751983.f0e5544339870a510c338b3029979d3e. (offlining)
2011-03-10 07:48:39,852 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
2011-03-10 07:48:39,852 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
2011-03-10 07:48:39,852 FATAL org.apache.hadoop.hbase.master.HMaster: Remote unexpected exception
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connection-pending remote=/10.X.X.18:60020]. 19340 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy6.closeRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093)
        at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1040)
        at org.apache.hadoop.hbase.master.AssignmentManager.balance(AssignmentManager.java:1831)
        at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:692)
        at org.apache.hadoop.hbase.master.HMaster$1.chore(HMaster.java:583)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
2011-03-10 07:48:39,852 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
---------------------------------------------------------------------------------------------------


-- 
Tatsuya Kawano
Tokyo, Japan


2011/3/5 Tatsuya Kawano <tatsuya6502@gmail.com>:
> Thanks for checking the HDFS code.
>
>>> Also it's strange that the region servers got corrupted reads when there are two more replicas available on HDFS.
>>
>> Corrupted reads? This is a loaded term, are you really saying that the
>> region server read corrupted data from HDFS?
>
> Sorry, it was too early to say the read data was corrupted. But the
> other region servers had to shut themselves down because they detected
> something wrong with their HFiles.
>
> "ABORTING region server serverName=ap12.secur2,60020,1298987576087,
> load=(requests=0, regions=4, usedHeap=218, maxHeap=1998): Replay of
> HLog required.
> Forcing server shutdown"
>
>
> I asked the guy to watch the data node and name node status if he can
> run the same test again. He hasn't come back to me yet.
>
> Thanks,
> Tatsuya
>
>
> 2011/3/5 Jean-Daniel Cryans <jdcryans@apache.org>:
>> (heh this thread gives me a reason to look at the HDFS code)
>>
>>> Well, doesn't the following message imply HDFS could accept writes when it has at least 1 data node available?
>>>
>>>> error: java.io.IOException: File /hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only be replicated to 0 nodes, instead of 1
>>
>> This is how that message is constructed:
>>
>>       throw new IOException("File " + src + " could only be replicated to " +
>>                           targets.length + " nodes, instead of " +
>>                           minReplication);
>>
>> minReplication is the number of replicas needed in order to accept a
>> write, by default 1. In this case, it wasn't able to place the block
>> anywhere for an unknown reason.
>>
>>>
>>> Also it's strange that the region servers got corrupted reads when there are two more replicas available on HDFS.
>>
>> Corrupted reads? This is a loaded term, are you really saying that the
>> region server read corrupted data from HDFS?
>>
>> J-D
>>
>
>
>
> --
> 河野 達也
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
> twitter: http://twitter.com/tatsuya6502
