hbase-user mailing list archives

From Jay T <jay.pyl...@gmail.com>
Subject Re: Region Server failure due to remote data node errors
Date Mon, 30 Jul 2012 18:01:08 GMT
Thanks for the quick reply, Nicolas. We are using HBase 0.94 on Hadoop 1.0.3.

I have uploaded the logs here:

Region Server log: http://pastebin.com/QEQ22UnU
Data Node log: http://pastebin.com/DF0JNL8K

Appreciate your help in figuring this out.

Thanks,
Jay



On 7/30/12 1:02 PM, N Keywal wrote:
> Hi Jay,
>
> Yes, the whole log would be interesting, plus the logs of the datanode
> on the same box as the dead RS.
> What are your hbase & hdfs versions?
>
> The RS should be immune to hdfs errors. There are known issues (see
> HDFS-3701), but it seems you have something different...
> This:
>> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
>> remote=/10.128.204.225:50010]
> Seems to say that the error was between the datanode on the same box as the RS?
>
> Nicolas
>
> On Mon, Jul 30, 2012 at 6:43 PM, Jay T <jay.pylons@gmail.com> wrote:
>>   A couple of our region servers (in a 16 node cluster) crashed due to
>> underlying Data Node errors. I am trying to understand how errors on remote
>> data nodes impact other region server processes.
>>
>> To briefly describe what happened:
>> 1) Cluster was in operation. All 16 nodes were up, reads and writes were
>> happening extensively.
>> 2) Nodes 7 and 8 were shut down for maintenance. (No graceful shutdown: the
>> DN and RS services were still running and the power was simply pulled.)
>> 3) Nodes 2 and 5 flushed and the DFS client started reporting errors. From the
>> log it appears that DFS blocks were being replicated to the nodes that had been
>> shut down (7 and 8); since replication could not complete, the DFS client raised
>> errors on 2 and 5 and eventually the RS itself died.
>>
>> The question I am trying to get answered is: is a Region Server immune to
>> errors on remote data nodes (that are part of the replication pipeline), or
>> not?
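>>
>> To frame that question, my rough mental model of what the DFS client does on a
>> pipeline error is sketched below (a simplified illustration I put together, not
>> the actual DFSClient code): exclude the bad datanode, rebuild the pipeline with
>> the survivors, and only fail the write once no datanodes are left. If that model
>> is right, the RS on Node 5 should have survived nodes 7 and 8 going away.
>>
>> import java.util.ArrayList;
>> import java.util.Arrays;
>> import java.util.HashSet;
>> import java.util.List;
>> import java.util.Set;
>>
>> // Simplified model of DFS write-pipeline recovery (not the real DFSClient):
>> // drop every datanode that errors out and keep writing through the survivors;
>> // the write only fails once the pipeline is empty.
>> public class PipelineRecoverySketch {
>>
>>     static boolean writeBlock(List<String> pipeline, Set<String> deadNodes) {
>>         List<String> live = new ArrayList<String>(pipeline);
>>         for (String dn : pipeline) {
>>             if (deadNodes.contains(dn)) {
>>                 // Corresponds to "Excluding datanode ..." / "Error Recovery for
>>                 // block ... bad datanode ..." in the RS log below.
>>                 System.out.println("Excluding bad datanode " + dn);
>>                 live.remove(dn);
>>             }
>>         }
>>         if (live.isEmpty()) {
>>             System.out.println("No datanodes left in the pipeline; write fails");
>>             return false;
>>         }
>>         System.out.println("Block written via pipeline " + live);
>>         return true;
>>     }
>>
>>     public static void main(String[] args) {
>>         // Hypothetical pipeline: nodes 7 and 8 are powered off, node 2 is healthy.
>>         List<String> pipeline = Arrays.asList("node7:50010", "node8:50010", "node2:50010");
>>         Set<String> dead = new HashSet<String>(Arrays.asList("node7:50010", "node8:50010"));
>>         writeBlock(pipeline, dead);
>>     }
>> }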
>> Part of the Region Server Log (Node 5):
>>
>> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>> createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad
>> connect ack with firstBadLink
>> as 10.128.204.228:50010
>> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>> block blk_-316956372096761177_489798
>> 2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>> datanode 10.128.204.228:50010
>> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
>> NO General Bloom and NO DeleteFamily was added to HFile
>> (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
>> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store:
>> Flushed, sequenceid=4046717645, memsize=256.5m, into tmp file
>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
>> 2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming
>> flushed file at
>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124 to
>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
>> 2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store:
>> Added hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124,
>> entries=1137956, sequenceid=4046717645, filesize=13.2m
>> 2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> java.net.SocketTimeoutException: 15000 millis timeout while waiting for
>> channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
>> remote=/10.128.204.225:50010]
>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
>> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_5116092240243398556_489796 bad datanode[0]
>> 10.128.204.225:50010
>> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_5116092240243398556_489796 in pipeline
>> 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad
>> datanode 10.128.204.225:50010
>>
>> I can pastebin the entire log, but this is where things started going wrong
>> for Node 5; eventually the shutdown hook for the RS ran and the RS was shut
>> down.
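>>
>> (For reference, block replica locations for the finished HFiles can be listed
>> with the standard FileSystem API, e.g. to see whether nodes 7 and 8 hold
>> copies. A rough sketch; the path below is just a placeholder, not one of our
>> real files:)
>>
>> import java.util.Arrays;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.BlockLocation;
>> import org.apache.hadoop.fs.FileStatus;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> public class ListBlockLocations {
>>     public static void main(String[] args) throws Exception {
>>         // Placeholder path; point this at a real HFile.
>>         Path p = new Path("hdfs://Node101:8020/hbase/table/region/CF/hfile");
>>         FileSystem fs = FileSystem.get(p.toUri(), new Configuration());
>>         FileStatus st = fs.getFileStatus(p);
>>         // One entry per block, listing the datanodes that hold a replica.
>>         for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
>>             System.out.println("offset=" + b.getOffset()
>>                 + " hosts=" + Arrays.toString(b.getHosts()));
>>         }
>>         fs.close();
>>     }
>> }
>>
>> ("hadoop fsck <path> -files -blocks -locations" shows the same information from
>> the command line.)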
>>
>> Any help in troubleshooting this is greatly appreciated.
>>
>> Thanks,
>> Jay

