From Jean-Adrien <a...@jeanjean.ch>
Subject Re: Regionserver fails to serve region
Date Tue, 21 Oct 2008 09:41:48 GMT

Hi,

I have made some more observations about the (my) "Premeture" [sic] problem.
It is clearly the problem described in HADOOP-3831:
http://issues.apache.org/jira/browse/HADOOP-3831

The datanodes time out when serving a block to HBase, 8 minutes after the
channel is opened (see the datanode log below).
Sometimes this error is reported to my client, when it happens during one of
my requests, but it also happens on several other occasions, as seen in the
regionserver log (about every 10 minutes).
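
For what it's worth, the 8 minutes match the 480000 ms in the exception,
which I believe comes from the dfs.datanode.socket.write.timeout property.
To rule the timeout itself out, I assume an override like this in
hadoop-site.xml would change it (untested on my side; my reading of
HADOOP-3831 is that a value of 0 disables the write timeout entirely):

---- hadoop-site.xml (untested sketch) ----
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- default is 480000 ms (8 minutes); 0 disables the write timeout -->
  <value>0</value>
</property>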

I said earlier that restarting HBase restores access to the regions;
restarting my client is in fact enough.

Since neither my client nor HBase should spend 8 minutes preparing data, I
believe it is either:
- an I/O throughput problem in my case, or
- some kind of deadlock in the channel; but other people would have noticed
that.

I monitor the I/O and CPU using iostat (10-second interval) and the Hadoop
datanode log, and I see the following:
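
For the record, I capture the I/O numbers with plain iostat at a 10-second
interval on each node:

$ iostat 10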


---- datanode log ----
2008-10-21 11:11:56,766 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.10:50010, storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075, ipcPort=50020):Got exception while serving blk_-321855630121782024_300805 to /192.168.1.10:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010 remote=/192.168.1.10:44764]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

2008-10-21 11:11:56,767 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.10:50010, storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075, ipcPort=50020):DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010 remote=/192.168.1.10:44764]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

[...]

2008-10-21 11:15:52,614 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.10:50010, storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075, ipcPort=50020):Got exception while serving blk_6873767988458539960_302970 to /192.168.1.10:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010 remote=/192.168.1.10:45482]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

2008-10-21 11:15:52,615 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.1.10:50010, storageID=DS-969720570-192.168.1.10-50010-1218034818982, infoPort=50075, ipcPort=50020):DataXceiver: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.10:50010 remote=/192.168.1.10:45482]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

Between these messages I see bursts of what looks like normal operation, and
pauses of 2-3 minutes.

e.g.
2008-10-21 11:18:23,126 INFO org.apache.hadoop.dfs.DataNode: Received block blk_5598767900914020531_303199 of size 9 from /192.168.1.11
2008-10-21 11:18:23,126 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 0 for block blk_5598767900914020531_303199 terminating
2008-10-21 11:18:23,372 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_1637433000135864223_303201 src: /192.168.1.13:60729 dest: /192.

If we look at iostat during the same period, I typically observe:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.40    0.00    3.50    0.00    0.00   82.10

i.e. no iowait, and mostly idle during the last 10 minutes, which makes me
think it is not a performance problem.

Then I thought about what is special in my cluster: I often update values in
HBase tables by doing batch updates with an existing timestamp (roughly like
the sketch below). But I cannot see how that correlates with our problem;
moreover, the updates work and the data are not corrupted.
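
For concreteness, what I do looks roughly like this (a fragment written from
memory against the 0.18 client API; the table, row, column, value and
timestamp are all made up for illustration):

---- sketch (Java, from memory) ----
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(new HBaseConfiguration(), "mytable");
// Re-use the timestamp of the existing cell so the new value replaces it
// in place instead of adding a new version at the current time.
long existingTs = 1224580316766L;  // illustrative value
BatchUpdate bu = new BatchUpdate(Bytes.toBytes("row1"), existingTs);
bu.put(Bytes.toBytes("colfam:qualifier"), Bytes.toBytes("new value"));
table.commit(bu);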

Another difference is that I haven't finalized my Hadoop upgrade yet... Once
again, I can't see any correlation. Anyway, these could be clues.

To be continued...


Stack, you asked me if my hard disks were full. I said one is. Why did you
link the above problem to that? Because of the du problem noted in
HADOOP-3232? I don't think I'm affected by that problem; my BlockReport
process takes less than a second.

Note that the other nodes still have hard disk space remaining, and I observe
the same channel timeout problem there.
My next step is to monitor the I/O activity and to try to see whether there
is a correlation between the failures and some potential overload; something
like the one-liner below should give me the failure timestamps to line up
against the iostat samples.
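
For example (the log file name is a guess based on Hadoop's default naming
scheme), pulling the date and time fields from the ERROR records:

$ grep -h 'ERROR.*SocketTimeoutException' hadoop-*-datanode-*.log | cut -d' ' -f1,2
2008-10-21 11:11:56,767
2008-10-21 11:15:52,615
...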

Another question, by the way:
We saw that hadoop-default.xml is used by the HBase client and that it
overrides the replication factor; OK. But could it also override the
dfs.datanode.du.reserved / dfs.datanode.du.pct properties? (Those sound like
datanode policy rather than client settings.) I mentioned that my settings
don't seem to affect the behaviour of the datanodes.
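
For reference, this is roughly what I mean (a sketch only; the 1 GB value is
just an example). My understanding is that the datanode reads these
properties from its own local hadoop-site.xml, which is why a client-side
copy overriding them would surprise me:

---- hadoop-site.xml on the datanodes (sketch) ----
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes per volume to keep free; datanode-side policy, not client-side -->
  <value>1073741824</value>
</property>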


Have a good day.
-- Jean-Adrien


Slava Gorelik wrote:
> 
> Hi. Most of the time I get Premeture [sic] EOF from inputStream, and
> sometimes also "No live nodes contain current block".
> No, I don't have a memory issue.
> 
> Best Regards.
> 
> On Mon, Oct 20, 2008 at 7:46 PM, stack <stack@duboce.net> wrote:
> 
>> Slava Gorelik wrote:
>>
>>> Hi. I have a similar problem.
>>> My configuration is 8 machines with 4 GB RAM, with the default heap size
>>> for hbase.
>>>
>>>
>>
>> Which part Slava?  You ran out of disk and you started to get "Premeture
>> [sic] EOF from inputStream"?  Or the NPEs?  Or you are seeing "No live
>> nodes
>> contain current block"?  You don't have J-A's memory issues I presume?
>>
>> St.Ack
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Regionserver-fails-to-serve-region-tp20028553p20086165.html
Sent from the HBase User mailing list archive at Nabble.com.

