hbase-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: is my hbase cluster overloaded?
Date Tue, 22 Apr 2014 07:53:47 GMT
Current hbase statistics:

Region Servers
ServerName                       Start time                     Load
app-hbase-1,60020,1398141516916  Tue Apr 22 12:38:36 CST 2014   requestsPerSecond=6100, numberOfOnlineRegions=7, usedHeapMB=1201, maxHeapMB=7948
app-hbase-2,60020,1398141516914  Tue Apr 22 12:38:36 CST 2014   requestsPerSecond=1770, numberOfOnlineRegions=4, usedHeapMB=224, maxHeapMB=7948
app-hbase-4,60020,1398141525533  Tue Apr 22 12:38:45 CST 2014   requestsPerSecond=3445, numberOfOnlineRegions=5, usedHeapMB=798, maxHeapMB=7948
app-hbase-5,60020,1398141524870  Tue Apr 22 12:38:44 CST 2014   requestsPerSecond=57, numberOfOnlineRegions=2, usedHeapMB=328, maxHeapMB=7948
Total: servers: 4, requestsPerSecond=11372, numberOfOnlineRegions=18

On Tue, Apr 22, 2014 at 3:40 PM, Li Li <fancyerii@gmail.com> wrote:
> I have now restarted the server and it is running. Maybe in an hour the
> load will become high.
>
> On Tue, Apr 22, 2014 at 3:02 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>> Do you still have the same issue?
>>
>> and:
>> -Xmx8000m -server -XX:NewSize=512m -XX:MaxNewSize=512m
>>
>> the Eden size is too small.
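>>
>> Purely as an illustration (the sizes below are made up for the example, not a tuned recommendation, and this assumes the region server options are set via HBASE_REGIONSERVER_OPTS in hbase-env.sh), a larger young generation could look roughly like this:
>>
>> # hypothetical sketch: give the region server JVM a larger young generation
>> export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmx8000m -server -XX:NewSize=2g -XX:MaxNewSize=2g"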
>>
>>
>>
>> On Tue, Apr 22, 2014 at 2:55 PM, Li Li <fancyerii@gmail.com> wrote:
>>
>>> <property>
>>>   <name>dfs.datanode.handler.count</name>
>>>   <value>100</value>
>>>   <description>The number of server threads for the datanode.</description>
>>> </property>
>>>
>>>
>>> 1. namenode/master  192.168.10.48
>>> http://pastebin.com/7M0zzAAc
>>>
>>> $free -m (these are the values now that I have restarted hadoop and hbase,
>>> not the values from when it crashed)
>>>              total       used       free     shared    buffers     cached
>>> Mem:         15951       3819      12131          0        509       1990
>>> -/+ buffers/cache:       1319      14631
>>> Swap:         8191          0       8191
>>>
>>> 2. datanode/region 192.168.10.45
>>> http://pastebin.com/FiAw1yju
>>>
>>> $free -m
>>>              total       used       free     shared    buffers     cached
>>> Mem:         15951       3627      12324          0       1516        641
>>> -/+ buffers/cache:       1469      14482
>>> Swap:         8191          8       8183
>>>
>>> On Tue, Apr 22, 2014 at 2:29 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>>> > One big possible issue is that you have highly concurrent requests on HDFS
>>> > or HBase, so all datanode handlers are busy, more requests are pending,
>>> > and then they time out. You can try to increase
>>> > dfs.datanode.handler.count and dfs.namenode.handler.count in
>>> > hdfs-site.xml, then restart HDFS.
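>>> >
>>> > For example (the values here are only illustrative, not tuned numbers), hdfs-site.xml could carry both settings:
>>> >
>>> > <property>
>>> >   <name>dfs.namenode.handler.count</name>
>>> >   <value>64</value>
>>> >   <description>The number of server threads for the namenode.</description>
>>> > </property>
>>> > <property>
>>> >   <name>dfs.datanode.handler.count</name>
>>> >   <value>100</value>
>>> >   <description>The number of server threads for the datanode.</description>
>>> > </property>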
>>> >
>>> > Also, do you have JVM options set for the datanode, namenode, and region
>>> > servers? If they are all at the defaults, that can also cause this issue.
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Apr 22, 2014 at 2:20 PM, Li Li <fancyerii@gmail.com> wrote:
>>> >
>>> >> my cluster setup: all 6 machines are virtual machines. each machine has
>>> >> 4 CPUs (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz) and 16GB memory
>>> >> 192.168.10.48 namenode/jobtracker
>>> >> 192.168.10.47 secondary namenode
>>> >> 192.168.10.45 datanode/tasktracker
>>> >> 192.168.10.46 datanode/tasktracker
>>> >> 192.168.10.49 datanode/tasktracker
>>> >> 192.168.10.50 datanode/tasktracker
>>> >>
>>> >> hdfs logs around 20:33
>>> >> 192.168.10.48 namenode log  http://pastebin.com/rwgmPEXR
>>> >> 192.168.10.45 datanode log http://pastebin.com/HBgZ8rtV (I found that
>>> >> this datanode crashed first)
>>> >> 192.168.10.46 datanode log http://pastebin.com/aQ2emnUi
>>> >> 192.168.10.49 datanode log http://pastebin.com/aqsWrrL1
>>> >> 192.168.10.50 datanode log http://pastebin.com/V7C6tjpB
>>> >>
>>> >> hbase logs around 20:33
>>> >> 192.168.10.48 master log http://pastebin.com/2ZfeYA1p
>>> >> 192.168.10.45 region log http://pastebin.com/idCF2a7Y
>>> >> 192.168.10.46 region log http://pastebin.com/WEh4dA0f
>>> >> 192.168.10.49 region log http://pastebin.com/cGtpbTLz
>>> >> 192.168.10.50 region log http://pastebin.com/bD6h5T6p (very strange:
>>> >> no log at 20:33, but there are logs at 20:32 and 20:34)
>>> >>
>>> >> On Tue, Apr 22, 2014 at 12:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> >> > Can you post more of the data node log, around 20:33 ?
>>> >> >
>>> >> > Cheers
>>> >> >
>>> >> >
>>> >> > On Mon, Apr 21, 2014 at 8:57 PM, Li Li <fancyerii@gmail.com> wrote:
>>> >> >
>>> >> >> hadoop 1.0
>>> >> >> hbase 0.94.11
>>> >> >>
>>> >> >> datanode log from 192.168.10.45. why did it shut itself down?
>>> >> >>
>>> >> >> 2014-04-21 20:33:59,309 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7969006819959471805_202154 received exception java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 0 millis timeout left.
>>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>> >> >>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>> >> >>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
>>> >> >>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>>> >> >>         at java.io.DataInputStream.read(DataInputStream.java:149)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>>> >> >>         at java.lang.Thread.run(Thread.java:722)
>>> >> >> 2014-04-21 20:33:59,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.45:50010, storageID=DS-1676697306-192.168.10.45-50010-1392029190949, infoPort=50075, ipcPort=50020):DataXceiver
>>> >> >> java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 466924 millis timeout left.
>>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>> >> >>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:245)
>>> >> >>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>> >> >>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>>> >> >>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>> >> >>         at java.lang.Thread.run(Thread.java:722)
>>> >> >> 2014-04-21 20:34:00,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 0
>>> >> >> 2014-04-21 20:34:00,404 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: Shutting down all async disk service threads...
>>> >> >> 2014-04-21 20:34:00,405 INFO org.apache.hadoop.hdfs.server.datanode.FSDatasetAsyncDiskService: All async disk service threads have been shut down.
>>> >> >> 2014-04-21 20:34:00,413 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
>>> >> >> 2014-04-21 20:34:00,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
>>> >> >> /************************************************************
>>> >> >> SHUTDOWN_MSG: Shutting down DataNode at app-hbase-1/192.168.10.45
>>> >> >> ************************************************************/
>>> >> >>
>>> >> >> On Tue, Apr 22, 2014 at 11:25 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>> >> >> > bq. one datanode failed
>>> >> >> >
>>> >> >> > Was the crash due to an out-of-memory error?
>>> >> >> > Can you post the tail of data node log on pastebin ?
>>> >> >> >
>>> >> >> > Giving us versions of hadoop and hbase would be helpful.
>>> >> >> >
>>> >> >> >
>>> >> >> > On Mon, Apr 21, 2014 at 7:39 PM, Li Li <fancyerii@gmail.com> wrote:
>>> >> >> >
>>> >> >> >> I have a small hbase cluster with 1 namenode, 1 secondary namenode,
>>> >> >> >> and 4 datanodes. The hbase master is on the same machine as the
>>> >> >> >> namenode, and the 4 hbase slaves are on the datanode machines.
>>> >> >> >> I found the average request rate is about 10,000 requests per second,
>>> >> >> >> and the cluster crashed. I found the reason is that one datanode failed.
>>> >> >> >>
>>> >> >> >> Each datanode has about 4 cpu cores and 10GB memory.
>>> >> >> >> Is my cluster overloaded?
>>> >> >> >>
>>> >> >>
>>> >>
>>>
