hbase-user mailing list archives

From Johannes Schaback <johannes.schab...@visual-meta.com>
Subject single RegionServer stuck, causing cluster to hang
Date Fri, 22 Aug 2014 17:28:57 GMT
Dear HBase-Pros,

We have been facing a serious issue with our HBase production cluster for
two days now. Every couple of minutes, a random RegionServer gets stuck and
stops processing requests. Within a minute, the other RegionServers freeze
as well, which brings down the entire cluster. Stopping the affected
RegionServer unblocks the cluster and everything goes back to normal.

We run 27 RegionServers, each with 31 GB of JVM memory. The HBase version is
0.98.5 on Hadoop 2.4.1. We have two main tables. The first has about 4,500
regions and holds 8 TB at about 1,000 requests per second. The second has
around 200 regions and 800 GB of data, has IN_MEMORY enabled, and receives
about 50,000 to 120,000 requests per second across all regions.

While investigating the problem, I found that every healthy RegionServer
has the following thread:

Thread 12 (RpcServer.listener,port=60020):
  Blocked count: 35
  Waited count: 0
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)

When a RegionServer suddenly becomes blocked, this same thread looks like
this:

Thread 12 (RpcServer.listener,port=60020):
  State: BLOCKED
  Blocked count: 2889
  Waited count: 0
  Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
  Blocked by 14 (RpcServer.reader=1,port=60020)
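
The difference between the two dumps above can be checked mechanically. The following is a minimal sketch in Python, assuming the dump text has exactly the layout quoted above (a "Thread N (name):" header followed by indented "State:" and "Blocked by" lines). The function name blocked_listeners is hypothetical and not part of HBase.

```python
import re

# Header line of one thread entry, e.g. "Thread 12 (RpcServer.listener,port=60020):"
THREAD_HEADER = re.compile(r"^Thread (\d+) \((.+)\):\s*$")


def blocked_listeners(dump_text):
    """Return (thread_id, name, blocked_by) for each BLOCKED RpcServer.listener thread.

    Assumes the dump layout quoted in this message; other formats are not handled.
    """
    listeners = []
    current = None  # the thread entry currently being read
    for line in dump_text.splitlines():
        text = line.strip()
        header = THREAD_HEADER.match(text)
        if header:
            current = {"id": int(header.group(1)), "name": header.group(2),
                       "state": None, "blocked_by": None}
            if current["name"].startswith("RpcServer.listener"):
                listeners.append(current)
            continue
        if current is None:
            continue
        if text.startswith("State:"):
            current["state"] = text.split(":", 1)[1].strip()
        elif text.startswith("Blocked by"):
            current["blocked_by"] = text[len("Blocked by"):].strip()
    return [(t["id"], t["name"], t["blocked_by"])
            for t in listeners if t["state"] == "BLOCKED"]
```

An empty result means no listener thread in the dump is reported as BLOCKED. A non-empty result also names the thread holding the lock, which is the reader thread in the dump above.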



Also, JMX shows for an unhealthy RegionServer that

   - "queueSize" grows quickly and constantly to values greater than 60k,
   - "numCallsInGeneralQueue" quickly reaches 300

Both values are usually very small or 0 under normal circumstances, but in
case of a RS "getting stuck" they explode, which leads me to believe that
the IPC-queue does not get processed properly causing the RegionServer to
become "deaf".

These two symptoms appear to bring down the entire cluster. When killing
that RS, everything goes back to normal.

I could not find any correlation between this phenomenon and compactions,
load or other factors. hbck says it is all fine as well.

The servers are all 3.2.0-4-amd64 Debian, 12 cores, 96 GB RAM. Besides the
RS and a DataNode, there isn't too much running on the boxes so the load
(top) is usually around 5 to 10 and bandwidth does not exceed 10 MB on

We currently survive by constantly polling /jmx on all RegionServers and
restarting those that show the symptoms :(
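
A polling check of this kind could look roughly like the sketch below. It is a minimal Python illustration, not the script actually in use. Assumptions: the /jmx endpoint returns a JSON document with a top-level "beans" list, the RegionServer info port is 60030, and the thresholds mirror the values mentioned above (queueSize above 60k, numCallsInGeneralQueue at 300). All function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed thresholds, taken loosely from the values reported in this message.
QUEUE_SIZE_LIMIT = 60000
GENERAL_QUEUE_LIMIT = 300


def find_metric(beans, name):
    """Return the first value of attribute `name` found in any JMX bean, else None."""
    for bean in beans:
        if name in bean:
            return bean[name]
    return None


def looks_stuck(jmx_doc):
    """Check one parsed /jmx document for either of the two symptoms."""
    beans = jmx_doc.get("beans", [])
    queue_size = find_metric(beans, "queueSize") or 0
    general_calls = find_metric(beans, "numCallsInGeneralQueue") or 0
    return queue_size > QUEUE_SIZE_LIMIT or general_calls >= GENERAL_QUEUE_LIMIT


def fetch_jmx(host, port=60030, timeout=5):
    """Fetch and parse /jmx from one RegionServer (host and port are assumptions)."""
    with urlopen("http://%s:%d/jmx" % (host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling looks_stuck(fetch_jmx(host)) for each RegionServer host in a loop gives the list of candidates to restart. A request that times out may itself be a sign of a stuck server, so a caller would probably want to treat that case separately.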

Do you have any idea what could be causing this?

Thank you very much in advance!

