hbase-user mailing list archives

From Asaf Mesika <asaf.mes...@gmail.com>
Subject RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7
Date Mon, 10 Feb 2014 08:25:58 GMT
Hi,

We have HBase 0.94.7 deployed in production with 54 Region Servers (Hadoop
1).
A couple of days ago, we had an incident which made our system unusable for
several hours.
HBase started emitting WARN exceptions indefinitely, thus failing any
writes to it. The issue wasn't resolved until we stopped this RS.


2014-02-07 02:10:14,415 WARN org.apache.hadoop.hbase.regionserver.HRegion: Failed getting lock in batch put, row=E\x09F\xD4\xD4\xE8\xF4\x8E\x10\x18UD\x0E\xE7\x11\x1B\x05\x00\x00\x01D\x0A\x18i\xA5\x11\x8C\xEC7\x87a`\x00
java.io.IOException: Timed out on getting lock for row=E\x09F\xD4\xD4\xE8\xF4\x8E\x10\x18UD\x0E\xE7\x11\x1B\x05\x00\x00\x01D\x0A\x18i\xA5\x11\x8C\xEC7\x87a`\x00
        at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:3441)
        at org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:3518)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2282)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2153)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3755)
        at sun.reflect.GeneratedMethodAccessor168.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
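
For context: as far as I can tell, the timeout behind this exception is
controlled by hbase.rowlock.wait.duration (30000 ms by default in 0.94),
which is how long internalObtainRowLock waits before giving up. A minimal
sketch of reading it server-side (the class name is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RowLockTimeoutCheck {
    public static void main(String[] args) {
        // Loads hbase-site.xml from the classpath, like the RS does.
        Configuration conf = HBaseConfiguration.create();
        // internalObtainRowLock throws the IOException above once this many
        // milliseconds pass without acquiring the per-row lock.
        int waitMs = conf.getInt("hbase.rowlock.wait.duration", 30000);
        System.out.println("hbase.rowlock.wait.duration = " + waitMs + " ms");
    }
}

In other words, the WARNs mean each handler waited the full duration and
still could not obtain the row lock.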


*So what happened?*

   - At 01:52, for some reason, the local DataNode stopped responding to
   the RS, although its log seems alive until 02:10 and then shows nothing
   until it was shut down manually.
   - HBase gets timeouts when writing to HDFS, understands there is a
   problem with the local DataNode and excludes it.
   - HDFS write (flush) throughput drops to 0.9 MB/sec for 1-2 minutes, and
   then recovers to 56 MB/sec; all writes go to a remote DataNode.
   - And then, suddenly, the exception quoted above starts appearing (a
   small plain-Java illustration of the lock-timeout pattern follows this
   list).
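
To make the symptom concrete, here is a minimal, self-contained plain-Java
illustration (not HBase code, and only my guess at the mechanism): one
writer blocks on slow I/O while holding a per-row lock, and every other
writer to that row times out waiting for it, which is what the "Failed
getting lock in batch put" WARNs look like.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class RowLockStarvation {
    // Stands in for the per-row lock a handler takes before a batch put.
    static final ReentrantLock rowLock = new ReentrantLock();

    public static void main(String[] args) throws Exception {
        Thread stuckHandler = new Thread(() -> {
            rowLock.lock();               // handler acquires the row lock...
            try {
                Thread.sleep(60_000);     // ...then blocks; stands in for a WAL
                                          // sync hanging on a dead local DataNode
            } catch (InterruptedException ignored) {
            } finally {
                rowLock.unlock();
            }
        });
        stuckHandler.start();
        Thread.sleep(100);                // let the first handler grab the lock

        // Every other handler writing to the same row now waits and gives up.
        boolean acquired = rowLock.tryLock(2, TimeUnit.SECONDS);
        System.out.println("second writer acquired row lock: " + acquired); // false
        if (acquired) {
            rowLock.unlock();
        }
        stuckHandler.interrupt();
    }
}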

Any idea what this issue is about?
