hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: RegionServer stuck in internalObtainRowLock forever - HBase 0.94.7
Date Tue, 18 Feb 2014 22:48:37 GMT
On Mon, Feb 10, 2014 at 12:25 AM, Asaf Mesika <asaf.mesika@gmail.com> wrote:

> Hi,
>
> We have HBase 0.94.7 deployed in production with 54 Region Servers (Hadoop
> 1).
> A couple of days ago, we had an incident that made our system unusable for
> several hours.
> HBase started emitting WARN exceptions indefinitely, thus failing any
> writes to it. The issue wasn't resolved until we stopped this RS.
>
>
> 2014-02-07 02:10:14,415 WARN org.apache.hadoop.hbase.regionserver.HRegion: Failed getting lock in batch put, row=E\x09F\xD4\xD4\xE8\xF4\x8E\x10\x18UD\x0E\xE7\x11\x1B\x05\x00\x00\x01D\x0A\x18i\xA5\x11\x8C\xEC7\x87a`\x00
> java.io.IOException: Timed out on getting lock for row=E\x09F\xD4\xD4\xE8\xF4\x8E\x10\x18UD\x0E\xE7\x11\x1B\x05\x00\x00\x01D\x0A\x18i\xA5\x11\x8C\xEC7\x87a`\x00
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalObtainRowLock(HRegion.java:3441)
>         at org.apache.hadoop.hbase.regionserver.HRegion.getLock(HRegion.java:3518)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2153)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3755)
>         at sun.reflect.GeneratedMethodAccessor168.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
>
>
Another thread had this row lock and was blocked on HDFS?  It seems the other
thread didn't time out.  Perhaps the bad disk was still responding, just really
slowly?  Are you using the default lock acquisition timeout (30s)?
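
(For context, a minimal sketch of how I understand the 0.94 row-lock wait to
work; the property name "hbase.rowlock.wait.duration" and its 30000 ms default
are from memory, so treat them as assumptions rather than a quote of the
source.  The point is that the thread already holding the row lock can block
on HDFS indefinitely, while every other writer to that row waits on a latch
bounded by that timeout and then fails with the exception above.)

    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Sketch only -- not the actual 0.94 source.
    class RowLockSketch {
      private final ConcurrentHashMap<String, CountDownLatch> lockedRows =
          new ConcurrentHashMap<String, CountDownLatch>();
      // Assumed: hbase.rowlock.wait.duration, default 30000 ms.
      private final int rowLockWaitDuration = 30000;

      void obtainRowLock(String rowKey) throws IOException {
        CountDownLatch myLatch = new CountDownLatch(1);
        while (true) {
          CountDownLatch existing = lockedRows.putIfAbsent(rowKey, myLatch);
          if (existing == null) {
            return; // lock acquired; release would remove the entry and count down
          }
          try {
            // The current holder may itself be stuck writing to a slow or dead
            // DataNode.  If it does not release the lock within the wait
            // duration, every other writer to this row fails like the WARN above.
            if (!existing.await(rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
              throw new IOException("Timed out on getting lock for row=" + rowKey);
            }
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted waiting for lock on row=" + rowKey);
          }
        }
      }
    }

Raising that property would only make the batch puts wait longer; the stuck
lock holder would still need to be found (hence the thread dump question
below).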



>
> *So what happened?*
>
>    - At 01:52, for some reason, the local DataNode stopped responding to
>    the RS, although its logs seem alive until 02:10 and then show nothing
>    until it was shut down manually.
>    - HBase gets timeouts writing to HDFS, decides there is a problem with
>    the local DataNode, and excludes it.
>    - HDFS write (flush) throughput drops to 0.9 MB/sec for 1-2 minutes, and
>    then recovers to 56 MB/sec; all writes now go to a remote DataNode.
>    - And then, suddenly, the exception quoted above.
>
> Any idea what this issue is about?
>

After the exception, once hbase had recalibrated onto non-local replicas, did
everything run 'fine' thereafter?

Did you get a thread dump from around this time?

Could you see the disk going bad in your monitoring?  Upped latencies or
errors?  The disk just stopped working?

 St.Ack
