hbase-dev mailing list archives

From "Mingjie Lai (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-13782) RS stuck after FATAL ``FSHLog: Could not append.''
Date Tue, 26 May 2015 23:26:20 GMT
Mingjie Lai created HBASE-13782:
-----------------------------------

             Summary: RS stuck after FATAL ``FSHLog: Could not append.''
                 Key: HBASE-13782
                 URL: https://issues.apache.org/jira/browse/HBASE-13782
             Project: HBase
          Issue Type: Bug
          Components: wal
    Affects Versions: 1.0.1
         Environment: hbase version: 1.0.0-cdh5.4.0
hadoop version: 2.6.0-cdh5.4.0 
            Reporter: Mingjie Lai
            Priority: Critical


Environment: a 40-node Hadoop cluster, shared by a 10-node HBase cluster and a 30-node YARN cluster.

We started to see one RS stop serving any client requests at 2015-05-26 01:05:33, while
all other RSes were okay. I checked the RS log and found some FATAL entries logged when
org.apache.hadoop.hbase.regionserver.wal.FSHLog tried to append() and sync():

{code}
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append.
Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append.
Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append.
Requesting close of wal
java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing,
request close of wal
{code}

Since the HDFS cluster is shared with the YARN cluster, some IO-heavy jobs were running at
the time and exhausted the xceiver threads on some of the DNs at exactly the same time. I
think that is why the RS got ``java.io.IOException: Bad connect ack with firstBadLink''.
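
As an aside, the per-DN xceiver ceiling is controlled by dfs.datanode.max.transfer.threads
(formerly dfs.datanode.max.xcievers) in hdfs-site.xml, e.g.:

{code}
<!-- hdfs-site.xml on each DataNode; 8192 is only an illustrative value -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
{code}

Raising it only lifts the ceiling on the DN side, though; it would not address the RS-side
hang described next.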

The problem is that the RS was stuck and unresponsive from then on. flushQueueLength grew
to its ceiling and stayed there. The only log entries after that point are from the
periodicFlusher:

{code}
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher
requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159.
after a delay of 11000
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher
requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be.
after a delay of 7874
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher
requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c.
after a delay of 22740
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher
requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe.
after a delay of 3162
2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher
requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920.
after a delay of 9142
{code}
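
For reference, the flushQueueLength gauge can be sampled from the RS metrics servlet with
something like the following (assuming the info port is 60030, matching the 60020 RPC port
this deployment uses; the bean name follows HBase 1.0 metrics2 naming and may differ across
versions):

{code}
curl 'http://hbase08.company.com:60030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' \
  | grep flushQueueLength
{code}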

Looks like there is an RS-level deadlock triggered by the handling of the FATAL append
exception. In the end, I had to restart the RS service to rescue the regions from the stuck
RS. This is the exception handling in FSHLog that emits the FATAL message above:

{code}
      } catch (Exception e) {
        LOG.fatal("Could not append. Requesting close of wal", e);
        requestLogRoll();
        throw e;
      }
      numEntries.incrementAndGet();
    }
{code}

Maybe the RS should just kill itself after the FATAL exception, since it cannot append the
WAL to HDFS?
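
For illustration, a rough sketch of that behavior, assuming the append path could reach an
Abortable (e.g. the enclosing HRegionServer); the wiring below is hypothetical, not current
FSHLog code:

{code}
import java.io.IOException;
import org.apache.hadoop.hbase.Abortable;

// Hypothetical wrapper around the WAL append path: on a failed append,
// abort the whole region server so the master reassigns its regions,
// instead of only requesting a log roll and limping along wedged.
class FatalWalAppendHandler {
  private final Abortable server; // e.g. the enclosing HRegionServer

  FatalWalAppendHandler(Abortable server) {
    this.server = server;
  }

  void append(byte[] entry) throws IOException {
    try {
      doAppend(entry); // the real HDFS write, elided here
    } catch (IOException e) {
      // Today FSHLog logs FATAL and requests a roll; the suggestion is
      // to also take the RS down rather than leave it stuck.
      server.abort("WAL append failed; aborting region server", e);
      throw e;
    }
  }

  private void doAppend(byte[] entry) throws IOException {
    // placeholder for the actual FSHLog append
  }
}
{code}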




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
