hbase-issues mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13782) RS stuck after FATAL ``FSHLog: Could not append.''
Date Sun, 31 May 2015 18:31:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566659#comment-14566659 ]

Lars Hofhansl commented on HBASE-13782:
---------------------------------------

An RS _should_ shut down when it encounters a FATAL problem; this is a bug.
Please list the exact Apache HBase version used, rather than a vendor-specific version number.


> RS stuck after FATAL ``FSHLog: Could not append.''
> --------------------------------------------------
>
>                 Key: HBASE-13782
>                 URL: https://issues.apache.org/jira/browse/HBASE-13782
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.1
>         Environment: hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0
>            Reporter: Mingjie Lai
>            Priority: Critical
>         Attachments: hbase-rs.log, hbase-site.xml
>
>
> hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0
> Environment: a 40-node Hadoop cluster shared between a 10-node HBase cluster and a 30-node YARN cluster.
> We started to see one RS stop serving any client requests from 2015-05-26 01:05:33 on, while all the other RSs were fine. I checked the RS log and found several FATAL entries logged when org.apache.hadoop.hbase.regionserver.wal.FSHLog tried to append() and sync():
> {code}
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
> 	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
> 2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal
> {code}
> Since the HDFS cluster is shared with a YARN cluster, some IO-heavy jobs were running at the time and exhausted the xceiver limit on some of the DNs at the exact same time. I think that is why the RS got "java.io.IOException: Bad connect ack with firstBadLink".
> The problem is that the RS has been stuck, without any response, ever since. flushQueueLength grew to the ceiling and stayed there. The only log entries since then are from the periodicFlusher:
> {code}
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
> {code}
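For reference, the DataNode transceiver limit that was exhausted is controlled by dfs.datanode.max.transfer.threads (the successor of the deprecated dfs.datanode.max.xcievers setting). A hedged example of raising it in hdfs-site.xml; the value 8192 is illustrative only, not a recommendation from this report:
{code}
<!-- hdfs-site.xml: per-DataNode cap on transceiver threads.
     8192 is an illustrative value, not taken from this report. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
{code}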
> It looks like there is an RS-level deadlock triggered by the handling of the FATAL append exception. In the end, I had to restart the RS service to rescue the regions from the stuck RS.
> {code}
>       } catch (Exception e) {
>         LOG.fatal("Could not append. Requesting close of wal", e);
>         requestLogRoll();
>         throw e;
>       }
>       numEntries.incrementAndGet();
>     }
> {code}
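The snippet above rethrows after requesting a log roll, and the report suggests the subsequent exception handling strands the RS. A minimal, hedged model of that symptom (not the actual FSHLog internals): if the thread responsible for completing pending sync futures dies on a fatal append error without completing them, callers blocked on sync() never wake up.
{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StrandedSyncSketch {
    public static void main(String[] args) throws Exception {
        // A pending sync the appender thread is responsible for completing.
        CompletableFuture<Void> pendingSync = new CompletableFuture<>();

        // The appender hits a fatal append error and dies without completing
        // (or exceptionally completing) the pending sync.
        Thread appender = new Thread(() -> {
            throw new RuntimeException("Could not append. Requesting close of wal");
        });
        appender.setUncaughtExceptionHandler((t, e) -> { /* logged as FATAL, nothing else */ });
        appender.start();
        appender.join();

        // A handler waiting on sync() would now block forever; the wait is
        // bounded here only so the sketch terminates.
        try {
            pendingSync.get(200, TimeUnit.MILLISECONDS);
            System.out.println("sync completed");
        } catch (TimeoutException te) {
            System.out.println("sync stranded");  // the stuck-RS symptom in miniature
        }
    }
}
{code}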
> Maybe the RS should just kill itself after the FATAL exception, since it cannot append the WAL to HDFS?
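The reporter's suggestion of aborting the RS could be sketched as below. This is a hedged, self-contained model, not the actual HBase code; WalAppender and ServerAborter are hypothetical names standing in for the FSHLog append path and the region server's abort hook.
{code}
import java.io.IOException;

public class WalAppendAbortSketch {
    // Hypothetical stand-in for the region server's abort hook.
    interface ServerAborter { void abort(String reason, Throwable cause); }

    // Hypothetical stand-in for the FSHLog append path.
    static class WalAppender {
        private final ServerAborter aborter;
        private int numEntries = 0;
        WalAppender(ServerAborter aborter) { this.aborter = aborter; }

        void append(byte[] entry) {
            try {
                write(entry);  // may throw on an HDFS pipeline failure
            } catch (IOException e) {
                // Instead of only requesting a log roll (which the report
                // suggests can leave the RS stuck), abort the whole server
                // so the master reassigns its regions.
                aborter.abort("Could not append to WAL", e);
                return;
            }
            numEntries++;
        }

        // Simulated write: null models the failed pipeline from the report.
        private void write(byte[] entry) throws IOException {
            if (entry == null) throw new IOException("Bad connect ack with firstBadLink");
        }
        int entries() { return numEntries; }
    }

    public static void main(String[] args) {
        final boolean[] aborted = {false};
        WalAppender appender = new WalAppender((reason, cause) -> aborted[0] = true);
        appender.append(new byte[] {1});  // succeeds
        appender.append(null);            // simulated pipeline failure -> abort
        System.out.println("aborted=" + aborted[0] + " entries=" + appender.entries());
    }
}
{code}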



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
