hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
Date Fri, 28 Aug 2015 04:39:45 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718051#comment-14718051 ]

stack commented on HBASE-14317:
-------------------------------

This is from the log attached to the original complaint:

{code}
2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1]
wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more
good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110],
original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement
policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy'
in its configuration.
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
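
For reference, the replacement policy the IOE points at is client-side HDFS configuration. Below is a minimal sketch, not a recommendation, of how a client could relax it; the property names are the ones quoted in the exception, the class and method are made up for illustration, and relaxing the policy only hides the pipeline error rather than fixing the stuck roll:

{code}
// Sketch only: relax the datanode-replacement policy named in the exception above.
// Property names come from the IOE / hdfs-default.xml; the class is hypothetical.
import org.apache.hadoop.conf.Configuration;

public class RelaxedReplaceDatanodePolicy {
  public static Configuration relaxed() {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    // NEVER = keep writing to the shrunken pipeline instead of failing the append.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    return conf;
  }
}
{code}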

It looks like yours in that the complaint is that we cannot append.

If I manufacture a failed append, I can get a hang. It is this logic in the finally block of HRegion#doMiniBatchMutation
... and probably in all the other places where we do the append/sync dance. At the end of step 5, we
do the WAL append, and if we get an IOE, which is what you have pasted and what is in the
original complaint's log, then we go to the finally:

{code}
    } finally {
      // if the wal sync was unsuccessful, remove keys from memstore
      if (doRollBackMemstore) {
        rollbackMemstore(memstoreCells);
      }
      if (w != null) {
        mvcc.completeMemstoreInsertWithSeqNum(w, walKey);
      }
...
{code}

The rollback of edits is fine, but w is not null in the above, so we go to complete the insert
in mvcc, and inside there we ask the walKey for its sequenceid... which is assigned AFTER we
append... only the append failed. So we wait...
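
To make the hang concrete, here is a stripped-down sketch of the waiting pattern. The class below is hypothetical, not the real WALKey, but the shape is the same: the sequence id is only handed out from the append path, so a failed append means the latch never opens and whoever asks for the sequence id parks forever:

{code}
// Hypothetical simplification of why the finally block above hangs.
import java.util.concurrent.CountDownLatch;

class SimplifiedWalKey {
  private final CountDownLatch seqIdAssigned = new CountDownLatch(1);
  private volatile long sequenceId = -1L;

  // Called from the WAL append path once a sequence id is handed out.
  void assignSequenceId(long id) {
    this.sequenceId = id;
    seqIdAssigned.countDown();
  }

  // Called on the completeMemstoreInsertWithSeqNum(w, walKey) path: blocks until
  // the append assigns an id -- which never happens after a failed append.
  long getSequenceId() throws InterruptedException {
    seqIdAssigned.await();
    return sequenceId;
  }
}
{code}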

Let me look a bit more.

I think your patch would break a wait on a safe point, but I am not sure it would unblock all threads.
Let me try to manufacture safepoint waiters too. Will be back.
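
For the safepoint side, this is roughly the zig-zag wait I want to manufacture (names below are made up, modeled loosely on FSHLog's safe-point latch): the roller waits for the ring buffer consumer to declare a safe point, and the consumer then waits to be released, so if the consumer is itself wedged behind the failed append, the roller never gets its safe point:

{code}
// Hypothetical sketch of the two-sided safe-point wait during a WAL roll.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SimplifiedSafePoint {
  private final CountDownLatch attained = new CountDownLatch(1);
  private final CountDownLatch released = new CountDownLatch(1);

  // Ring buffer consumer: announce the safe point, then park until the roll is done.
  void safePointAttained() throws InterruptedException {
    attained.countDown();
    released.await();
  }

  // Log roller: wait for the consumer to reach the safe point before swapping WALs.
  boolean waitSafePointAttained(long timeoutMs) throws InterruptedException {
    return attained.await(timeoutMs, TimeUnit.MILLISECONDS);
  }

  // Log roller: let the consumer continue once the new WAL is in place.
  void releaseSafePoint() {
    released.countDown();
  }
}
{code}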


> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Critical
>         Attachments: HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - Pastebin.com.html,
> raw.php, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck. See the
> attached thread dump and associated log. What is interesting is that syncers are waiting to
> take syncs to run, and at the same time we want to flush, so we are waiting on a safe point,
> but there seems to be nothing in our ring buffer; did we go to roll the log and not add a
> safe-point sync to clear out the ring buffer?
> Needs a bit of study. Try to reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
