hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14317) Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
Date Thu, 03 Sep 2015 18:36:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729538#comment-14729538 ]

stack commented on HBASE-14317:

Ran a 1B ITBLL (IntegrationTestBigLinkedList) with chaos monkeys on a small cluster and confirmed all data was there. Checked the logs: no hangs and no complaints related to this patch. There were just the usual complaints about slow HDFS, including stuff like this:

2015-09-02 23:56:52,790 WARN  [regionserver/c2023.halxg.cloudera.com/] hdfs.DFSClient: Slow waitForAckedSeqno took 2577ms (threshold=20ms)

There were also DFSClient complaints and exceptions, but nothing from the RegionServer and nothing related to the WAL.

Looking at the failed test: on the one hand, the leases on all the WALs were simply taken out from under the cluster. Let me make sure the failure is due to the stricter semantics and not to some other byproduct. Looking at it, we should be able to ride over the HDFS restart. Will be back.

> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3
>         Attachments: 14317.test.txt, 14317v10.txt, 14317v11.txt, 14317v12.txt, 14317v13.txt,
> 14317v5.branch-1.2.txt, 14317v5.txt, 14317v9.txt, HBASE-14317-v1.patch, HBASE-14317-v2.patch,
> HBASE-14317-v3.patch, HBASE-14317-v4.patch, HBASE-14317.patch, [Java] RS stuck on WAL sync
> to a dead DN - Pastebin.com.html, append-only-test.patch, raw.php, repro.txt, san_dump.txt,
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because we can't append (see HDFS-8960), but we get stuck. See the attached
> thread dump and associated log. What is interesting is that the syncers are waiting to take syncs
> to run, and at the same time we want to flush, so we are waiting on a safe point. But there seems
> to be nothing in our ring buffer. Did we go to roll the log and not add a safe point sync to clear
> out the ring buffer?
> Needs a bit of study. Try to reproduce.
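The question in the quoted description asks whether the log was rolled without adding a safe point sync to the ring buffer. That question turns on a handshake between the log roller and the thread draining the ring buffer. The following is a minimal, hypothetical Java model of such a handshake. It is not FSHLog's actual code. `SafePointSketch`, `SafePointMarker` and `roll` are illustrative names, and a `LinkedBlockingQueue` stands in for the real ring buffer. The model shows one thing: if the roller never enqueues a marker, nothing ever signals "safe point attained". A roller that waits without a timeout would therefore hang, much like the stuck roll described above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a WAL safe-point handshake; not HBase's FSHLog implementation. */
public class SafePointSketch {
  /** Marker the roller publishes into the queue to request a safe point. */
  static final class SafePointMarker {
    final CountDownLatch attained = new CountDownLatch(1); // consumer has drained everything before the marker
    final CountDownLatch released = new CountDownLatch(1); // roller is done swapping writers
  }

  // Stand-in for the ring buffer: appends/syncs and safe-point markers flow through it in order.
  private final BlockingQueue<Object> ringBuffer = new LinkedBlockingQueue<>();

  /** Starts the consumer: handles entries in order and parks on a marker until released. */
  void startConsumer() {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          Object entry = ringBuffer.take();
          if (entry instanceof SafePointMarker) {
            SafePointMarker m = (SafePointMarker) entry;
            m.attained.countDown(); // every earlier entry has been processed
            m.released.await();     // hold further processing while the roller works
          }
          // Ordinary append/sync entries would be handled here.
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "ring-buffer-consumer");
    t.setDaemon(true); // do not keep the JVM alive after main returns
    t.start();
  }

  /**
   * Roller side: waits (bounded by timeoutMs) for the consumer to reach the safe point.
   * With publishMarker == false nothing in the queue can trigger the handshake,
   * so the wait ends only by timing out.
   */
  boolean roll(boolean publishMarker, long timeoutMs) throws InterruptedException {
    SafePointMarker marker = new SafePointMarker();
    if (publishMarker) {
      ringBuffer.put(marker);
    }
    boolean reached = marker.attained.await(timeoutMs, TimeUnit.MILLISECONDS);
    // A real roller would swap in a new writer here when reached is true.
    // Always release so a late-arriving marker cannot park the consumer forever.
    marker.released.countDown();
    return reached;
  }

  public static void main(String[] args) throws InterruptedException {
    SafePointSketch wal = new SafePointSketch();
    wal.startConsumer();
    System.out.println("roll with marker:    " + wal.roll(true, 1000));
    System.out.println("roll without marker: " + wal.roll(false, 200));
  }
}
```

Running `main` prints `true` for the first roll. The second roll prints `false` once its 200 ms timeout expires. The timeout exists only so the sketch terminates. An unbounded wait on `attained` in the no-marker case would block indefinitely.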

This message was sent by Atlassian JIRA
