hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-644) DroppedSnapshotException but RegionServer doesn't restart
Date Mon, 26 May 2008 20:01:56 GMT

     [ https://issues.apache.org/jira/browse/HBASE-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-644:
------------------------

    Attachment: 644-0.1-v2.patch

Thanks for the review from vacation, Jim.

(hbase-env.sh edit was not supposed to be included.  Thanks)

This patch makes checkFileSystem call abort.  Previously it just set the abort and stop flags.
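
To make the change concrete, here is a minimal sketch of the idea, not the patch itself.  The class SketchServer, its fields, and the fsAvailable probe are hypothetical names invented for illustration; what abort does beyond setting the two flags (here, waking waiters) is also an assumption.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a region server; NOT the real HRegionServer.
class SketchServer {
    private final AtomicBoolean abortRequested = new AtomicBoolean(false);
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    // Assumed filesystem health probe, injected so the sketch is testable.
    private final BooleanSupplier fsAvailable;

    SketchServer(BooleanSupplier fsAvailable) {
        this.fsAvailable = fsAvailable;
    }

    // Single entry point for an emergency shutdown request.
    void abort() {
        abortRequested.set(true);
        stopRequested.set(true);
        // Assumption: abort also wakes any thread waiting on this object.
        synchronized (this) {
            notifyAll();
        }
    }

    // Returns true if the filesystem looks healthy; otherwise delegates
    // to abort() instead of setting the flags by hand.
    boolean checkFileSystem() {
        boolean ok = fsAvailable.getAsBoolean();
        if (!ok) {
            abort();
        }
        return ok;
    }

    boolean isAbortRequested() { return abortRequested.get(); }
    boolean isStopRequested() { return stopRequested.get(); }
}
```

Routing the failure through abort keeps every shutdown side effect in one place, so a later change to abort cannot be missed by the filesystem check.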

You point out that the server.stop in trunk refers to a different 'server'.  Do you think this
issue was introduced by a backport, then (blame says it was added in revision 651067, HBASE-572)?

Yeah, I think we thought there was a possibility of deadlock, but on review it seems the
only lock on the HRS Thread is in the sleeper, so that it can call Thread.sleep, so it seems
safe to call abort from anywhere (unless you can remember or point to a deadlock).  Version two
narrows the synchronized sections (abort does not need to be synchronized, for instance).
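
The reasoning about locks can be illustrated with a small sketch.  It is only an approximation of the pattern: SleeperSketch and its members are hypothetical, and it uses Object.wait/notifyAll on a private lock rather than Thread.sleep, so that an abort can cut the sleep short.  The point is that abort itself holds no lock while setting the flag, and the only lock involved is released while the sleeper waits.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sleeper; not the HBase Sleeper class.
class SleeperSketch {
    private final Object sleepLock = new Object();
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    // Sleeps up to millis (must be > 0), returning early if abort() is called.
    void sleep(long millis) {
        synchronized (sleepLock) {
            if (stopRequested.get()) {
                return; // abort already requested, do not wait at all
            }
            try {
                sleepLock.wait(millis); // releases sleepLock while waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
    }

    // Not synchronized: the flag is atomic, and the lock is taken only
    // briefly to wake the sleeper, so callers on any thread cannot deadlock here.
    void abort() {
        stopRequested.set(true);
        synchronized (sleepLock) {
            sleepLock.notifyAll();
        }
    }

    boolean isStopRequested() { return stopRequested.get(); }
}
```

Because the flag is set before the lock is taken in abort, a sleeper either sees the flag before waiting or is already waiting when notifyAll runs, so the wake-up is not lost.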

Let me know if I can commit.  Thanks.

> DroppedSnapshotException but RegionServer doesn't restart
> ---------------------------------------------------------
>
>                 Key: HBASE-644
>                 URL: https://issues.apache.org/jira/browse/HBASE-644
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.1.3, 0.2.0
>
>         Attachments: 644-0.1-v1.patch, 644-0.1-v2.patch
>
>
> RegionServer was carrying -ROOT- and having trouble writing to HDFS.  RegionServer judged
> that a flush failed and reported a DroppedSnapshotException.  Usually, the filesystem check
> would fail and set all the abort flags, but in this case the filesystem somehow returned
> healthy and the flags were not set.  The code path shut down the RPC only and then we exited
> the Flusher.  All the rest of the RegionServer stayed up and kept reporting to the master.
> The master thought it alive and kept trying to scan the unreachable -ROOT-.  Cluster was
> hosed until manual intervention 20 minutes later.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

