hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting
Date Thu, 15 Jul 2010 23:54:49 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2707:
-------------------------

    Attachment: 2707-v3.txt

Latest iteration.  It turns out what is currently in place is broken.  More on this later.

> Can't recover from a dead ROOT server if any exceptions happens during log splitting
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-2707
>                 URL: https://issues.apache.org/jira/browse/HBASE-2707
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jean-Daniel Cryans
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: 2707-0.20.txt, 2707-test.txt, 2707-v3.txt, HBASE-2707.patch
>
>
> There's an all-too-easy way to get stuck after a RS holding ROOT dies, usually from a GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN  [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
>         at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
>         at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
> The last lines are my own debug. Since we don't process the delayed todo if ROOT isn't online, we'll never reassign the regions.
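The circular dependency the report describes can be sketched in miniature: the queued ProcessServerShutdown is the only thing that would bring -ROOT- back online, yet the queue is only drained when -ROOT- is online. This is a hypothetical simplification for illustration only; the class, field, and method names below are invented and are not HBase's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the delayed-todo deadlock: the op that would restore
// -ROOT- waits in a queue that is never drained while -ROOT- is offline.
public class DelayedTodoDeadlock {
    static boolean rootOnline = false;
    static Queue<String> delayedToDo = new ArrayDeque<>();

    static void processDelayedToDo() {
        if (!rootOnline) {
            // Mirrors the repeated DEBUG log line from the report.
            System.out.println("-ROOT- isn't online, can't process delayedToDoQueue items");
            return;
        }
        String op = delayedToDo.poll();
        if ("ProcessServerShutdown".equals(op)) {
            rootOnline = true; // reassigning the dead RS's regions would bring -ROOT- back
        }
    }

    public static void main(String[] args) {
        // The shutdown op failed once (log-split IOException) and was re-queued.
        delayedToDo.add("ProcessServerShutdown");
        for (int i = 0; i < 3; i++) {
            processDelayedToDo(); // never makes progress: circular dependency
        }
        System.out.println("queue still holds " + delayedToDo.size() + " op(s)");
    }
}
```

Running this prints the "can't process" line on every pass and the queue never empties, which is exactly the livelock the quoted DEBUG lines show.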

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

