hbase-dev mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Replication hosed after simple cluster restart
Date Thu, 14 Mar 2013 01:52:01 GMT
No, this is all RSs trying to copy the failed RS' queues over. Most of them will fail, but they
still report all the queues that they *tried* to move. For each of those a ReplicationSource is
set up, which then fails when attempting to update the replication status (because its ZK
nodes were never created), and that causes the RS to abort (because the RS is set up as abortable
for the zkHelper used here, which might actually be yet another problem).

-- Lars

 From: Stack <stack@duboce.net>
To: HBase Dev List <dev@hbase.apache.org> 
Sent: Wednesday, March 13, 2013 6:43 PM
Subject: Re: Replication hosed after simple cluster restart
Not sure I follow.  Is this our making use of multi against a zk ensemble
that doesn't support it?
On Mar 13, 2013 6:22 PM, "lars hofhansl" <larsh@apache.org> wrote:

> I suppose the problem could be in
> zkHelper.copyQueuesFromRSUsingMulti(rsZnode) as called from
> ReplicationSourceManager.NodeFailoverWorker.run().
> copyQueuesFromRSUsingMulti will return the queues it read even when the
> multi operation failed (because another RS managed to execute it first).
> -- Lars
> ________________________________
>  From: lars hofhansl <larsh@apache.org>
> To: hbase-dev <dev@hbase.apache.org>
> Sent: Wednesday, March 13, 2013 6:12 PM
> Subject: Replication hosed after simple cluster restart
> We just ran into an interesting scenario. We restarted a cluster that was
> set up as a replication source.
> The stop went cleanly.
> Upon restart *all* regionservers aborted within a few seconds with
> variations of these errors:
> http://pastebin.com/3iQVuBqS
> This is scary!
> -- Lars
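
The race described in this thread can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class, the method names, and the map standing in for the ZooKeeper subtree are all hypothetical, and none of it is HBase or ZooKeeper API. It shows several surviving region servers reading a dead server's queues, only one of them winning the atomic move, and how many replication sources each would start depending on whether it reports the queues it *tried* to move or only the ones it actually claimed. The second variant is one possible way to avoid the problem, not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the failover race; all names here are hypothetical. */
class QueueFailoverSketch {
    // Stand-in for the ZK subtree: dead RS znode -> its replication queues.
    static final Map<String, List<String>> queuesByDeadRs = new HashMap<>();

    /** Each survivor first reads the queues that belong to the dead RS. */
    static List<String> readQueues(String deadRs) {
        List<String> found = queuesByDeadRs.getOrDefault(deadRs, Collections.<String>emptyList());
        return new ArrayList<>(found);
    }

    /** Stand-in for the atomic multi: only the first caller removes the queues and wins. */
    static boolean atomicMove(String deadRs) {
        return queuesByDeadRs.remove(deadRs) != null;
    }

    /** Behaviour described in the thread: the queues read are returned even if the move failed. */
    static List<String> finishReportingAttempts(String deadRs, List<String> read) {
        atomicMove(deadRs); // result ignored
        return read;
    }

    /** Possible alternative (an assumption): report only queues that were actually claimed. */
    static List<String> finishReportingClaims(String deadRs, List<String> read) {
        return atomicMove(deadRs) ? read : Collections.<String>emptyList();
    }

    /** All survivors read first, then attempt the move; returns sources started per survivor. */
    static int[] simulate(int survivors, boolean reportOnlyClaims) {
        queuesByDeadRs.clear();
        queuesByDeadRs.put("rs-dead", Arrays.asList("peer1", "peer2"));

        List<List<String>> reads = new ArrayList<>();
        for (int i = 0; i < survivors; i++) {
            reads.add(readQueues("rs-dead"));
        }

        int[] sourcesStarted = new int[survivors];
        for (int i = 0; i < survivors; i++) {
            List<String> reported = reportOnlyClaims
                    ? finishReportingClaims("rs-dead", reads.get(i))
                    : finishReportingAttempts("rs-dead", reads.get(i));
            sourcesStarted[i] = reported.size(); // one source per reported queue
        }
        return sourcesStarted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(3, false))); // prints [2, 2, 2]
        System.out.println(Arrays.toString(simulate(3, true)));  // prints [2, 0, 0]
    }
}
```

In the first run every survivor starts sources for queues it never owned, which matches the situation where the status update later fails because the ZK nodes were never created. In the second run only the winner of the move starts sources.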