hbase-issues mailing list archives

From "churro morales (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-12814) Zero downtime upgrade from 94 to 98 with replication
Date Wed, 07 Jan 2015 01:04:34 GMT
churro morales created HBASE-12814:

             Summary: Zero downtime upgrade from 94 to 98 with replication
                 Key: HBASE-12814
                 URL: https://issues.apache.org/jira/browse/HBASE-12814
             Project: HBase
          Issue Type: New Feature
    Affects Versions: 0.94.26, 0.98.10
            Reporter: churro morales
            Assignee: churro morales

Here at Flurry we want to upgrade our HBase cluster from 94 to 98 without any downtime
while maintaining master / master replication.

Replication is done via thrift RPC between clusters and is configurable on a peer-by-peer
basis.  The one caveat is that a thrift server starts up on every node and proxies incoming
requests to the ReplicationSink.

For the upgrade process:
* in hbase-site.xml two required configuration parameters are added, along with three optional ones:
** *Required*
*** hbase.replication.sink.enable.thrift -> true
*** hbase.replication.thrift.server.port -> <thrift_server_port>
** *Optional*
*** hbase.replication.thrift.protection {default: AUTHENTICATION}
*** hbase.replication.thrift.framed {default: false}
*** hbase.replication.thrift.compact {default: true}
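
As a rough sketch, the settings listed above might look like this in hbase-site.xml. The property names come from the list; the port value 9091 is a hypothetical placeholder, and the three optional properties are shown at their stated defaults, so they could be left out.

```xml
<!-- Required: enable the thrift replication sink and choose its port. -->
<property>
  <name>hbase.replication.sink.enable.thrift</name>
  <value>true</value>
</property>
<property>
  <name>hbase.replication.thrift.server.port</name>
  <!-- 9091 is a hypothetical placeholder; use your own thrift server port. -->
  <value>9091</value>
</property>

<!-- Optional: shown here with the defaults stated in this issue. -->
<property>
  <name>hbase.replication.thrift.protection</name>
  <value>AUTHENTICATION</value>
</property>
<property>
  <name>hbase.replication.thrift.framed</name>
  <value>false</value>
</property>
<property>
  <name>hbase.replication.thrift.compact</name>
  <value>true</value>
</property>
```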

- All regionservers can be rolling restarted (no downtime), but every cluster involved must have
the respective patch for this to work.
- the hbase shell add_peer command takes an additional parameter for rpc protocol
- example: {code} add_peer '1', "hbase-101:2181:/hbase", "THRIFT" {code}

Now comes the fun part.  When you want to upgrade a cluster from 94 to 98, you simply pause
replication to the cluster being upgraded, do the upgrade, and un-pause replication.  Once
you have a pair of clusters whose inbound and outbound replication is only with clusters on
the 98 release, you can start replicating via the native rpc protocol by adding the peer again
without the _THRIFT_ parameter and subsequently deleting the peer that uses the thrift protocol.
Because replication is idempotent, I don't see any issues as long as you wait for the backlog
to drain after un-pausing.
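
The sequence above can be sketched as a small Ruby helper that prints the HBase shell commands in order. This is an illustration under assumptions, not part of the patch. It assumes that "pause" and "un-pause" map to the shell's existing disable_peer and enable_peer commands, and that the native peer is added under a new id because both peers exist briefly. The helper name cutover_commands and the peer id '2' are hypothetical; the cluster key is taken from the add_peer example.

```ruby
# Sketch only: builds the HBase shell commands for the thrift-to-native cutover.
# Peer ids and the cluster key are placeholders; adjust them for your clusters.
def cutover_commands(thrift_peer_id:, native_peer_id:, cluster_key:)
  [
    # 1. Pause replication to the cluster that is about to be upgraded.
    "disable_peer '#{thrift_peer_id}'",
    # 2. (Upgrade the cluster from 94 to 98 here.) Then un-pause and
    #    wait for the replication backlog to drain before continuing.
    "enable_peer '#{thrift_peer_id}'",
    # 3. Once both sides run 98, add the same cluster again without "THRIFT"
    #    so that it replicates over the native rpc protocol.
    "add_peer '#{native_peer_id}', \"#{cluster_key}\"",
    # 4. Delete the peer that still uses the thrift protocol.
    "remove_peer '#{thrift_peer_id}'"
  ]
end

puts cutover_commands(thrift_peer_id: '1',
                      native_peer_id: '2',
                      cluster_key: 'hbase-101:2181:/hbase')
```

The printed lines are meant to be run one at a time in the hbase shell on the cluster that replicates to the upgraded one, not pasted in as a batch, since the upgrade and the wait for the backlog happen between steps.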

Special thanks to Francis Liu at Yahoo for laying the groundwork and Mr. Dave Latham for his
invaluable knowledge and assistance.  

This message was sent by Atlassian JIRA