hbase-issues mailing list archives

From "Feng Honghua (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10296) Replace ZK with a consensus lib(paxos,zab or raft) running within master processes to provide better master failover performance and state consistency
Date Sat, 15 Feb 2014 04:27:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902322#comment-13902322
] 

Feng Honghua commented on HBASE-10296:
--------------------------------------

bq.ZK is not good enough, but do it by your own will make things worse.
Could you list the detailed reasons behind this statement? By 'will make things worse', do
you mean the coding complexity and correctness risk of implementing our own consensus lib,
or something else? :-)
bq.The only real problem I can see is that ZK is not strong consistent.
ZK itself should be strongly consistent, right? The inconsistency problem in HMaster comes
from how we use ZK to maintain state-machine logic (especially the assign state machine):
process A changes a znode, process B watches that znode and then reads the znode value to
trigger its own state machine. The data/states we put in ZK are still consistent,
right?
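
A minimal sketch of why the 'change a znode / watch it / re-read it' pattern can skip
intermediate states. It is a toy in-memory model, not the ZooKeeper client API: `ToyZnode`,
`run_scenario` and the state names are assumptions made for illustration only.

```python
class ToyZnode:
    """In-memory stand-in for a znode with one-shot watches (hypothetical, not the ZK API)."""

    def __init__(self, value):
        self.value = value
        self._watchers = []

    def get(self, watch=None):
        # A read may register a watch; each registered watch fires at most once.
        if watch is not None:
            self._watchers.append(watch)
        return self.value

    def set(self, value):
        self.value = value
        # One-shot semantics: this change consumes every registered watch.
        fired, self._watchers = self._watchers, []
        return fired  # returned instead of called, to model asynchronous delivery


def run_scenario():
    znode = ToyZnode("OFFLINE")
    seen_by_b = []   # every value process B actually observes
    pending = []     # notifications still in flight to process B

    def on_change():
        # B reacts by re-reading the znode and re-registering its watch.
        seen_by_b.append(znode.get(watch=on_change))

    seen_by_b.append(znode.get(watch=on_change))  # B's initial read

    # Process A makes three changes before B's first notification is delivered.
    for state in ("PENDING_OPEN", "OPENING", "OPENED"):
        pending.extend(znode.set(state))

    for notify in pending:
        notify()
    return seen_by_b
```

In this model B observes only the initial and the final value, because a single one-shot
notification covers all three changes. The znode itself is never inconsistent; the
intermediate transitions are simply never seen by the watcher.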
bq.This can be done with the existed API (but performance is much inefficient than chubby).
Actually, if we make HMaster the arbitrator so that only HMaster can write to ZK, ZK acts as
the only truth holder: regionservers can't write/update the states directly in ZK but talk
to HMaster, and HMaster updates ZK for them. This way the current inconsistency issue of
HMaster can be remarkably alleviated. But keeping ZK and HMaster's in-memory data consistent
still needs careful handling...
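
The single-writer idea could look roughly like the sketch below. Everything in it is
hypothetical: `ToyMaster`, `report_transition`, the `ALLOWED` transition table and the plain
dict standing in for ZK are illustrative assumptions, not HBase or ZooKeeper code.

```python
# Illustrative transition table; the real assignment state machine is richer.
ALLOWED = {
    ("OFFLINE", "OPENING"),
    ("OPENING", "OPENED"),
    ("OPENED", "CLOSED"),
    ("CLOSED", "OFFLINE"),
}


class ToyMaster:
    """The only component allowed to write to the store (a dict standing in for ZK)."""

    def __init__(self, store):
        self.store = store            # source of truth
        self.in_memory = dict(store)  # master's cache, rebuilt from the store on startup

    def report_transition(self, region, new_state):
        """Called on behalf of a regionserver, which never touches the store itself."""
        current = self.in_memory.get(region, "OFFLINE")
        if (current, new_state) not in ALLOWED:
            return False  # reject stale or out-of-order reports
        self.store[region] = new_state      # persist first ...
        self.in_memory[region] = new_state  # ... then update the in-memory copy
        return True


store = {}
master = ToyMaster(store)
results = [
    master.report_transition("r1", "OPENING"),
    master.report_transition("r1", "OPENED"),
    master.report_transition("r1", "OPENING"),  # stale report, rejected
]
recovered = ToyMaster(store)  # a new master rebuilds its view from the store
```

Persisting before updating the cache is one possible ordering for the 'careful handling'
mentioned above: if the master dies between the two steps, the store still holds the
latest accepted state and a new master can rebuild its in-memory view from it.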

> Replace ZK with a consensus lib(paxos,zab or raft) running within master processes to
provide better master failover performance and state consistency
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10296
>                 URL: https://issues.apache.org/jira/browse/HBASE-10296
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: master, Region Assignment, regionserver
>            Reporter: Feng Honghua
>
> Currently master relies on ZK to elect the active master, monitor liveness and store almost
all of its states, such as region states, table info, replication info and so on. ZK also
serves as a channel for master-regionserver communication (such as in region assigning) and
client-regionserver communication (such as replication state/behavior change).
> But zk as a communication channel is fragile due to its one-time watch and asynchronous
notification mechanism, which together can lead to missed events (hence missed messages). For
example, the master must rely on the idempotence of the state transition logic to keep the
region assigning state machine correct. In fact, almost all of the trickiest inconsistency
issues can be traced back to the fragility of zk as a communication channel.
> Replacing zk with paxos running within master processes has the following benefits:
> 1. better master failover performance: all masters, whether active or standby, have the
same latest states in memory (except lagging ones, which can eventually catch up later on).
Whenever the active master dies, the newly elected active master can immediately play
its role without failover work such as building its in-memory states by consulting the
meta table and zk.
> 2. better state consistency: master's in-memory states are the only truth about the system,
which can eliminate inconsistency from the very beginning. And though the states are held by
all masters, paxos guarantees they stay identical (apart from lagging ones, as noted above).
> 3. more direct and simple communication pattern: client changes state by sending requests
to master, and master and regionserver talk directly to each other by sending requests and
responses. None of them needs a third-party storage like zk, which can introduce more
uncertainty, worse latency and more complexity.
> 4. zk would be used only for liveness monitoring, to determine whether a regionserver is
dead, and later on we can eliminate zk totally once we build heartbeat between master and
regionserver.
> I know this might look like a very crazy re-architecture, but it deserves deep thinking
and serious discussion, right?
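
A rough sketch of the replicated-state idea behind benefits 1 and 2 in the description
above. It assumes some consensus protocol has already produced an agreed, ordered log of
state changes; the agreement step itself is not shown, and `ToyMasterReplica` and `catch_up`
are hypothetical names.

```python
class ToyMasterReplica:
    """A master replica that applies an already-agreed log in order (illustrative only)."""

    def __init__(self):
        self.region_states = {}
        self.applied = 0  # number of committed log entries applied so far

    def catch_up(self, committed_log):
        # Apply only the entries this replica has not seen yet, in log order.
        for region, state in committed_log[self.applied:]:
            self.region_states[region] = state
            self.applied += 1


# The committed log is assumed to come from paxos/zab/raft.
log = [("r1", "OPENING"), ("r1", "OPENED"), ("r2", "OPENING")]

active = ToyMasterReplica()
standby = ToyMasterReplica()

active.catch_up(log)
standby.catch_up(log[:1])  # the standby is lagging behind
lagging_view = dict(standby.region_states)

standby.catch_up(log)      # once caught up, it holds the same state as the active master
```

Because every replica applies the same entries in the same order, a standby that has
caught up needs no rebuilding from the meta table before it takes over.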



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
