Date: Fri, 31 Jan 2014 15:34:47 +0000 (UTC)
From: "Feng Honghua (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-10296) Replace ZK with a consensus lib(paxos,zab or raft) running within master processes to provide better master failover performance and state consistency

    [ https://issues.apache.org/jira/browse/HBASE-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887841#comment-13887841 ]

Feng Honghua commented on HBASE-10296:
--------------------------------------

bq. Varying group membership and compacting the logs we'd have to contrib. I suppose we'd write the logs to the local filesystem if we want edits persisted.

# Yes, we need a snapshot feature provided by the lib and used by the app/user (here our HMaster) to reduce the number of log files, which would otherwise grow immensely over time (a bit like the motivation for flush in HBase). I assume you meant snapshot when you said 'compacting the logs'. The snapshot functionality should be a callback implemented by user logic, and after it completes the consensus lib removes the corresponding log files, right? :-)
# Agree that we'd write the logs to the local filesystem for persistence, no doubt here.
# Varying group membership has relatively lower priority than the others; it can safely be treated as a nice-to-have feature in the beginning, given that we almost always use a pre-configured, fixed set of machines as HMasters, right?

bq. We'd have no callback/notification mechanism so it will take a little more effort replacing the zk-based mechanism. We'd have to add being able to pass state messages for say when a region has opened on a regionserver or is closing...

I would propose replacing zk in an incremental fashion:

# For region assignment status info, we move it out of zk into the embedded in-memory consensus lib instance.
# zk can still serve as the central truth-holder storage for 'configuration'-like data such as replication info, since zk does its job well for that use scenario (we have analysed this more comprehensively in HBASE-1755 :-)).
# zk also remains the liveness monitor for regionservers (but not for HMaster health, which is now handled by the consensus lib instance itself) until we implement heartbeats directly between HMaster and regionservers.
# For region assignment status info, since HMaster and regionservers now talk directly via request/response messages once we use the in-memory consensus lib, it is natural that 'they are able to pass state messages for say when a region has opened on a regionserver or is closing'. A rough sketch of what such a state message and the snapshot callback could look like follows below.
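To make the above concrete: a minimal sketch, with all names hypothetical (none of these interfaces exist in HBase or in any particular Raft/ZAB/Paxos lib), of how the embedded consensus instance could expose a snapshot callback for log compaction and how a region-opened state message could be replicated into every master's in-memory state:

{code:java}
// Hypothetical interfaces only: sketching the shape of the embedded consensus lib,
// not any real HBase or Raft/ZAB/Paxos library API.
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Applied, in the same order, on every master replica. */
interface StateMachine {
  /** Apply one committed log entry to the in-memory state. */
  void apply(byte[] entry);
  /** Callback used by the lib to compact its logs: serialize the current state,
   *  after which the lib may delete the log entries the snapshot covers. */
  byte[] takeSnapshot();
  /** Rebuild the in-memory state from a snapshot on restart or when lagging. */
  void restoreSnapshot(byte[] snapshot);
}

/** The embedded consensus instance running inside each HMaster process. */
interface ConsensusLog {
  /** Replicate an entry; returns only after a quorum has committed it. */
  void propose(byte[] entry) throws InterruptedException;
}

/** Region assignment states kept identically in every master's memory. */
class RegionStateMachine implements StateMachine {
  private final Map<String, String> regionStates = new ConcurrentHashMap<>();

  @Override
  public void apply(byte[] entry) {
    // Entry format (illustrative only): "<regionName>=<state info>"
    String[] kv = new String(entry, StandardCharsets.UTF_8).split("=", 2);
    regionStates.put(kv[0], kv[1]);
  }

  @Override
  public byte[] takeSnapshot() {
    // A real implementation would use a compact, versioned serialization.
    StringBuilder sb = new StringBuilder();
    regionStates.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public void restoreSnapshot(byte[] snapshot) {
    regionStates.clear();
    for (String line : new String(snapshot, StandardCharsets.UTF_8).split("\n")) {
      if (line.isEmpty()) continue;
      String[] kv = line.split("=", 2);
      regionStates.put(kv[0], kv[1]);
    }
  }
}

/** On the active master: a regionserver reports OPENED directly (no zk hop),
 *  and the transition is made durable and visible to all standby masters. */
class AssignmentManagerSketch {
  private final ConsensusLog log;
  AssignmentManagerSketch(ConsensusLog log) { this.log = log; }

  void onRegionOpened(String regionName, String serverName) throws InterruptedException {
    log.propose((regionName + "=OPENED@" + serverName).getBytes(StandardCharsets.UTF_8));
    // Once propose() returns, every non-lagging master replica has applied the
    // same transition, so no one-time zk watches and no missed notifications.
  }
}
{code}

The point of propose() returning only after quorum commit is exactly the failover argument in the issue description: a standby master never needs to rebuild region states from zk or the meta-table when it takes over.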
> Replace ZK with a consensus lib(paxos,zab or raft) running within master processes to provide better master failover performance and state consistency
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10296
>                 URL: https://issues.apache.org/jira/browse/HBASE-10296
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: master, Region Assignment, regionserver
>            Reporter: Feng Honghua
>
> Currently the master relies on ZK to elect the active master, monitor liveness and store almost all of its state, such as region states, table info, replication info and so on. zk also acts as a channel for master-regionserver communication (such as region assignment) and client-regionserver communication (such as replication state/behavior changes).
> But zk as a communication channel is fragile due to its one-time watches and asynchronous notification mechanism, which together can lead to missed events (hence missed messages); for example, the master must rely on the idempotence of the state transition logic to keep the region assignment state machine correct. Almost all of the trickiest inconsistency issues can be traced back to the fragility of zk as a communication channel.
> Replacing zk with paxos running within the master processes has the following benefits:
> 1. Better master failover performance: all masters, the active one and the standbys, have the same latest state in memory (except lagging ones, which can eventually catch up). Whenever the active master dies, the newly elected active master can immediately play its role without failover work such as rebuilding its in-memory state by consulting the meta-table and zk.
> 2. Better state consistency: the masters' in-memory state is the only truth about the system, which eliminates inconsistency from the very beginning. And although the state is held by all masters, paxos guarantees the copies are identical at any time.
> 3. A more direct and simple communication pattern: clients change state by sending requests to the master, and master and regionservers talk directly to each other via request and response... none of them needs to go through a third-party store like zk, which introduces more uncertainty, worse latency and more complexity.
> 4. zk is then only used as liveness monitoring for determining whether a regionserver is dead, and later on we can eliminate zk entirely once we build heartbeats between master and regionservers.
> I know this might look like a very crazy re-architecture, but it deserves deep thinking and serious discussion, right?
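For benefit 1 in the description above, becoming active is then just a role change, not a state rebuild. A minimal sketch, again with purely hypothetical names (a leadership callback of the kind a Raft/ZAB-style lib could expose), of what master failover could reduce to:

{code:java}
// Hypothetical callback only: sketching how failover becomes a pure role switch
// when every master already holds the latest states in memory.
interface LeadershipListener {
  void onBecomeLeader(long lastAppliedLogIndex);
  void onLoseLeadership();
}

class MasterFailoverSketch implements LeadershipListener {
  private volatile boolean active = false;

  @Override
  public void onBecomeLeader(long lastAppliedLogIndex) {
    // No scan of the meta-table or zk here: the in-memory region states are
    // already caught up to lastAppliedLogIndex, so becoming active is just a
    // role flip followed by accepting client/regionserver requests.
    active = true;
  }

  @Override
  public void onLoseLeadership() {
    // Step down; a standby keeps applying committed entries so it stays ready
    // to take over at any time.
    active = false;
  }

  boolean isActive() { return active; }
}
{code}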