From: Ian Varley <ivarley@salesforce.com>
To: user@hbase.apache.org
Date: Fri, 2 Dec 2011 13:42:18 -0800
Subject: Re: HBase and Consistency in CAP

The simple answer is that HBase isn't architected such that 2 region servers can simultaneously host the same region. In addition to being much simpler from an architecture point of view, that also allows for user-facing features that would be difficult or impossible to achieve otherwise: single-row put atomicity, atomic check-and-set operations, atomic increment operations, etc.--things that are only possible if you know for sure that exactly one machine is in control of the row.

Ian
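For concreteness, here is a minimal sketch of those single-row operations using the Java client API of that era (HTable); the table, family, and column names below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicOpsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");  // hypothetical table

        // Single-row put: all cells in one Put land atomically, because
        // exactly one region server owns the row.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v1"));
        table.put(put);

        // Atomic check-and-set: the update is applied only if the current
        // value still matches the expected one.
        Put update = new Put(Bytes.toBytes("row1"));
        update.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v2"));
        boolean applied = table.checkAndPut(Bytes.toBytes("row1"),
            Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v1"), update);

        // Atomic increment of a counter column.
        long hits = table.incrementColumnValue(Bytes.toBytes("row1"),
            Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);

        System.out.println("checkAndPut applied: " + applied + ", hits: " + hits);
        table.close();
      }
    }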
On Dec 2, 2011, at 2:54 PM, Mohit Anchlia wrote:

Thanks for the overview. It's helpful. Can you also help me understand why 2 region servers for the same row keys can't be running on the nodes where the blocks are being replicated? I am assuming all the logs/HFiles etc. are already being replicated, so if one region server fails the other region server is still taking reads/writes.

On Fri, Dec 2, 2011 at 12:15 PM, Ian Varley wrote:

Mohit,

Yeah, those are great places to go and learn.

To fill in a bit more on this topic: "partition-tolerance" usually refers to the idea that you could have a complete disconnection between N sets of machines in your data center, but still be taking writes and serving reads from all the servers. Some "NoSQL" databases can do this (to a degree), but HBase cannot; the master and ZK quorum must be accessible from any machine that's up and running the cluster.

Individual machines can go down, as J-D said, and the master will reassign those regions to another region server. So, imagine you had a network switch fail that disconnected 10 machines in a 20-machine cluster; you wouldn't have 2 baby 10-machine clusters, like you might with some other software; you'd just have 10 machines "down" (and probably a significant interruption while the master replays logs on the remaining 10). That would also require that the underlying HDFS cluster (assuming it's on the same machines) was keeping replicas of the blocks on different racks (which it does by default), otherwise there's no hope.

HBase makes this trade-off intentionally, because in real-world scenarios there aren't too many cases where a true network partition would be survived by the rest of your stack, either (e.g. imagine a case where application servers can't access a relational database server because of a partition; you're just down). The focus of HBase fault tolerance is recovering from isolated machine failures, not the collapse of your infrastructure.

Ian

On Dec 2, 2011, at 2:03 PM, Jean-Daniel Cryans wrote:

Get the HBase book:
http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100

And/Or read the Bigtable paper.

J-D

On Fri, Dec 2, 2011 at 12:01 PM, Mohit Anchlia wrote:

Where can I read more on this specific subject? Based on your answer I have more questions, but I want to read more specific information about how it works and why it's designed that way.

On Fri, Dec 2, 2011 at 11:59 AM, Jean-Daniel Cryans wrote:

No, data is only served by one region server (even if it resides on multiple data nodes). If it dies, clients need to wait for the log replay and region reassignment.

J-D

On Fri, Dec 2, 2011 at 11:57 AM, Mohit Anchlia wrote:

Why is HBase considered high in consistency, and why does it give up partition tolerance? My understanding is that the failure of one data node still doesn't impact clients, as they would re-adjust the list of available data nodes.
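On J-D's point above that clients need to wait out the log replay and region reassignment: here is a minimal sketch of the client-side settings that govern that wait, again with a hypothetical table and made-up hostnames and values:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RetryExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client must be able to reach the ZK quorum to locate regions;
        // the hostnames here are made up.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        // How many times, and how far apart, the client retries while a
        // failed region server's regions are replayed and reassigned.
        conf.setInt("hbase.client.retries.number", 10);
        conf.setLong("hbase.client.pause", 1000);  // milliseconds between retries

        HTable table = new HTable(conf, "mytable");  // hypothetical table
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("value: " + Bytes.toString(
            r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
        table.close();
      }
    }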