Date: Sat, 18 Jan 2014 00:55:21 +0000 (UTC)
From: "Devaraj Das (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-10070) HBase read high-availability using eventually consistent region replicas

    [ https://issues.apache.org/jira/browse/HBASE-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875452#comment-13875452 ]

Devaraj Das commented on HBASE-10070:
-------------------------------------

bq. Ok on the timing. You know how I feel about 1.0 – sooner rather than later – but hopefully this feature gets done in time.

Yeah.. a couple of us are on it.

bq. After thinking more on this, I 'get' why you have the replicas listed inside the row rather than as rows themselves [in hbase:meta]. The row in hbase:meta becomes a proxy or facade for the little cluster of regions, one of which is the primary with the others read replicas.

That's great. A copy-paste of what I said in the RB on HBASE-10347, for others' reference:

"Enis and I had debated this as well. The consensus between us was that we don't need to add new META rows for the replicas. After all, the HRI information is exactly the same for all the replicas except for the replicaID. In the current meta, we already have a column for the location of a region. It seemed logical to just extend that model - add newer columns for the replica locations (and similarly for the other columns like seqnum). That way everything for a particular user-visible region stays in one row (and it makes it easier for readers to learn all the replica locations from that one row). Regarding special casing, yes, there is some special casing in the way the regions are added to the meta - create table will create all the regions (if the table was created with replica count > 1), but only the primary regions will be added to the meta. The regionserver, when it updates the meta with the location after it opens a region, invokes the API passing the replicaID as an argument; the column names differ based on whether the replicaID is the primary's or not. These are pretty much the special cases for the meta updates."
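To make the single-row layout described above concrete, here is a minimal sketch of a meta update under that scheme. The suffixed qualifiers (server_0001, seqnumDuringOpen_0001, ...) and the helper names are assumptions for illustration only, not necessarily what the HBASE-10347 patch writes; the point is just that replica 0 keeps the existing column names while higher replica IDs get their own columns in the same row.

{code:java}
// Illustrative only: the per-replica qualifier suffix scheme below is an
// assumption, not the actual naming used by the HBASE-10347 patch.
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaReplicaLayoutSketch {

  private static final byte[] INFO = Bytes.toBytes("info");

  /** Location column for a replica; the primary (replica 0) keeps the
   *  existing unsuffixed qualifier. */
  static byte[] serverColumn(int replicaId) {
    return replicaId == 0
        ? Bytes.toBytes("server")
        : Bytes.toBytes("server_" + String.format("%04d", replicaId));
  }

  /** Seqnum column for a replica, following the same suffix convention. */
  static byte[] seqnumColumn(int replicaId) {
    return replicaId == 0
        ? Bytes.toBytes("seqnumDuringOpen")
        : Bytes.toBytes("seqnumDuringOpen_" + String.format("%04d", replicaId));
  }

  /** One meta row per user-visible region: a location/seqnum column pair
   *  per replica, all written into the same row. */
  static Put metaRowFor(byte[] regionName, String[] serverNames, long[] seqNums) {
    Put put = new Put(regionName);
    for (int replicaId = 0; replicaId < serverNames.length; replicaId++) {
      put.add(INFO, serverColumn(replicaId), Bytes.toBytes(serverNames[replicaId]));
      put.add(INFO, seqnumColumn(replicaId), Bytes.toBytes(seqNums[replicaId]));
    }
    return put;
  }
}
{code}

With a layout like this, a reader that fetches the one meta row learns every replica's location in a single lookup, which is the property the paragraph above is after.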
bq. HRegionInfo now is overloaded. Before it was the info on a specific region. Now it is trying to serve two purposes; its original intent and now too as a descriptor on the region-serving 'cluster' made of a primary and replicas. Let's avoid overloading what up to now has had a clear role in the hbase model.

By doing it the way we have in the patch on HBASE-10347, it seems to reflect what's going on - "HRI is a logical descriptor and a facade for a bunch of primary & replicas". That's how we store things in the meta and how we reconstruct HRIs from the meta when needed.

There are possibly other approaches to doing this. For example, extend HRegionInfo as, say, HRegionInfoReplica and maintain the information about the replicaID there, and/or change all the relevant methods to accept HRegionInfoReplica and potentially return it as well in relevant situations. The issue is that those approaches would be very intrusive, and we would still need special cases for whether replicaID == 0 or not. I am not confident how much we would gain there. Is it too much to ask to change the view of what an HRI means (to what you say above)? Anyway, let me ponder a bit on this...

bq. The primary holds the 'pole position' being the name of the region in meta. The read replicas are differently named with the 00001 and 00002, etc., interpolated into the middle of the region name. I suppose doing it this way 'minimizes' the disturbance in the code base but I'm worried this naming exception will only confuse though it minimizes change. Why would the primary not be named like the replica regions?

I don't mind naming the primary regions similarly to the replicas. This might mean tools that currently depend on the name format would break even if the cluster is not deploying tables with replicas (you guessed that response :-)). But yeah, if you go the full Paxos route, the 'primary' could be any member of the replica-set, and there it makes sense for all members of the set to have an index.
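To make the alternative discussed above concrete, here is a rough sketch of a wrapper that carries the replicaID outside HRegionInfo. The class, its methods, and the _000N name suffix are hypothetical, not code from the HBASE-10347 patch; it mainly shows why the replicaID == 0 special case survives either way - something still has to decide whether the primary gets the plain region name or a suffixed one.

{code:java}
// Hypothetical sketch, not from any patch: carries the replica ID in a thin
// wrapper instead of inside HRegionInfo itself.
import org.apache.hadoop.hbase.HRegionInfo;

public class HRegionInfoReplica {

  public static final int PRIMARY_REPLICA_ID = 0;

  private final HRegionInfo regionInfo; // the logical region, shared by all replicas
  private final int replicaId;          // 0 = primary, 1..n = read replicas

  public HRegionInfoReplica(HRegionInfo regionInfo, int replicaId) {
    this.regionInfo = regionInfo;
    this.replicaId = replicaId;
  }

  public HRegionInfo getRegionInfo() { return regionInfo; }

  public int getReplicaId() { return replicaId; }

  public boolean isPrimary() { return replicaId == PRIMARY_REPLICA_ID; }

  /** Replica-qualified name, e.g. "<encodedName>_0001". The primary keeps the
   *  plain encoded name - exactly the naming special case discussed above. */
  public String getReplicaQualifiedName() {
    return isPrimary()
        ? regionInfo.getEncodedName()
        : regionInfo.getEncodedName() + "_" + String.format("%04d", replicaId);
  }
}
{code}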
> HBase read high-availability using eventually consistent region replicas
> -------------------------------------------------------------------------
>
>                 Key: HBASE-10070
>                 URL: https://issues.apache.org/jira/browse/HBASE-10070
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: HighAvailabilityDesignforreadsApachedoc.pdf
>
>
> In the present HBase architecture, it is hard, probably impossible, to satisfy constraints like "the 99th percentile of reads will be served under 10 ms". One of the major factors that affects this is the MTTR for regions. There are three phases in the MTTR process - detection, assignment, and recovery. Of these, detection is usually the longest and is presently on the order of 20-30 seconds. During this time, clients are not able to read the region data.
> However, some clients will be better served if regions are available for eventually consistent reads during recovery. This will help satisfy low-latency guarantees for the class of applications that can work with stale reads.
> For improving read availability, we propose a replicated read-only region serving design, also referred to as secondary regions, or region shadows. Extending the current model of a region being opened for reads and writes in a single region server, the region will also be opened for reading in other region servers. The region server which hosts the region for reads and writes (as in the current case) will be declared PRIMARY, while 0 or more region servers might be hosting the region as SECONDARY. There may be more than one secondary (replica count > 2).
> Will attach a design doc shortly which contains most of the details and some thoughts about development approaches. Reviews are more than welcome.
> We also have a proof of concept patch, which includes the master and region server side of changes. Client side changes will be coming soon as well.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)