From: Jonathan Hsieh
Date: Mon, 2 Dec 2013 22:00:44 -0800
Subject: Re: [Shadow Regions / Read Replicas] Block Affinity
To: dev@hbase.apache.org

> Enis:
> I was trying to refer to not having co-location constraints for secondary
> replicas whose primaries are hosted by the same RS. For example, if
> R1(replica=0) and R2(replica=0) are hosted on RS1, then R1(replica=1) and
> R2(replica=1) can be hosted by RS2 and RS3 respectively. This can
> definitely use the hdfs block affinity work though.

This particular example doesn't have enough detail to tease out the different ideal situations. Hopefully this will help:

We have RS-A hosting regions X and Y. With affinity groups, let's say RS-A's logs are written to RS-A, RS-L, and RS-M. Let's also say that X is written to RS-A, RS-B, and RS-C, and Y to RS-A, RS-D, and RS-E.

>> Jon:
>> However, I don't think we get into a situation where all RSs must read
>> all other RSs' logs – we only need the shadow RSs to read the primary
>> RS's log.

> Enis:
> I am assuming a random distribution of secondary regions per above. In
> this case, for replication=2, a region server will have half of its
> regions in primary mode and the other half in secondary mode. For all
> the regions in secondary mode, it has to tail the logs of the RSs where
> the primaries are hosted. However, since there is no co-location
> guarantee, the primaries are also randomly distributed.
> For n secondary regions and m region servers, you will have to tail the
> logs of most of the RSs if n > m, with high probability (I do not have
> the smarts to calculate the exact probability).

For high-availability stale-read replicas (read replicas), it seems best to assign the secondary regions to the RSs where the HFiles are hosted. This approach would assign the replicas like this (this is the "random distribution of secondary regions"):

* X on RS-A (rep=0), RS-B (rep=1), and RS-C (rep=2); and
* Y on RS-A (rep=0), RS-D (rep=1), and RS-E (rep=2).

For the most efficient consistent read-recovery (shadow regions/memstores), it would make sense to assign them to the RSs where the HLogs are local. This approach would assign shadow regions for regions X and Y to RS-L and RS-M.

A simple solution, optimal for both read replicas and shadow regions, would be to assign the regions and the HLog to the same set of machines, so that the RSs hosting the logs and regions X and Y are the same machines -- let's say RS-A, RS-H, and RS-I. This has some non-optimal balancing ramifications upon machine failure -- the work of RS-A would be split between only RS-H and RS-I.

A more complex solution for both would be to choose machines for the purpose they are best suited for: read replicas are hosted on the RSs holding their HFiles, and shadow region memstores on the HLogs' RSs. Promotion becomes a more complicated dance: upon RS-A's failure, we have the log-tailing shadow regions catch up and then flush the affected memstores to the appropriate hdfs affinity group/favored nodes. So the shadow memstore for region X would flush its hfile to A, B, and C, and region Y's to A, D, and E. Then the read replicas would be promoted (close secondary, open as primary) based on the regions'/hfiles' affinity group. This feels like an optimization done on the 2nd or 3rd rev.
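As a back-of-the-envelope check on Enis's log-tailing claim above, here is a small sketch (not from the thread; the cluster sizes are illustrative assumptions). With n secondary regions whose primaries are placed uniformly at random across m region servers, the expected number of distinct primary RSs (and hence distinct logs to tail) is m * (1 - (1 - 1/m)^n):

```python
def expected_logs_tailed(n_secondaries, m_servers):
    """Expected number of distinct primary RSs whose logs must be tailed,
    assuming each secondary's primary lands uniformly at random on one of
    m_servers. Standard occupancy formula: m * (1 - (1 - 1/m)**n)."""
    return m_servers * (1.0 - (1.0 - 1.0 / m_servers) ** n_secondaries)

if __name__ == "__main__":
    m = 20  # hypothetical number of region servers
    for n in (10, 20, 40, 100):
        frac = expected_logs_tailed(n, m) / m
        print(f"n={n:3d} secondaries, m={m} RSs -> tailing ~{frac:.0%} of all logs")
```

For n = m = 20 this already works out to roughly two thirds of all logs, and for n = 5m it is essentially all of them, which is consistent with the "most of the RSs if n > m" intuition, and with why HLog/HFile-aware placement (rather than random placement) matters here.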
Jon

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com