hbase-dev mailing list archives

From Enis Söztutar <enis....@gmail.com>
Subject Re: [Shadow Regions / Read Replicas ] Block Affinity
Date Tue, 03 Dec 2013 19:37:21 GMT
Responses inlined.

On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <jon@cloudera.com> wrote:

> > Enis:
> > I was trying to refer to not having co-location constraints for secondary
> > replicas whose primaries are hosted by the same RS. For example, if
> > R1(replica=0) and R2(replica=0) are hosted on RS1, R1(replica=1) and
> > R2(replica=1) can be hosted by RS2 and RS3 respectively. This can
> > definitely use the hdfs block affinity work though.
> This particular example doesn't have enough to tease out the different
> ideal situations.  Hopefully this will help:
> We have RS-A hosting regions X and Y.  With affinity groups, let's say
> RS-A's logs are written to RS-A, RS-L, and RS-M.
> Let's also say that X is written to RS-A, RS-B, and RS-C, and Y to RS-A,
> RS-D, and RS-E.
> >> Jon:
> >> However, I don't think we get into a situation where all RS's must read
> >> all other RS's logs – we only need the shadow RS's to read the
> >> primary RS's log.
> > Enis:
> > I am assuming a random distribution of secondary regions per above. In
> this case, for replication=2, a region server will
> > have half of it's regions in primary and the other in secondary mode. For
> all the regions in the secondary mode, it has to
> > tail the logs of the rs where the primary is hosted. However, since there
> is no co-location guarantee, the primaries are
> > also randomly distributed. For n secondary regions, and m region servers,
> you will have to tail the logs of most of the RSs
> > if n > m with a high probability (I do not have the smarts to calculate
> the exact probability)
> For hi-availability stale-read replicas (read replicas), it seems best to
> assign the secondary regions to the RS's where the HFiles are hosted.
> Thus this approach would want to assign shadow regions like this (this is
> the "random distribution of secondary regions"):
> * X on RS-A(rep=0), RS-B(rep=1), and RS-C(rep=2); and
> * Y on RS-A(rep=0), RS-D(rep=1), and RS-E(rep=2).
> For the most efficient consistent read-recovery (shadow regions/memstores),
> it would make sense to have them assigned to the RS's where the HLogs are
> local. Thus this approach would want to assign shadow regions for regions
> X, Y, and Z on RS-L and RS-M.

I don't think this is the case. Recovery is a multi-step process, and
reading and applying the log is only one step. After the region is opened,
you definitely want the data files to be as local as possible. Considering
the relative sizes of the hfiles and the WALs, I think we will always want
to use hdfs affinity groups for hfiles rather than hlogs to assign
secondary replicas. This will help both stale reads and local reads in
case of a promotion to primary.
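The placement argued for here can be sketched as a toy routine: secondaries follow each region's HFile affinity group, so reads stay local both for stale reads and after a promotion. The data structures and the `place_secondaries` helper are hypothetical illustrations, not HBase code:

```python
# Toy placement sketch: secondary replicas follow the HFile affinity
# group (favored nodes) of each region, not the HLog's affinity group.

def place_secondaries(hfile_affinity, num_secondaries):
    """hfile_affinity: region -> RS's holding that region's HFile blocks,
    primary first. Returns region -> [primary, secondary, ...]."""
    return {region: group[:num_secondaries + 1]
            for region, group in hfile_affinity.items()}

# Jon's example: X's HFiles live on RS-A/B/C, Y's on RS-A/D/E.
affinity = {"X": ["RS-A", "RS-B", "RS-C"],
            "Y": ["RS-A", "RS-D", "RS-E"]}
print(place_secondaries(affinity, 2))
# → {'X': ['RS-A', 'RS-B', 'RS-C'], 'Y': ['RS-A', 'RS-D', 'RS-E']}
```

On a promotion, the new primary is already on a node holding local HFile blocks, so no region movement or flush is needed to regain locality.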

> A simple solution, optimal for both read replicas and shadow regions,
> would be to assign the regions and the HLog to the same set of machines,
> so that the RS's hosting the logs and regions X, Y, and Z are the same
> machines -- let's say RS-A, RS-H, and RS-I.  This has some non-optimal
> balancing ramifications upon machine failure -- the work of RS-A would be
> split between only RS-H and RS-I.

I don't think we want this. This implies that we are creating region
assignment groups (group-based assignment, as described in the doc). The
problem is that in case of a crash, you cannot evenly distribute the
regions of the failed primary, or you will still end up tailing the logs
of all the region servers. Plus, if you want to load balance, it will be
even harder to satisfy the constraints while keeping the balance.
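The log-tailing concern above can be put in rough numbers. A back-of-envelope sketch, assuming the matching primaries are placed uniformly at random over the other servers (an assumption of this sketch, not a description of the actual balancer):

```python
# Expected number of distinct WALs a region server must tail, if it hosts
# n secondary regions whose primaries are spread uniformly at random over
# the other m-1 region servers (uniform placement is an assumption here).

def expected_logs_tailed(n, m):
    others = m - 1
    # P(a given other RS hosts none of the n primaries) = (1 - 1/others)^n
    return others * (1 - (1 - 1 / others) ** n)

# With n > m, a server ends up tailing almost every other server's log:
print(round(expected_logs_tailed(n=200, m=100), 1))  # → 86.0 of 99
```

So with n well above m, essentially every region server is tailing most of the cluster's logs, which is the scaling problem being described.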

In your example, if we have replication=2, we cannot simply move all the
primary regions of RS-A to RS-H, which would then suddenly have twice the
number of regions.
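The imbalance in that example can be made concrete with toy arithmetic. The starting figure of 100 regions per server and the `load_after_failover` helper are hypothetical:

```python
# Toy arithmetic for the group-assignment objection. Assumption (not from
# the thread): every RS starts with 100 primary regions.

def load_after_failover(regions_per_rs, spread_over):
    # Each surviving target absorbs an equal share of the failed
    # server's primary regions.
    return regions_per_rs + regions_per_rs / spread_over

# All of RS-A's primaries land on one grouped peer:
print(load_after_failover(100, spread_over=1))   # → 200.0 (twice the load)

# Split between only RS-H and RS-I, as in Jon's example:
print(load_after_failover(100, spread_over=2))   # → 150.0

# Spread across the other 99 servers, the load stays nearly even:
print(load_after_failover(100, spread_over=99))  # ~101
```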

> A more complex solution for both would be to choose machines for the
> purpose they are best suited for.  Read replicas are hosted on their
> respective machines, and shadow region memstores on the HLog's RS's.
>  Promotion becomes a more complicated dance: upon RS-A's failure, we
> have the log-tailing shadow regions catch up and perform a flush of the
> affected memstores to the appropriate hdfs affinity group/favored nodes.
>  So the shadow memstore for region X would flush the hfile to A, B, C,
> and region Y to A, D, E.  Then the read replicas would be promoted (close
> secondary, open as primary) based on where the regions'/hfiles' affinity
> groups are.  This feels like an optimization done on the 2nd or 3rd rev.

I think we do not want to differentiate between RS's by splitting them
between primaries and shadows. This will complicate provisioning,
administration, monitoring, and load balancing a lot, and will not achieve
very cheap secondary region promotions (because you still have to move the
region, as you described).

> Jon
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
