hbase-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: [Shadow Regions / Read Replicas ] Wal per region?
Date Tue, 03 Dec 2013 21:58:31 GMT
On Tue, Dec 3, 2013 at 11:21 AM, Devaraj Das <ddas@hortonworks.com> wrote:

> On Mon, Dec 2, 2013 at 10:20 PM, Jonathan Hsieh <jon@cloudera.com> wrote:
> > With this in mind, I am actually making the case that we would group
> > all the regions from RS-A onto the same set of preferred region
> > servers.  This way we only need to have one or two other RS's tailing
> > the RS.
> >
> > So for example, if regions X, Y and Z were on RS-A and its hlog, the
> > shadow region memstores for X, Y, and Z would be assigned to the same
> > one or two other RSs.  Ideally this would be where the HLog file
> > replicas have locality (helped by favored nodes/block affinity).
> > Doing this, we hold the number of readers on the active hlogs to a
> > constant number and do not add any new cross-machine traffic (though
> > tailing currently has costs on the NN).
> >
> >
> Yes, we did consider this, but the issue is how much more complex the
> failure handling would be in order to maintain the grouping of the
> regions. So, for example, if RS-A goes down, would the master be able to
> quickly choose another RS-A' to maintain the grouping of the regions
> from RS-A? Or do we then fall back to the regular single-region
> assignments and have the balancer group the regions back? What's the
> grouping size? The same issues apply to the assignments of the shadows.

I'm not seeing how this is more complicated than the read replica scheme
(and the potential fixup needed with replica_ids, and the selection of a
new replica to replace the "fallen" primary) in similar scenarios.

I should address this more fully in the shadow region/memstore design,
but my current idea for grouping is: we assign shadow regions to the nodes
the hlog is being replicated to.  Selection would be no more complicated
than what the master does now to select new regionservers after an RS
serving regions goes down.  Upon a crash, the new shadow memstore location
would be selected in the same place as the new primary RS (the region plan
generated by the balancer).  The grouping policy is managed there, with an
obvious preference for the nodes where the hlog HDFS block replicas are
located.
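As a rough sketch of the selection idea (all names here are hypothetical and illustrative, not actual HBase APIs), shadow-host selection could prefer servers that already hold replicas of the primary's hlog blocks, falling back to any other live regionserver:

```java
import java.util.*;

// Hypothetical sketch: choose shadow-region hosts with a locality-first
// preference for the servers holding the primary's hlog block replicas.
public class ShadowPlacement {
    static List<String> selectShadowHosts(List<String> hlogReplicaHosts,
                                          Set<String> liveServers,
                                          String primary,
                                          int numShadows) {
        List<String> chosen = new ArrayList<>();
        // First pass: prefer hosts that already have the hlog blocks locally.
        for (String host : hlogReplicaHosts) {
            if (chosen.size() == numShadows) break;
            if (!host.equals(primary) && liveServers.contains(host)) {
                chosen.add(host);
            }
        }
        // Fallback: any other live RS, as the master does for normal assignment.
        for (String host : liveServers) {
            if (chosen.size() == numShadows) break;
            if (!host.equals(primary) && !chosen.contains(host)) {
                chosen.add(host);
            }
        }
        return chosen;
    }
}
```

With this shape, the common case adds no new cross-machine read traffic, since the tailing shadows sit on the same nodes HDFS is already writing the hlog replicas to.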

In the shadow memstore proposal, a failure would force the relevant shadow
regions to catch up to RS-A's final log point at the time of the crash, and
then be promoted to primary (implying we keep the shadow memstore open and
turn it into the real memstore).

A more generic mechanism spurred on by this discussion would be to have the
shadow memstore RS catch up, flush the shadow memstore to HFiles, and then
open the region as a primary.  Since we control where we flush to at this
point in time, we could flush to ensure locality to the replica sets of the
particular regions.
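The generic catch-up path above can be sketched as a small state machine (method and field names are illustrative, not real HBase APIs):

```java
// Hypothetical sketch of promoting a shadow to primary:
// 1. replay the hlog tail up to the crashed RS's final edit,
// 2. flush the shadow memstore to HFiles (placed for locality),
// 3. open the region as the new primary.
public class ShadowPromotion {
    enum State { SHADOW, CAUGHT_UP, FLUSHED, PRIMARY }

    State state = State.SHADOW;
    long appliedSeqId = 0;

    void catchUp(long finalSeqIdOfCrashedRs) {
        // Replay the remaining hlog entries up to the crash point.
        appliedSeqId = finalSeqIdOfCrashedRs;
        state = State.CAUGHT_UP;
    }

    void flushToHFiles() {
        // We control placement at this point, so flush to hosts in the
        // region's replica set to preserve locality.
        state = State.FLUSHED;
    }

    void openAsPrimary() {
        // Region open proceeds from freshly flushed HFiles; no log
        // splitting is needed for this region.
        state = State.PRIMARY;
    }
}
```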

I was leaning towards having shadow opens coupled with region opens.  So
when the region is opened on a new RS, open the shadow region elsewhere on
another RS.

I'm still considering whether the shadow memstore should have a region->rs
mapping (each region gets assigned a single shadow RS) or instead an
(rs, table)->rs mapping (each RS gets assigned a shadow RS for a
particular table).
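The two mapping shapes under consideration could look like this (purely illustrative; these are not actual HBase metadata structures):

```java
import java.util.*;

// Hypothetical sketch of the two shadow-assignment mapping options.
public class ShadowMapping {
    // Option A: region -> shadow RS (each region assigned individually).
    static final Map<String, String> regionToShadow = new HashMap<>();

    // Option B: (primary RS, table) -> shadow RS, so all regions of a
    // table hosted on one RS share a shadow and the hlog-tailing fan-in
    // stays constant.  A composite string key stands in for a pair type.
    static final Map<String, String> rsTableToShadow = new HashMap<>();

    static String key(String rs, String table) {
        return rs + "|" + table;
    }

    // Resolve a region's shadow under option B.
    static String shadowFor(String primaryRs, String table) {
        return rsTableToShadow.get(key(primaryRs, table));
    }
}
```

Option B is what makes the grouping discussed earlier cheap: one lookup per (RS, table) covers every region in the group.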

Finally, if we are in a case where the shadow memstores are down and the
primary goes down, we fall back to the existing recovery mechanism.

> But having said that, I agree that if the WALpr is expensive on non-SSD
> hardware, for example, we need to address the region grouping issues.
> > One inefficiency we have is that if there is a single log per RS, we
> > end up reading all the logs for tables that may not have the shadow
> > feature enabled.  However, with HBase multi-wals coming, one strategy
> > is to shard wals to a number on the order of the number of disks on a
> > machine (12-24 these days).  I think a wal per namespace (this could
> > be used to have a wal per table) would make sense.  This way of
> > sharding the hlog would reduce the amount of reading of irrelevant log
> > entries in a log-tailing scheme.  It would have the added benefit of
> > reducing the log splitting work, reducing MTTR, and allowing for
> > recovery priorities if the primaries and shadows also go down.  (This
> > is a generalization of the idea of separating META out into its own
> > log.)
> >
> > Jon.
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
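The namespace-sharded wal idea quoted above could be sketched as a simple routing function (hypothetical; not an actual HBase multi-wal provider API), so a tailing reader only opens the shards for the namespaces it cares about:

```java
// Hypothetical sketch: route each edit to a wal shard keyed by
// namespace, capped at roughly the machine's disk count.  floorMod
// keeps the shard index non-negative for any hashCode value.
public class WalSharding {
    static int walShardFor(String namespace, int numShards) {
        return Math.floorMod(namespace.hashCode(), numShards);
    }
}
```

A table-level key would work the same way; the trade-off is more open wal files per RS against less irrelevant reading for tailing shadows and smaller per-shard splitting work at recovery time.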

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
