hadoop-general mailing list archives

From Eric Baldeschwieler <eri...@hortonworks.com>
Subject Re: MRv1 JT Availability (was [DISCUSS] Spin out MR, HDFS and YARN ...)
Date Tue, 18 Sep 2012 06:34:38 GMT
> I just want to be more clear in what I meant by "HA JobTracker for
> parity with HDFS". There should be no need to quiesce the JT with a
> highly available NameNode, and restarting jobs from the beginning if
> the JT crashes isn't good enough to meet the user expectations implied
> by "high availability", at least those who are our internal customers.

Hi Andrew.

A couple of points...

1) Quiescing the JT can certainly be refined, but the goal there is reasonable behavior when
the storage layer becomes unavailable or has not yet started during a boot sequence.  That
is useful functionality that simply addresses a different set of failure cases.

2) I agree that restarting jobs is not desirable.  This is an independent issue we've been
working on in YARN.  The key here is simply sorting out how you manage state efficiently on
ZK or HDFS.  The good news is that HBase demonstrates how this can be done (see its region
server and master designs).
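
To make that concrete, here is a rough sketch (plain ZooKeeper API; the znode path, timeout,
and recovery hook are invented for illustration, nothing that exists in MR today) of the
ephemeral-znode active/standby election the HBase master uses:

    import org.apache.zookeeper.*;

    // Sketch of active/standby JT election via an ephemeral znode, in the
    // spirit of the HBase master design.  Path and timeout are illustrative.
    public class JTElection implements Watcher {
      private static final String ACTIVE = "/jobtracker/active";
      private final ZooKeeper zk;

      public JTElection(String quorum) throws Exception {
        zk = new ZooKeeper(quorum, 30000, this);
      }

      /** Returns true if we won the election and are now the active JT. */
      public boolean tryBecomeActive(byte[] myAddress) throws Exception {
        try {
          // Ephemeral: ZK deletes the znode when our session dies, which
          // is exactly what lets the standby notice and take over.
          zk.create(ACTIVE, myAddress, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL);
          return true;
        } catch (KeeperException.NodeExistsException e) {
          zk.exists(ACTIVE, true);  // standby: watch the current holder
          return false;
        }
      }

      public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
          // Active JT went away; re-run the election and recover state
          // from durable storage (elided).
        }
      }
    }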

> I meant hot JT failover, that there is a primary and backup JT, that
> they share state sufficient for the backup to take over immediately if
> the primary fails, and that the TTs and JobClients both will switch
> seamlessly to the backup should their communications with the primary
> fail.


I think state sharing is very expensive and error prone.  These kinds of hot-hot solutions
are almost an anti-pattern IMO.  In the case of HDFS we are halfway through implementing this,
so we don't need to reopen that debate.  One can argue that HBase and HDFS might need hot-hot,
given the desire to serve MANY very low-latency requests.  But I'd observe that HBase hasn't
opted for this complexity yet, and I'm more tempted to emulate its designs than HDFS's for MR.

For MR, a good simple cold failover design should be MUCH easier to implement, debug, and
maintain.  Running jobs need not be lost (their state can be stored in durable storage or
recovered from the cluster), and the time to detect failure should end up dominating the time
to recover, much as we are seeing in HDFS testing.  So for small clusters there should
be zero reason to do hot-hot.
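
For the durable-state piece, here is a minimal sketch of what I mean (the state directory,
job-state bytes, and resume logic are all hypothetical placeholders; hsync() per the 2.x
Syncable API):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    // Cold-failover state store sketch: the active JT checkpoints per-job
    // state to HDFS; whichever JT becomes active next replays it.
    public class JTStateStore {
      private final FileSystem fs;
      private final Path stateDir = new Path("/mapred/jt-state");

      public JTStateStore(Configuration conf) throws IOException {
        fs = FileSystem.get(conf);
      }

      /** Active JT: persist job state durably before acking the client. */
      public void checkpoint(String jobId, byte[] jobState) throws IOException {
        FSDataOutputStream out = fs.create(new Path(stateDir, jobId), true);
        try {
          out.write(jobState);
          out.hsync();  // durable on the DataNodes before we ack
        } finally {
          out.close();
        }
      }

      /** Newly active JT: reload every job found under stateDir. */
      public void recover() throws IOException {
        for (FileStatus st : fs.listStatus(stateDir)) {
          FSDataInputStream in = fs.open(st.getPath());
          try {
            // Rebuild the in-memory job from its checkpoint here (elided).
          } finally {
            in.close();
          }
        }
      }
    }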

I think we are much better off focusing on simple design patterns that use the storage systems
we already have (ZK and HDFS) to restore state quickly on failover.  The HBase region server
and master are good examples of design in this area that we should emulate IMO.  MR has much
simpler problems, and any investment we make in improving WALs and state management on HDFS
will make HBase and every new compute model ported to YARN better.
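
And the WAL piece need not be fancy.  A minimal sketch, assuming the hflush() visibility
semantics from the HDFS append work (the path and record framing are invented):

    import java.io.IOException;
    import org.apache.hadoop.fs.*;

    // Minimal length-prefixed write-ahead log on HDFS, HBase-style:
    // log the edit durably before mutating in-memory JT state.
    public class JTWriteAheadLog {
      private final FSDataOutputStream out;

      public JTWriteAheadLog(FileSystem fs, Path log) throws IOException {
        out = fs.create(log, true);
      }

      public synchronized void append(byte[] edit) throws IOException {
        out.writeInt(edit.length);  // simple length-prefixed framing
        out.write(edit);
        out.hflush();  // visible to a replaying reader from here on
      }

      public void close() throws IOException {
        out.close();
      }
    }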


On Sep 9, 2012, at 10:57 AM, Andrew Purtell <apurtell@apache.org> wrote:

> Hi Arun,
> 
> On Mon, Sep 3, 2012 at 4:02 AM Arun C Murthy wrote:
>>> On Sep 1, 2012, at 6:32 AM, Andrew Purtell wrote:
>>> I'd imagine such a MR(v1) in Hadoop, if this happened, would concentrate on
>>> performance improvements, maybe such things as alternate shuffle plugins.
>>> Perhaps a HA JobTracker for parity with HDFS.
>> 
>> Lots of this has already happened in branch-1, please look at:
>> # JT Availability: MAPREDUCE-3837, MAPREDUCE-4328, MAPREDUCE-4603 (WIP)
> 
> Thanks for the pointers!
> 
> I just want to be more clear in what I meant by "HA JobTracker for
> parity with HDFS". There should be no need to quiesce the JT with a
> highly available NameNode, and restarting jobs from the beginning if
> the JT crashes isn't good enough to meet the user expectations implied
> by "high availability", at least those who are our internal customers.
> I meant hot JT failover, that there is a primary and backup JT, that
> they share state sufficient for the backup to take over immediately if
> the primary fails, and that the TTs and JobClients both will switch
> seamlessly to the backup should their communications with the primary
> fail. I'd expect state sharing to limit scalability to the small- and
> medium-cluster range, and that's fine, YARN is the answer for
> scalability issues in the large and largest clusters already.
> 
>> # Performance - backports of PureJavaCrc32 in spills (MAPREDUCE-782), fadvise backports (MAPREDUCE-3289), and several other misc. fixes.
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)

