bigtop-dev mailing list archives

From Konstantin Boudnik <...@apache.org>
Subject Re: adding HA monitoring to bigtop
Date Wed, 10 Oct 2012 23:13:50 GMT
Steve,

great stuff. Here's my initial feedback:

1. I am not passing judgement on how the monitoring is done, although
something like Nagios would fit the bill well enough, IMO. Anyway... This
monitoring seems very Hadoop-HA specific, so I would say it is better kept in
Hadoop in one form or another - hadoop/contrib seems like a good place to
start. In other words, I don't think this is generic enough monitoring
software to be included in Bigtop. For instance, I'd be happy to include
Ganglia or some Nagios hooks for the same purposes. Packaging for this
monitoring software can of course be added to the Bigtop stack, like we do
for many other components - that looks like a very reasonable approach.

2. The failure-inducing library seems like a great addition to iTest. In
fact, if I were doing Hadoop fault injection again I would certainly go with
a MOP'ping, Groovy-based framework instead of the AspectJ boredom. So, I like
the idea and it seems to fit very well with the original design ideas of
iTest - let's add the library to Bigtop. There are things to look at and
discuss, of course, but I like the overall idea!
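
Just to make the MOP'ping point concrete: the Groovy metaclass lets you
intercept a method and make it fail, with no AspectJ weaving and no
recompilation. A rough sketch - the class and method below are made up
purely for illustration, nothing from Steve's library:

  // Illustrative stand-in for whatever we want to sabotage in a test.
  class LeaseRenewer {
    String renewLease() { 'ok' }
  }

  // Every Groovy-dispatched call to renewLease() now throws - that is the
  // whole fault injection, done through the metaclass.
  LeaseRenewer.metaClass.renewLease = { -> throw new IOException('injected NN outage') }

  try {
    new LeaseRenewer().renewLease()           // fails with the injected fault
  } catch (IOException expected) {
    println "caught: ${expected.message}"
  }

  // Put things back the way they were.
  GroovySystem.metaClassRegistry.removeMetaClass(LeaseRenewer)
  assert new LeaseRenewer().renewLease() == 'ok'

A real library would wrap this in proper setup/teardown, but that's the
basic trick, and it fits the iTest way of doing things.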

To summarize: I'm rather negative on keeping the monitoring software as a
part of Bigtop, and quite positive on bringing the testing lib in as a part
of iTest. I've put a couple of quick sketches of what I mean inline below.

Cos

On Mon, Oct 08, 2012 at 10:03AM, Steve Loughran wrote:
> As you may have heard, the latest release of HDP-1 has some HA monitoring
> logic in it
> 
> Specifically
> 
>    1. a canary monitor for VMWare -a new init.d daemon that monitors HDFS,
>    JT or anything with 1 or more of (PID, port, URL) & stops singing when any
>    of them fails or a probe blocks for too long. There's some complex
>    lifecycle work at startup to deal with (a) slow booting services and (b)
>    the notion of upstream dependencies -there's no point reporting a JT
>    startup failure if the NN is offline, as that is what is blocking the JT
>    from opening its ports.
>    2. a Linux HA Resource Agent that uses the same liveness probes as the
>    canary monitor, but is invoked from a Linux HA bash script. This replaces
>    the init.d script on an HA cluster, relying on LinuxHA to start and stop it.
>    3. RPM packaging for these
> 
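
Re. the liveness probes above, and my Nagios/Ganglia comment: if the probes
end up behind a small interface, wrapping an external monitoring hook around
them is a few lines of Groovy. A rough sketch - the interface and all the
names below are mine, purely for illustration, not from the actual code:

  // Hypothetical probe contract; the real probes (PID, port, URL) would
  // expose something along these lines.
  interface LivenessProbe {
    String getName()
    boolean isLive()
  }

  // A port probe as an example implementation.
  class PortProbe implements LivenessProbe {
    String name
    String host
    int port
    int timeoutMillis = 2000

    boolean isLive() {
      def s = new Socket()
      try {
        s.connect(new InetSocketAddress(host, port), timeoutMillis)
        return true
      } catch (IOException ignored) {
        return false
      } finally {
        s.close()
      }
    }
  }

  // Nagios-style exit codes: 0 = OK, 2 = CRITICAL. Listing the NN probe
  // before the JT probe gives the "don't blame the JT when the NN is down"
  // behaviour almost for free.
  def probes = [new PortProbe(name: 'NN IPC', host: 'nn-host', port: 8020),
                new PortProbe(name: 'JT IPC', host: 'jt-host', port: 8021)]
  def dead = probes.find { !it.isLive() }
  println(dead ? "CRITICAL: ${dead.name} is not responding" : 'OK')
  System.exit(dead ? 2 : 0)
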
> Test wise, along with the unit tests of all the various probes, there's:
> 
>    1. A Groovy GUI, Hadoop Availability Monitor, that can show what is
>    going on and attempt FS and JT operations. Useful for demos and
>    monitoring.
>    2. An MR job designed to trigger task failures when the NN is down -by
>    executing FS operations in map or reduce phases. This is needed to verify
>    that the JT doesn't over-react when HDFS is down.
>    3. the beginnings of a library to trigger failures on different
>    infrastructure, "apache chaos". To date it handles vbox and human
>    intervention (it brings up a dialog). Manual mode is quite good for
>    coordinating physical actions like pulling the power out.
>    4. Test cases that do things like ssh in to machines, kill processes &
>    verify the FS comes back up.
> 
> 
> There's some slides on this on slideshare, but the animation is missing -I
> can email out the full PPTX if someone really wants to see it.
> 
> http://www.slideshare.net/steve_l/availability-and-integrity-in-hadoop-strata-eu-edition
> 
> I think the best home for this is not some hadoop/contrib package but
> bigtop -it fits in with the notion of bigtop being where the daemon scripts
> live, and where the RPMs come from. It also fits in with the Groovy test
> architecture -being able to hand closures down to trigger different system
> failures turns out to be invaluable.
> 
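
This closure-passing pattern is exactly what I'd like to see in iTest: the
test owns the scenario, the failure library owns how the outage is inflicted.
Roughly like this - every name below is illustrative, and the actual kill
mechanism (vbox, ssh, a dialog asking a human to pull a cable) is elided:

  // A failure action is just a closure; the chaos library would supply
  // concrete ones for vbox, ssh, or manual intervention.
  def killNameNode = { ->
    // e.g. 'ssh nn-host pkill -9 -f namenode'.execute().waitFor()
    println 'pretending to kill the NN'
  }

  // The scenario is parameterised by the failure closure, so the same test
  // runs against vbox, physical boxes, or a human with a power cable.
  def runOutageScenario = { Closure inflictFailure, Closure verifyRecovery ->
    inflictFailure()
    // ... wait for failover / restart here ...
    assert verifyRecovery()
  }

  runOutageScenario(killNameNode, {
    true   // placeholder; a real check would poke HDFS and the JT
  })
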
> I think the chaos stuff, if it were actually expanded to work with other
> virtual infras (by way of jclouds), and more physical stuff (exec fencing
> device scripts, ssh in to linksys routers and cause network partitions)
> could be useful for anyone trying to create system failures during test
> runs. It's slightly different from the Netflix Chaos Monkey in that the
> monkey kills production servers on a whim; this could do the same if a
> process were set up to use the scripts -for testing I want choreographed
> outages, and more aggressive failure simulation during large scale test
> runs of the layers above. I also want to simulate more failures than just
> "VM goes away", as its those more complex failures that show up in the
> physical world (net partition, process hang, process die).
> 
> If people think that this looks like a good match, I'll go into more
> detail on what there is and what else could be done. One thing I'd like to
> do is add a new reporter to the canary monitor daemon that handles service
> failures by killing and restarting the process. This could be deployed on
> all worker nodes to keep an eye on the TT, DN & region server, for better
> automated handling of things like the HTTPD in the TT blocking all callers,
> as Jetty is known to do from time to time.
> 
> -Steve
