hadoop-common-dev mailing list archives

From Ralph Castain <...@open-mpi.org>
Subject RM & AM for Hadoop
Date Fri, 30 Dec 2011 16:02:22 GMT
Hi folks

I have been familiarizing myself with the Hadoop 0.23 code tree, and as I worked my way through
the code I found myself wondering whether people were aware of the tools commonly used by the
HPC community. In case the community isn't, I thought it might be worth giving a very brief
summary of the field.

The HPC community has a long history of implementing and fielding resource and application
management tools that are fault tolerant with respect to process and node failures. These
systems readily scale to 30K nodes and 100K processes today, and are running on tens of
thousands of clusters around the world. Efforts to extend these capabilities to the 100K-node,
1M-process level are being aggressively pursued and are expected to be fielded in the next two
years.

Although these systems support MPI applications, there is nothing exclusively MPI about them.
All will readily execute any application, including a MapReduce operation. HDFS support is
lacking, of course, but can be readily added in most cases.

My point here is that replicating all the fault-tolerant resource and application management
capabilities of the existing systems would be a major undertaking, and it may not be necessary.
The existing systems represent hundreds of man-years of effort coupled with thousands of
machine-years of operational experience; together, these provide a level of confidence that
will be hard to duplicate.

In the HPC world, resource managers (RMs) and application managers (AMs) are typically separate
entities. Although many people use the AM that comes with a particular RM, almost all RM/AM
systems support a wide range of combinations, so users can pick and choose the pairing that
best fits their needs. Both proprietary (e.g., Moab and LSF) and open-source (e.g., SLURM and
Gridengine) RMs and AMs are available, each with differing levels of fault tolerance.

For those not wanting to deal with the peculiarities of each combination, abstraction layers
are available. For example, Open MPI's "mpirun" tool will transparently interface to nearly
all available RMs and AMs, executing your application in the same manner regardless of the
underlying systems. OMPI provides its own fault tolerance capabilities to ensure commonality
across environments, and it is capable of restarting and rewiring applications as required. In
addition, OMPI allows the definition of "fault groups" (i.e., failure dependencies between
nodes) to further protect against cascading failures, and it offers a variety of software and
hardware sensors to detect deteriorating behavior prior to actual node/process failure.
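To make the "same manner regardless of the underlying systems" point concrete, here is a
minimal sketch of launching an ordinary, non-MPI binary under mpirun. The choice of `hostname`
as the payload is purely illustrative; any application binary could stand in its place, and the
guard around the call is just so the sketch degrades gracefully where Open MPI is not installed:

```shell
# Minimal sketch: mpirun launches any binary, not just MPI programs.
# Here the payload is the system `hostname` command, standing in for
# an arbitrary application process.
if command -v mpirun >/dev/null 2>&1; then
  # -np sets the number of process copies; mpirun maps them onto
  # whatever allocation the underlying RM provides (or localhost).
  mpirun -np 4 hostname || echo "mpirun present but launch failed in this environment"
else
  echo "Open MPI not installed; nothing to demonstrate"
fi
```

The same command line works unchanged whether the job was allocated by SLURM, Torque, or run
interactively on a workstation, which is the abstraction being described above.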

Multiple error response strategies are available from most existing systems, including OMPI.
These are user-selectable at time of execution and range from simple termination of the entire
application, to restarting processes on their current node (assuming that node is up or
restarts), to shifting processes onto nodes already being used by the application, to shifting
processes to available "backup" nodes.

I can provide more info as requested, but wanted to at least make you aware of the existing
capabilities. A quick "poll" of HPC RM/AM providers indicates a willingness to add HDFS support;
they haven't done so to date because nobody has asked them to, and they tend to respond to user
demand. No technical barrier is immediately apparent.

