hadoop-mapreduce-issues mailing list archives

From "Ralph H Castain (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2911) Hamster: Hadoop And Mpi on the same cluSTER
Date Wed, 11 Apr 2012 18:53:19 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251839#comment-13251839 ]

Ralph H Castain commented on MAPREDUCE-2911:

Hi Arun

No offense taken at all - hopefully that goes in both directions. I'm pretty agnostic about
these things after working with them all over the years. Each one has its pros and cons. :-)

I'll try to address your points in sequence. I'd be happy to drop by some time (I'm in SF
frequently these days) and chat about it (higher bandwidth is often helpful).

First, people on HPC clusters also sit for some time in the queue waiting for resources. It
is a very odd day when you can just jump onto a system! However, launch time is still a critical
issue on such clusters since they typically operate at scale. Seconds add up over time.

MPI jobs typically do run for long times, though surveys report that roughly 70% of them run
for one hour or less (but more than a few minutes). Although we might agree that a few seconds
shouldn't matter, the fact is that users will loudly protest such delays, especially when
many of the MPI jobs are actually run interactively during development. Startup time is a
very sensitive issue in HPC - my back has the scars to prove it.

Without changing the basic design of Yarn, I don't see how Yarn can ever really compete in
this area. The fact that the RM has to wait to be called via heartbeat by the nodes in order
to do the allocation is a bottleneck, and the requirement that we launch one container at
a time (copying binaries, etc.) will be difficult to overcome.
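To make the concern concrete, here is a back-of-the-envelope sketch of that launch model. All the numbers and the cost model itself are illustrative assumptions, not measurements of Yarn:

```python
def estimated_launch_seconds(num_procs, num_nodes, heartbeat_s, per_container_s):
    """Rough lower bound on wall-clock time to get an MPI job's processes running
    under a heartbeat-driven allocator, assuming (1) the RM can only hand out
    allocations when nodes heartbeat in, and (2) each NodeManager localizes and
    starts its containers one at a time. Purely a toy model."""
    procs_per_node = -(-num_procs // num_nodes)  # ceiling division
    # One heartbeat interval before allocations arrive, then serial container
    # start-up on each node (nodes themselves work in parallel).
    return heartbeat_s + procs_per_node * per_container_s

# 1000-process job on 100 nodes, 1 s heartbeat, 0.5 s per container start
# (all assumed figures):
print(estimated_launch_seconds(1000, 100, 1.0, 0.5))  # 6.0
```

Even this optimistic model shows per-container serial costs dominating as the per-node process count grows, which is why HPC launchers go to such lengths to start everything in parallel.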


HPC systems are always supported by NFS mounts of home directories as well as system libraries
- we never use local disks due to the management headache they create. User requirements for
different library versions are handled via the "modules" system. We have found ways to make
this scalable to over 100k nodes without significant startup time. Of course, the really big
clusters do this with sophisticated network design (limiting the number of nodes served by
each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps, so systems
connected via lesser networks can't compete. On the other hand, smaller clusters do quite
well with appropriately laid out Ethernet support.

Thus, we never pre-distribute binaries, even on the largest clusters (jars aren't an issue
as you have to look hard to find any Java applications in HPC). We also never build static
libraries as memory is at a premium for HPC jobs, so everything is dynamically linked. On
the larger systems, there sometimes are bandwidth restrictions on the NFS channels - in these
cases, we disable dlopen so that the MPI libraries are rolled up into one dynamic lib to minimize
the number of NFS calls. The launch example I gave was from such a configuration.


Clearly, Yarn could provide collective communication support if it chose to do so. However,
it would result in a fundamental change in the Yarn architecture.

In the HPC world, the RMs operate in one of two modes. Some, like SLURM, provide their own
wireup support. In these cases, each process provides its endpoint information to the local
RM daemon. Those daemons in turn share that info across all other daemons hosting processes
from that job using collective operations designed for large scale. Application processes
are then either passed the resulting info (assembled from all procs in the job), or can query
the local RM daemon to obtain the pieces they need.
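The SLURM-style wireup described above can be sketched as a toy put/fence/get exchange. The class and function names here are illustrative stand-ins, not a real PMI API, and the endpoint strings are made up:

```python
class LocalDaemon:
    """Stand-in for the per-node RM daemon that collects endpoint info."""
    def __init__(self):
        self.local = {}      # endpoints published by processes on this node
        self.global_db = {}  # populated after the daemons' collective exchange

    def put(self, rank, endpoint):
        # An application process publishes its endpoint to its local daemon.
        self.local[rank] = endpoint

def fence(daemons):
    """Stand-in for the daemons' scalable collective: once it completes,
    every daemon holds every process's endpoint."""
    merged = {}
    for d in daemons:
        merged.update(d.local)
    for d in daemons:
        d.global_db = dict(merged)

# Two nodes, two processes per node (invented addresses):
node0, node1 = LocalDaemon(), LocalDaemon()
node0.put(0, "10.0.0.1:5000"); node0.put(1, "10.0.0.1:5001")
node1.put(2, "10.0.0.2:5000"); node1.put(3, "10.0.0.2:5001")
fence([node0, node1])
print(node1.global_db[0])  # a process on node1 can now look up rank 0
```

The point of the pattern is that application processes never talk to each other to wire up; all the scale-sensitive traffic happens once, between daemons, inside the fence.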

Some RMs, like Torque, do not offer these services. In those cases, the MPI implementation
itself provides it by first launching its own daemons on each node in the job. These daemons
then assume the role played by the above SLURM daemon, circulating the endpoint info using
similar collective operations. While it might seem that this has to take longer due to the
additional daemon launch, it actually is competitive as the startup time (given the scalable
launch) for the daemons is really quite small, and the time required to share info across
the procs (called the "modex" operation) is much longer.

Translating these methods to Yarn would require that the Nodemanagers learn how to communicate
with each other, which seems to me to break the Yarn design. I admit I could be wrong here,
so please feel free to consider adding node-to-node collective communications. It isn't terribly
hard to do, but it does take some work to make it robust.
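As a rough illustration of what such node-to-node collectives buy you, here is a recursive-doubling allgather simulated in plain Python: endpoint info spreads in log2(N) exchange rounds rather than N point-to-point transfers. This is a generic textbook algorithm, not Yarn or OMPI code, and it assumes a power-of-two node count for simplicity:

```python
def allgather(contributions):
    """contributions[i] is node i's data. Returns, for each node, the full
    map after a recursive-doubling exchange (node count must be a power of 2)."""
    n = len(contributions)
    known = [{i: contributions[i]} for i in range(n)]
    step = 1
    while step < n:
        # In each round, node i swaps everything it knows with node (i XOR step),
        # doubling the amount of information each node holds.
        next_known = [dict(k) for k in known]
        for i in range(n):
            next_known[i].update(known[i ^ step])
        known = next_known
        step *= 2
    return known

result = allgather(["ep0", "ep1", "ep2", "ep3"])
print(result[0] == result[3] and len(result[0]) == 4)  # True: 2 rounds, all agree
```

The robustness work mentioned above is in handling the parts this sketch ignores: failed peers, retransmission, and non-power-of-two topologies.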


I have looked at HoD and can understand the problems. The mistake (IMHO) was to attempt to
run it directly under Torque, which has its limitations. We avoid those by using an
adaptation of OMPI's mpirun tool, which eliminates most of the problems. So far, it seems
to be working quite well, though there is more to be done and, as always, unforeseen problems
to solve.

I suspect there are folks out there who will use Yarn and have interest in using MPI. However,
my assignment is to focus on creating the ability to dual-use clusters as this has become
a limiting issue, so that's where my emphasis currently lies. Once that is complete, and assuming
the powers-that-be agree, I will take another look at it. Perhaps by then the pain will have
lessened - frankly, I find Yarn to be very difficult to work with at the moment.


> Hamster: Hadoop And Mpi on the same cluSTER
> -------------------------------------------
>                 Key: MAPREDUCE-2911
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2911
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: All Unix-Environments
>            Reporter: Milind Bhandarkar
>            Assignee: Ralph H Castain
>             Fix For: 0.24.0
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> MPI is commonly used for many machine-learning applications. OpenMPI (http://www.open-mpi.org/)
> is a popular BSD-licensed version of MPI. In the past, running MPI application on a Hadoop
> cluster was achieved using Hadoop Streaming (http://videolectures.net/nipsworkshops2010_ye_gbd/),
> but it was kludgy. After the resource-manager separation from JobTracker in Hadoop, we have
> all the tools needed to make MPI a first-class citizen on a Hadoop cluster. I am currently
> working on the patch to make MPI an application-master. Initial version of this patch will
> be available soon (hopefully before September 10.) This jira will track the development of
> Hamster: The application master for MPI.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

