hadoop-mapreduce-issues mailing list archives

From "Ralph H Castain (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2911) Hamster: Hadoop And Mpi on the same cluSTER
Date Wed, 11 Apr 2012 21:17:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251927#comment-13251927 ]

Ralph H Castain commented on MAPREDUCE-2911:

Appreciate your thoughts. I think we generally agree on purpose, but maybe not so much on
method. In my mind, bringing map-reduce to the MPI world in a first-class manner is a simpler,
and ultimately more useful, solution, as HPC clusters already exist and organizations know
how to manage them. That certainly won't be true at places like Yahoo, but much of the business
world has HPC systems (albeit small ones) in various departments.

{quote}MPI jobs typically do run for long times, though surveys report that roughly 70% of
them run for one hour or less (but more than a few minutes).{quote}

{quote}Like I said, I'm sure it matters in the HPC world, but I'm pretty sure it's much less of an
issue in the Hadoop world even as we work to improve container launch etc. As a result, I'm
not very worried (at this stage) about this aspect for Hamster on YARN.{quote}

I think this is largely an issue of scale. While Hadoop is deployed on large clusters, it
seems to generally run small jobs - i.e., a job consisting of 100 procs would be considered
fairly large. In the HPC world, a 100-proc job is considered tiny. Those one-hour jobs typically
consist of hundreds to thousands of processes. Using the observed scaling behavior, YARN would
take the better part of an hour to launch and wireup an MPI job of that size.

{quote} the requirement that we launch one container at a time (and copy binaries et al),
will be difficult to overcome {quote}

{quote}I'm not sure I follow. There is no requirement to launch one container at a time,
how did that come up? An ApplicationMaster can, and should, launch multiple containers via
threads etc. The MR AppMaster does that already.{quote}

Yes/no. Unfortunately, MPI procs need to know who is collocated with them on a node at startup
- otherwise, you make a major sacrifice in performance. So the AM cannot start launching until
ALL nodes have been allocated. While it's true you can then make a threaded launcher to help
reduce the time, it's still pretty much a linear launch pattern when you plot it out.
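To make the ordering constraint concrete, here is a minimal Python sketch of the planning step being described. This is not the actual OMPI or YARN API - the function and field names are hypothetical - but it shows why nothing can launch until the full allocation is known: each rank's set of collocated peers is a function of the complete node list.

```python
from collections import defaultdict

def plan_mpi_launch(allocated_containers):
    """Compute per-rank launch info; requires the FULL allocation up front.

    allocated_containers: list of dicts like {"node": "host17"}, one per
    container, in rank order. With only a partial allocation, the local-peer
    sets below would be wrong, so launch must wait for the last container.
    """
    # Group ranks by node so each process can learn its collocated peers.
    by_node = defaultdict(list)
    for rank, container in enumerate(allocated_containers):
        by_node[container["node"]].append(rank)

    # Only now can per-process launch environments be emitted. Even if the
    # actual container launches are threaded, this emit/launch loop is what
    # produces the near-linear launch pattern described above.
    plans = []
    for rank, container in enumerate(allocated_containers):
        plans.append({
            "rank": rank,
            "node": container["node"],
            "local_peers": by_node[container["node"]],  # collocated ranks
        })
    return plans
```

For example, three containers on two nodes yield local-peer sets that could not have been computed before the third allocation arrived.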

{quote}The really big clusters do this with sophisticated network design (limiting the number of
nodes served by each NFS server, multiple subnets) and TCP-over-IB protocols running at ~50Gbps,
so systems connected via lesser networks can't compete.{quote}

{quote}That's fundamentally the reason Hadoop MapReduce Classic and YARN use HDFS to make the system
more scalable/easy-to-manage with the other characteristics it brings along.{quote}

Again, a question of perspective. The HPC world doesn't use the TCP networks for storing data
files - only application programs. The data sits on parallel file systems that experiments
have shown to be comparable or slightly faster than HDFS, and are easily managed. Again, HPC
depts are familiar with these systems and know how to manage them.

Keep in mind that HPC applications consume only kilobytes to a few megabytes of input data
- but generate petabytes of output. So the problem is the inverse of what HDFS attempts to
address. Where HDFS might come into play in HPC is therefore more in the visualization phase,
where those petabytes are consumed to produce megabyte-sized images at video rates.

The interest I've received has come more from the data analysis folks who have large amounts
of data, but wish to utilize existing HPC clusters to analyze it with MR techniques. They
may or may not use HDFS to do so - depends on the system administration.

{quote}please feel free to consider adding node-to-node collective communications{quote}

{quote}In the YARN design, the wireup is facilitated via the application's AppMaster. Thus each worker
process just needs to share its info with its AppMaster for wireup. Again, the MR AM already
does this for MR jobs.{quote}

And therein lies the problem. Each worker process has to contact the AM, thus requiring a
socket be opened and the required endpoint info sent to the AM, which must consume it. The
AM must then send the aggregated info separately to each process. Result: quadratic scaling.
Threading the AM helps a bit, but not a whole lot.
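A back-of-envelope calculation shows where the quadratic term comes from. The per-endpoint byte count below is an illustrative assumption, not a measured value; the point is only that inbound traffic grows as O(n) while the outbound aggregated map grows as O(n^2).

```python
def wireup_traffic(nprocs, endpoint_bytes=128):
    """Rough traffic model for AM-mediated wireup.

    Each worker sends one endpoint record to the AM (inbound), and the AM
    then sends the full aggregated map back to every worker (outbound).
    endpoint_bytes is an assumed record size for illustration only.
    """
    inbound = nprocs * endpoint_bytes               # n records into the AM
    outbound = nprocs * (nprocs * endpoint_bytes)   # full map to each of n workers
    return inbound, outbound

inbound, outbound = wireup_traffic(12_000)
# At 12k processes: ~1.5 MB flows into the AM, but ~18.4 GB flows back out,
# all funneled through one process - the quadratic term dominates.
```

Threading spreads the outbound sends across cores, but the total bytes through the AM's NICs are unchanged, which is why it "helps a bit, but not a whole lot."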

To give you an example, consider again my benchmark system (3k-node, 12k-process, launch and
wireup in 5sec). Using YARN, the launch + wireup time computes to nearly an hour. We are very
confident in those numbers because we have tested similar launch methods back in the days
before we updated the OMPI architecture. In fact, we updated our architecture in response
to those tests.
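The comment above reports the observed result without publishing its cost model, so the following is purely a hypothetical illustration: a per-container launch cost is assumed only to show how a near-linear pattern reaches that order of magnitude at 12k processes.

```python
def linear_launch_time(ncontainers, per_container_secs=0.25):
    """Hypothetical model: total time for a near-linear launch pattern.

    per_container_secs is an assumed effective serialized cost per
    container; it is NOT a measured YARN figure.
    """
    return ncontainers * per_container_secs

t = linear_launch_time(12_000)  # 3000 s, i.e. 50 minutes under this assumption
```

Contrast with the 5-second figure for the tuned OMPI launcher on the same 12k-process job: the gap is the scaling behavior, not any single constant.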

So for small jobs, YARN is fine. My concern is to provide an environment that can both run
the small Hadoop MR jobs *and* support the larger HPC jobs, all on the same cluster. Creating
a medium-performance version of MPI for Hadoop under YARN is a fairly low priority, to be
honest.


> Hamster: Hadoop And Mpi on the same cluSTER
> -------------------------------------------
>                 Key: MAPREDUCE-2911
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2911
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: All Unix-Environments
>            Reporter: Milind Bhandarkar
>            Assignee: Ralph H Castain
>             Fix For: 0.24.0
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> MPI is commonly used for many machine-learning applications. OpenMPI (http://www.open-mpi.org/)
> is a popular BSD-licensed version of MPI. In the past, running MPI application on a Hadoop
> cluster was achieved using Hadoop Streaming (http://videolectures.net/nipsworkshops2010_ye_gbd/),
> but it was kludgy. After the resource-manager separation from JobTracker in Hadoop, we have
> all the tools needed to make MPI a first-class citizen on a Hadoop cluster. I am currently
> working on the patch to make MPI an application-master. Initial version of this patch will
> be available soon (hopefully before September 10.) This jira will track the development of
> Hamster: The application master for MPI.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

