hadoop-mapreduce-issues mailing list archives

From "Ralph H Castain (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2911) Hamster: Hadoop And Mpi on the same cluSTER
Date Wed, 11 Apr 2012 21:29:17 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251937#comment-13251937 ]

Ralph H Castain commented on MAPREDUCE-2911:
--------------------------------------------

Hi Steve

{quote}If you look at the UK university grid http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
you can see that although there are lots of clusters, they are of limited storage capacity
-that storage also forces you to choose where to run the work or rely on job preheating to
pull it in from RAL or elsewhere. (latency to do this is lower than pulling off tape). You
can also see that there are a lot of jobs in the queues, including short-lived health tests
that verify work reaches the expected answers. I don't know about the duration/needs of the
actual work jobs.

When you consider job startup delays you have to look at time to fetch data over long-haul
connections, maybe compile code for target cluster, and recognise that without a SAN you can't
expect uniform access times to all data.
{quote}

A grid is very different from HPC clusters, which are far more common (grids have been dying
out over the last few years). We never see data pulled over long-haul connections - frankly,
you don't see people doing it any more on grids either due to the unreliability and delays
in delivery. HPC clusters are almost always homogeneous (I think I've seen two heterogeneous
HPC clusters outside of a lab so far), and generally are backed by a parallel file system
that actually does provide pretty uniform access times. Remember: MPI jobs use MPI-IO to fetch/write
data, and they write a lot more data than they read (as per my prior note).

Thus, once an allocation is given, there is no startup delay like you describe. There is some
time required to load binaries and libs onto each node, but that scales well and goes very
fast. As per my other note, we figured out how to solve that a while back. :-)


{quote}What you would get from MPI over hadoop is the ability to run MPI work on the cluster
-a cluster which, if it also had infiniband on, would have low-latency interconnections. (yes,
there is a cost for that, but you may want it for a shared cluster).
{quote}

Agreed - so long as the MPI job is small enough, it should work.

{quote}What about an MPI mechanism that has a Grid Scheduler that block-rents a set of machines
that can then be used for multiple jobs off the MPI queue, and which aren't released after
each job? Once the capacity on the hosts is allocated, health checks can verify the machines
work properly, then it can await work. The scheduler can look at the pending queue and flex
its set of machines based on expected load?

Job startup would be reduced to the time to push out work to the pre-allocated hosts, which
doesn't need to rely on heartbeats and could use Zookeeper or other co-ordination services.

This wouldn't be a drop-in replacement for one of the big supercomputing clusters, but it
would let people run MPI jobs within a Hadoop cluster.
{quote}

I'm not sure how that would work - I guess you would have to interface something like OGE/SGE
to Yarn so that it could "rent" machines from Yarn? As Milind noted, that interface is non-trivial
today. I've talked to the GE folks about it (as well as to the other major HPC RM orgs), but
they don't have much interest in providing such a capability - they are far more interested
in the reverse approach (i.e., running MR on an HPC cluster).

The situation could change as time passes and the interface stabilizes/becomes easier.

HTH
Ralph

                
> Hamster: Hadoop And Mpi on the same cluSTER
> -------------------------------------------
>
>                 Key: MAPREDUCE-2911
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2911
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: All Unix-Environments
>            Reporter: Milind Bhandarkar
>            Assignee: Ralph H Castain
>             Fix For: 0.24.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> MPI is commonly used for many machine-learning applications. OpenMPI (http://www.open-mpi.org/)
> is a popular BSD-licensed version of MPI. In the past, running MPI applications on a Hadoop
> cluster was achieved using Hadoop Streaming (http://videolectures.net/nipsworkshops2010_ye_gbd/),
> but it was kludgy. After the resource-manager separation from JobTracker in Hadoop, we have
> all the tools needed to make MPI a first-class citizen on a Hadoop cluster. I am currently
> working on the patch to make MPI an application-master. An initial version of this patch will
> be available soon (hopefully before September 10). This jira will track the development of
> Hamster: The application master for MPI.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
