hadoop-common-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: Is Hadoop the right platform for my HPC application?
Date Tue, 13 Sep 2011 19:22:12 GMT
Another option to think about is the Hamster project (MAPREDUCE-2911 <https://issues.apache.org/jira/browse/MAPREDUCE-2911>),
which will allow OpenMPI to run on a Hadoop cluster.  It is still very preliminary and will
probably not be ready until Hadoop 0.23 or 0.24.

There are other processing methodologies being developed to run on top of YARN (which is the
resource scheduler put in as part of Hadoop 0.23): http://wiki.apache.org/hadoop/PoweredByYarn

So there are even more choices coming depending on your problem.

--Bobby Evans

On 9/13/11 12:54 PM, "Parker Jones" <zoubidoo@hotmail.com> wrote:

Thank you for the explanations, Bobby.  That helps significantly.

I also read the article below which gave me a better understanding of the relative merits
of MapReduce/Hadoop vs MPI.  Alberto, you might find it useful too.

There is even a MapReduce API built on top of MPI developed at Sandia.

So many options to choose from :-)


> From: evans@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Mon, 12 Sep 2011 14:02:44 -0700
> Subject: Re: Is Hadoop the right platform for my HPC application?
> Parker,
> The hadoop command itself is just a shell script that sets up your classpath and some
environment variables for a JVM.  Hadoop provides a Java API, and you should be able to use
it to write your application without dealing with the command line.  That being said, there is
no Map/Reduce C/C++ API.  There is libhdfs.so, which will allow you to read/write HDFS files
from a C/C++ program, but it actually launches a JVM behind the scenes to handle the actual
> As for a way to avoid writing your input data into files, the data has to be distributed
to the compute nodes somehow.  You could write a custom input format that does not use any
input files, and then have it load the data a different way.  I believe that some people do
this to load data from MySQL or some other DB for processing.  Similarly you could do something
with the output format to put the data someplace else.
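To make the idea of an input format without input files more concrete, here is a minimal, language-neutral sketch in Python (it is not Hadoop code). It assumes the work can be described as a range of task indices. `make_splits` and `read_records` are hypothetical names; they play roughly the role that `InputFormat.getSplits` and a `RecordReader` play in Hadoop's Java API, where each split would be handed to one map task.

```python
from typing import Iterator, List, Tuple

# A split is a half-open range [start, end) of task indices (an assumption of this sketch).
Split = Tuple[int, int]


def make_splits(total_tasks: int, num_splits: int) -> List[Split]:
    """Divide task indices 0..total_tasks-1 into at most num_splits contiguous ranges."""
    if total_tasks < 0 or num_splits <= 0:
        raise ValueError("total_tasks must be >= 0 and num_splits must be > 0")
    base, extra = divmod(total_tasks, num_splits)
    splits: List[Split] = []
    start = 0
    for i in range(num_splits):
        # The first `extra` splits take one additional task so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer tasks than splits: stop instead of emitting empty splits
        splits.append((start, start + size))
        start += size
    return splits


def read_records(split: Split) -> Iterator[Tuple[int, int]]:
    """Yield (key, value) records for one split without reading any file."""
    start, end = split
    for task_id in range(start, end):
        # Placeholder payload; a real reader might query a database here instead.
        yield task_id, task_id * task_id
```

For example, `make_splits(10, 3)` gives three ranges covering tasks 0 through 9, and each range could be processed independently by one worker.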
> It is hard to say if Hadoop is the right platform without more information about what
you are doing.  Hadoop has been used for lots of embarrassingly parallel problems.  The processing
is easy; the real question is where your data is coming from and where the results are going.
Map/Reduce is fast in part because it tries to reduce data movement and move the computation
to the data, not the other way round.  Without knowing the expected size of your data or the
amount of processing that it will do, it is hard to say.
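One rough way to frame that question is to compare the time spent moving data with the time spent computing. The sketch below is only back-of-the-envelope arithmetic under a simple linear model. The function name, its parameters, and every number in the usage note are made-up assumptions, not measurements of any Hadoop cluster.

```python
from typing import Tuple


def offload_estimate(
    moved_bytes: float,
    link_bytes_per_sec: float,
    serial_compute_sec: float,
    num_workers: int,
    job_overhead_sec: float = 0.0,
) -> Tuple[float, float]:
    """Return (local_sec, offloaded_sec) for one large computation.

    moved_bytes        -- input plus result data shipped between the application and the cluster
    link_bytes_per_sec -- assumed sustained throughput of that link
    serial_compute_sec -- assumed time to run the whole computation on one machine
    num_workers        -- assumed number of tasks running in parallel with perfect scaling
    job_overhead_sec   -- assumed fixed cost of submitting and starting a job
    """
    if link_bytes_per_sec <= 0 or num_workers <= 0:
        raise ValueError("link_bytes_per_sec and num_workers must be positive")
    transfer_sec = moved_bytes / link_bytes_per_sec
    parallel_compute_sec = serial_compute_sec / num_workers
    offloaded_sec = job_overhead_sec + transfer_sec + parallel_compute_sec
    return serial_compute_sec, offloaded_sec
```

With hypothetical inputs of 1 GB moved over a 100 MB/s link, 600 s of serial compute, 20 workers, and 30 s of job overhead, `offload_estimate(1e9, 1e8, 600, 20, 30)` works out to about 70 s offloaded versus 600 s locally. If the data were much larger or the computation much shorter, the transfer and overhead terms would dominate and offloading would stop paying off.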
> --Bobby Evans
> On 9/12/11 5:09 AM, "Parker Jones" <zoubidoo@hotmail.com> wrote:
> Hello all,
> I have Hadoop up and running and an embarrassingly parallel problem but can't figure
out how to arrange the problem.  My apologies in advance if this is obvious and I'm not getting
> My HPC application isn't a batch program, but runs in a continuous loop (like a server)
*outside* of the Hadoop machines, and it should occasionally farm out a large computation
to Hadoop and use the results.  However, all the examples I have come across interact with
Hadoop via files and the command line.  (Perhaps I am looking in the wrong places?)
> So,
> * is Hadoop the right platform for this kind of problem?
> * is it possible to use Hadoop without going through the command line and writing all
input data to files?
> If so, could someone point me to some examples and documentation?  I am coding in C/C++
in case that is relevant, but examples in any language should be helpful.
> Thanks for any suggestions,
> Parker
