hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Hadoop for Independant Tasks not using Map/Reduce?
Date Wed, 19 Aug 2009 23:43:31 GMT
Hadoop Streaming still expects to run a "MapReduce" job, but you can bend
that definition, e.g., by emitting no output data and disabling the reduce
phase (setting the number of reduce tasks to zero).
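
As a rough sketch of what such a "map-only, no output" streaming mapper
could look like: the script below reads one task description per line from
stdin, runs a program with those arguments, and writes nothing to stdout.
The function names and the wrapper script "./my_program.sh" are hypothetical.
The tab-prefix handling assumes that some input formats hand the mapper a
byte-offset key before the line, which you should verify on your version.

```python
import subprocess
import sys


def strip_offset_key(line):
    """Drop a leading '<digits><TAB>' key if present (assumed streaming behavior)."""
    key, sep, rest = line.partition("\t")
    return rest if sep and key.isdigit() else line


def run_tasks(lines, program, log=sys.stderr):
    """Run `program` once per non-empty line; return the number of failed runs."""
    failures = 0
    for line in lines:
        args = strip_offset_key(line).split()
        if not args:
            continue  # skip blank lines
        # Route the child's stdout to the log so nothing becomes map output.
        status = subprocess.call([program] + args, stdout=log, stderr=log)
        if status != 0:
            failures += 1
            log.write("task failed (exit %d): %s\n" % (status, " ".join(args)))
    return failures


def main():
    # "./my_program.sh" is a hypothetical wrapper shipped alongside the job.
    return 1 if run_tasks(sys.stdin, "./my_program.sh") else 0
```

Used as a mapper, the script would end with sys.exit(main()), so that a
failed run marks the map task as failed and shows up in the job's task
monitoring.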

The number of map tasks to run is controlled by the number of
"InputSplits" -- divisions of some arbitrary piece of input -- that a job
contains. How the splits are created is decided by the InputFormat you
select. With the default file-based formats, each input file yields at
least one split, and large files are divided further, roughly one split per
HDFS block. You might want to look at NLineInputFormat, which makes a
separate split out of every N lines of a file, with N defaulting to 1. So
you write a file to HDFS with one line (maybe containing some arguments) for
each instance of your program you want to run. When you launch your job,
point it at this input, and it'll launch the desired number of copies of
your program on whichever nodes in your cluster the scheduler picks.
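
To make the "one line per instance" idea concrete, here is a small sketch
that generates such a task file from a set of parameter values. The function
names, the "--name value" argument style, and the file names are
illustrative assumptions, not anything Hadoop prescribes. The launch command
in the trailing comments uses placeholder paths, and the exact option names
may differ between Hadoop versions.

```python
import itertools


def task_lines(param_grid):
    """Yield one line of command-line arguments per parameter combination."""
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        yield " ".join("--%s %s" % (n, v) for n, v in zip(names, values))


def write_task_file(path, param_grid):
    """Write one task per line; return the line count (= map tasks expected)."""
    count = 0
    with open(path, "w") as f:
        for line in task_lines(param_grid):
            f.write(line + "\n")
            count += 1
    return count


# Hypothetical usage:
#   write_task_file("tasks.txt", {"seed": [1, 2, 3], "mode": ["fast", "full"]})
#
# Then copy the file to HDFS and launch a map-only streaming job
# (jar location, paths, and mapper script name are placeholders):
#   hadoop fs -put tasks.txt tasks/tasks.txt
#   hadoop jar <path-to-streaming-jar> \
#       -D mapred.reduce.tasks=0 \
#       -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
#       -input tasks/tasks.txt -output tasks-out \
#       -mapper run_task.py -file run_task.py
```

With one line per split, the number of lines in the task file is the number
of map tasks the job should start.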

I don't know of any specific examples of this in use to point you to, but
this should be enough to get you started.

- Aaron

On Wed, Aug 19, 2009 at 7:05 AM, Poole, Samuel [USA] wrote:

> I am new to Hadoop (I have not yet installed/configured), and I want to
> make sure that I have the correct tool for the job.  I do not "currently"
> have a need for the Map/Reduce functionality, but I am interested in using
> Hadoop for task orchestration, task monitoring, etc. over numerous nodes in
> a computing cluster.  Our primary programs (written in C++ and launched via
> shell scripts) each run independently on a single node, but are deployed to
> different nodes for load balancing.  I want to task/initiate these processes
> on different nodes through a Java program located on a central server.  I
> was hoping to use Hadoop as a foundation for this.
> I read the following in the FAQ section:
> "How do I use Hadoop Streaming to run an arbitrary set of
> (semi-)independent tasks?
> Often you do not need the full power of Map Reduce, but only need to run
> multiple instances of the same program - either on different parts of the
> data, or on the same data, but with different parameters. You can use Hadoop
> Streaming to do this. "
> So, two questions I guess.
> 1.  Can I use Hadoop for this purpose without using Map/Reduce
> functionality?
> 2.  Are there any examples available on how to implement this sort of
> configuration?
> Any help would be greatly appreciated.
> Sam
