hadoop-mapreduce-user mailing list archives

From Narayanan K <knarayana...@gmail.com>
Subject Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster
Date Fri, 01 Jul 2011 07:27:55 GMT
Hi Harsh

Thanks for the quick response...

I have a few clarifications regarding the first point:

Let me give you the background first.

We have set up a Hadoop cluster with HBase installed. We plan to load HBase
with data, perform some computations on it, and present the results in a
report format. The report should be accessible from outside the cluster; it
accepts certain parameters to select the data, and these parameters will in
turn be passed on to the Hadoop master server, where a MapReduce job will be
run that queries HBase to retrieve the data.

The report will be run from a different machine outside the cluster, so we
need a way to pass the parameters to the Hadoop cluster (master) and initiate
a MapReduce job dynamically. Similarly, the output of the MapReduce job needs
to be returned to the machine from which the report was run.

One more clarification I need: does the machine (outside the cluster) that
runs the report require something like a client installation that talks to the
Hadoop master server over TCP? Or can it run a job on the Hadoop cluster by
using passwordless SCP to the master machine, or something along those lines?


Regards,
Narayanan




On Fri, Jul 1, 2011 at 11:41 AM, Harsh J <harsh@cloudera.com> wrote:

> Narayanan,
>
>
> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K <knarayanan88@gmail.com>
> wrote:
> > Hi all,
> >
> > We are basically working on a research project and I require some help
> > regarding this.
>
> Always glad to see research work being done! What're you working on? :)
>
> > How do I submit a mapreduce job from outside the cluster i.e from a
> > different machine outside the Hadoop cluster?
>
> If you use Java APIs, use the Job#submit(…) method and/or
> JobClient.runJob(…) method.
> Basically Hadoop will try to create a jar with all requisite classes
> within and will push it out to the JobTracker's filesystem (HDFS, if
> you run HDFS). From there on, it's like a regular operation.
>
> This even happens on the Hadoop nodes themselves, so doing it from an
> external machine, as long as that machine has access to Hadoop's JT and
> HDFS, should be no different at all.
>
> If you are packing custom libraries along, don't forget to use
> DistributedCache. If you are packing custom MR Java code, don't forget
> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
> API methods.
>
> > If the above can be done, How can I schedule map reduce jobs to run in
> > hadoop like crontab from a different machine?
> > Are there any webservice APIs that I can leverage to access a hadoop
> cluster
> > from outside and submit jobs or read/write data from HDFS.
>
> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
> It is well supported and is very useful in writing MR workflows (which
> is a common requirement). You also get coordinator features and can
> schedule jobs with crontab-like functionality.
>
> For HDFS r/w over the web, I'm not sure of an existing web app specifically
> for this purpose without limitations, but there is a contrib/thriftfs
> you can leverage (if not writing your own webserver in Java, in
> which case it's as simple as using the HDFS APIs).
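
For instance, a small sketch of direct HDFS access from Java with the
FileSystem API (the NameNode URI and the paths are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's NameNode (placeholder host/port).
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://master.example.com:9000"), new Configuration());

    // Write a small file into HDFS.
    FSDataOutputStream out = fs.create(new Path("/reports/params.txt"));
    out.writeBytes("start=2011-06-01,end=2011-06-30\n");
    out.close();

    // Read a job's output back, e.g. to render a report page.
    BufferedReader in = new BufferedReader(new InputStreamReader(
        fs.open(new Path("/data/report-output/part-r-00000"))));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
    fs.close();
  }
}
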
>
> Also have a look at the pretty mature Hue project which aims to
> provide a great frontend that lets you design jobs, submit jobs,
> monitor jobs and upload files or browse the filesystem (among several
> other things): http://cloudera.github.com/hue/
>
> --
> Harsh J
>
