hadoop-user mailing list archives

From Zoltán Tóth-Czifra <zoltan.tothczi...@softonic.com>
Subject RE: Complex MapReduce applications with the streaming API
Date Tue, 27 Nov 2012 17:35:21 GMT
Hi,

Thanks, the self-referencing subworkflow is a good idea, it never occurred to me.
However, I'm still hoping for something more lightweight, with no Oozie or other external
tools.

My best idea now is simply wrapping the exec call in my script that submits the job (hadoop
jar hadoop-streaming.jar ...), extracting the JobId from its output, then wrapping another exec
(hadoop job -counter ...) which gives me info about the counters. Is this the best option?
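That approach can be sketched in Python roughly as follows. This is a sketch only: it assumes `hadoop` is on the PATH, the helper names are made up, and the "Running job: job_..." log line format is an assumption about the streaming submitter's output, not a stable API:

```python
import re
import subprocess

def extract_job_id(submit_output):
    """Pull the job id (job_<timestamp>_<seq>) out of the streaming
    submitter's output, e.g. '... Running job: job_201211271200_0001'."""
    match = re.search(r"job_\d+_\d+", submit_output)
    if match is None:
        raise RuntimeError("no job id found in submitter output")
    return match.group(0)

def run_phase(streaming_cmd):
    """Submit one streaming phase (the full 'hadoop jar
    hadoop-streaming.jar ...' argument list) and return its job id.
    The streaming submitter logs progress to stderr, so capture both."""
    proc = subprocess.run(streaming_cmd, capture_output=True, text=True)
    return extract_job_id(proc.stderr + proc.stdout)

def read_counter(job_id, group, counter):
    """Read one counter value via 'hadoop job -counter', which prints
    just the number to stdout."""
    out = subprocess.check_output(
        ["hadoop", "job", "-counter", job_id, group, counter], text=True)
    return int(out.strip())
```

The only Hadoop-specific parts are the two exec calls; everything else is plain string parsing, so the same pattern works from PHP or Perl as well.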

Thanks!

________________________________
From: Alejandro Abdelnur [tucu@cloudera.com]
Sent: Tuesday, November 27, 2012 6:10 PM
To: common-user@hadoop.apache.org
Subject: Re: Complex MapReduce applications with the streaming API

> Using Oozie seems to be overkill for this application; besides, it doesn't support
> "loops", so the recursion can't really be implemented.

Correct, Oozie does not support loops; this is a deliberate design restriction (early
prototypes did support them). The idea was to avoid never-ending workflows. Recurrent
runs of workflow jobs are instead addressed by Coordinator Jobs.

Still, if you want to do recursion in Oozie, you certainly can: have a workflow invoke itself
as a sub-workflow. Just make sure you define your exit condition properly.
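A self-invoking workflow could look roughly like the sketch below. This is an illustration only: the node names, the counter group/name, and the exact EL predicate are hypothetical and would need adjusting; the structural pieces (decision node, sub-workflow action, `wf:appPath()`) are standard Oozie:

```xml
<!-- Sketch: a workflow that re-invokes itself as a sub-workflow
     until a counter-based exit condition is met. -->
<workflow-app name="recursive-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="streaming-phase"/>
    <action name="streaming-phase">
        <map-reduce> <!-- streaming phase config elided --> </map-reduce>
        <ok to="check-done"/>
        <error to="fail"/>
    </action>
    <decision name="check-done">
        <switch>
            <!-- exit condition: recurse only while work remains;
                 MY_GROUP/REMAINING is a hypothetical counter -->
            <case to="recurse">
                ${hadoop:counters('streaming-phase')['MY_GROUP']['REMAINING'] gt 0}
            </case>
            <default to="end"/>
        </switch>
    </decision>
    <action name="recurse">
        <sub-workflow>
            <!-- the workflow points at its own app path -->
            <app-path>${wf:appPath()}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail"><message>Streaming phase failed</message></kill>
    <end name="end"/>
</workflow-app>
```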

If you have additional questions, please move this thread to the user@oozie.apache.org
alias.


Thx


On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra <zoltan.tothczifra@softonic.com>
wrote:
Hi everyone,

Thanks in advance for the support. My problem is the following:

I'm trying to develop a fairly complex MapReduce application using the streaming API (for
demonstration purposes, so unfortunately the "use Java" answer doesn't work :-( ). I can get
a single MapReduce phase running from the command line with no problem. The problem is when
I want to add more MapReduce phases that use each other's output, and maybe even do a
recursion (feed a phase's output back into the same phase), conditioned by a counter.

The solution in Java MapReduce is trivial (i.e. creating multiple Job instances and monitoring
counters), but with the streaming API it is not. What is the correct way to manage my application
from its native code (Python, PHP, Perl...)? Calling shell commands from a "controller" script?
How should I obtain counters?
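The counter-conditioned recursion reduces to a driver loop. A minimal sketch, with the two Hadoop interactions abstracted as callables (in practice they would wrap the `hadoop jar ...` submission and a `hadoop job -counter` read, both of which are assumptions about how the wiring would look):

```python
def iterate_until_converged(run_phase, read_counter, max_iters=10):
    """Re-run the same streaming phase on its own output until a
    user-defined counter drops to zero, or max_iters is hit.

    run_phase(iteration) -> job id (submits one streaming phase);
    read_counter(job_id) -> int (reads the convergence counter).
    Returns the number of iterations actually run.
    """
    for i in range(max_iters):
        job_id = run_phase(i)
        if read_counter(job_id) == 0:
            return i + 1
    raise RuntimeError("did not converge within max_iters iterations")
```

Passing the phase runner and counter reader as callables keeps the loop itself testable without a cluster; the same shape works in any scripting language.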

Using Oozie seems to be overkill for this application; besides, it doesn't support "loops",
so the recursion can't really be implemented.

Thanks a lot!
Zoltan



--
Alejandro
