hadoop-common-user mailing list archives

From Dieter Plaetinck <dieter.plaeti...@intec.ugent.be>
Subject can a `hadoop jar streaming.jar` command return when a job is packaged and submitted?
Date Fri, 06 May 2011 14:09:41 GMT
Hi,
I have a script something like this (simplified):

for i in $(seq 1 200); do
   # regenerate the files in $dir that job $i (and only job $i) needs
   regenerate-files "$dir" "$i"
   hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.job.name="$i" \
        -file "$dir" \
        -mapper "..." -reducer "..." -input "$i-input" -output "$i-output"
done

So I want to launch 200 Hadoop jobs, each of which needs files from $dir; more
specifically, some files in $dir are regenerated for use with job $i (and only that job).
The problem is that generating those files takes some time, and the hadoop jar command
currently packages and submits the job and then waits for it to complete, so every
regenerate-files run adds a needless delay between jobs.
Is there a way to make the hadoop jar command return as soon as the job is packaged
and submitted?
I obviously cannot just background the hadoop calls, because that would start regenerating
the files while previous jobs are still being packaged.
I thought about generating those files in a different directory per job, but that would
needlessly consume disk space.
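
The closest I have to a workaround is to background each call but wait until streaming
reports the job as submitted before starting the next regeneration. This is an untested
sketch; it assumes the 0.20.2 streaming driver prints a line containing "Running job:"
once the job jar is packaged and submitted (please check the exact wording in your
version), and the log path is just something I made up:

for i in $(seq 1 200); do
   regenerate-files "$dir" "$i"
   log="/tmp/streamjob-$i.log"            # hypothetical per-job log file
   hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.job.name="$i" \
        -file "$dir" \
        -mapper "..." -reducer "..." -input "$i-input" -output "$i-output" \
        >"$log" 2>&1 &
   pid=$!
   # block until this job has been submitted (or the command has died)
   while kill -0 "$pid" 2>/dev/null && ! grep -q "Running job:" "$log"; do
        sleep 1
   done
done
wait    # let all 200 jobs finish before the script exits

But that feels fragile, since it scrapes log output instead of using a real API.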

I've been looking at http://hadoop.apache.org/mapreduce/docs/current/streaming.html and googling
for answers, but couldn't find a solution.
I found http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobClient.html
but that doesn't seem to work with streaming.
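The only way I can imagine combining the two is something like the untested sketch
below. I am assuming here that StreamJob.createJob(String[]) is available in the
0.20.2 streaming jar to build a JobConf from the usual streaming arguments (I am not
sure it is), and relying on JobClient.submitJob() returning as soon as the job is
submitted, unlike JobClient.runJob(), which blocks until completion:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.streaming.StreamJob;

public class SubmitStreamingJob {
    public static void main(String[] args) throws Exception {
        String i = args[0];    // job number, as in the shell loop
        String dir = args[1];  // directory with the regenerated files
        // build the streaming JobConf from the usual command-line arguments
        // (assumes StreamJob.createJob(String[]) exists in the streaming jar)
        JobConf conf = StreamJob.createJob(new String[] {
            "-file", dir,
            "-mapper", "...", "-reducer", "...",
            "-input", i + "-input", "-output", i + "-output"
        });
        conf.setJobName(i);
        // submitJob() returns once the job is packaged and submitted,
        // without waiting for it to run to completion
        RunningJob job = new JobClient(conf).submitJob(conf);
        System.out.println("submitted " + job.getID());
    }
}

If that worked, the loop could go straight back to regenerating files while the
submitted job runs, which is exactly what I'm after.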

thanks,
Dieter
