hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Re: Nested map reduce job
Date Sat, 05 May 2012 21:32:29 GMT
A quick glance at your problem suggests a design problem in your code.
In my opinion you should avoid nested Map/Reduce jobs. You could chain
Map/Reduce jobs, but a nested or recursive structure is not recommended.
I don't know how you implemented your nested M/R job; could you show a
code fragment? For the permutation problem, it might be easier to split
the permutation candidates among the Mappers, then sort (and discard
duplicate values) at the Reducers. Permuting 3 million values seems
huge. Are you sure you want to permute all 3 million values (what
problem requires that?), or do you just need to permute a small set
sampled from those 3 million values?
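
To make the "split the candidates" idea concrete, here is a rough sketch
in plain Java, not Hadoop code. It assumes each mapper is handed one
choice of first element and emits only the permutations that start with
it, and that the reducer side is just a sorted set that drops duplicates.
The class and method names are made up for illustration, and this only
works for small inputs, since n values have n! permutations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class PrefixSplitPermutations {

    // "Map" side (hypothetical): emit every permutation whose first
    // element is values.get(first). Each prefix is an independent unit of work.
    static List<String> permutationsWithPrefix(List<String> values, int first) {
        List<String> rest = new ArrayList<>(values);
        String head = rest.remove(first);
        List<String> out = new ArrayList<>();
        permute(rest, 0, head, out);
        return out;
    }

    // Swap-based recursion over rest[from..]; records head + rest at each leaf.
    private static void permute(List<String> rest, int from, String head, List<String> out) {
        if (from == rest.size()) {
            List<String> full = new ArrayList<>();
            full.add(head);
            full.addAll(rest);
            out.add(String.join("-", full));
            return;
        }
        for (int i = from; i < rest.size(); i++) {
            Collections.swap(rest, from, i);
            permute(rest, from + 1, head, out);
            Collections.swap(rest, from, i); // restore order before the next choice
        }
    }

    // "Reduce" side (hypothetical): merge all mapper outputs into a sorted
    // set, which discards the duplicates caused by repeated input values.
    static Set<String> allPermutations(List<String> values) {
        Set<String> merged = new TreeSet<>();
        for (int first = 0; first < values.size(); first++) {
            merged.addAll(permutationsWithPrefix(values, first));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>(List.of("a", "a", "b"));
        System.out.println(allPermutations(values)); // prints [a-a-b, a-b-a, b-a-a]
    }
}
```

In a real job the loop in allPermutations would be replaced by the
framework: one input record per prefix, with the shuffle and sort doing
the merging.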


On 5/5/2012 4:16 PM, venkataswamy wrote:
> Hi,
>     I ran into a strange issue while developing a system. I have data where a
> reducer receives about 3 million values. The reducer emits all the
> permutations of the values.
> Reducer{
>     List<values>
>     FindPermutations(List<values>)
>     foreach( permutation )
>         emit( key, permutation )
> }
> Holding the values in memory to calculate permutations is feasible if the
> number of values is low, i.e. say less than 10,000. Otherwise, this is not
> scalable, even from a computational point of view.
> I tried writing the values to a file, moving it to HDFS, and starting a new
> mapreduce job for the permutations from the reducer, which distributes the
> reducer's load among the available machines. Let me call it a nested
> mapreduce job. The task waits until the nested job completes and uses the
> obtained result to emit the permutations. The parent job's task sits idle, so
> the nested job's tasks could run on the same tasktracker, but the tasktracker
> does not run them. Is there a way to signal the tasktracker that the current
> task is paused or sitting idle, without terminating it?
> All the available tasktrackers are running the parent mapreduce job's tasks,
> so the nested mapreduce job never gets resources to start and the system
> falls into a deadlock.
> I can suspend the parent task after starting a nested job for permutations,
> but it does continue from the same instruction when it resumes. In simple
> words, the parent task is not pausing but suspending.
> Has anybody run into this situation? If you have any thoughts on it, please
> post them here.
> All your help is appreciated.
> Thanks,
> Venkat
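
The deadlock described above can be reproduced in miniature. The sketch
below is only an analogy, not Hadoop code: it assumes a fixed-size Java
thread pool stands in for the cluster's task slots, a "parent" is a task
that submits a "nested" task to the same pool and waits for it, and the
timeout exists only so the demo terminates instead of hanging. All names
are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SlotStarvationDemo {

    // Runs `parents` parent tasks on a pool with `slots` threads and returns
    // how many parents gave up waiting for their nested task.
    static int runNested(int parents, int slots, long timeoutMillis) throws Exception {
        if (slots < parents) {
            throw new IllegalArgumentException("demo needs slots >= parents");
        }
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        CountDownLatch allStarted = new CountDownLatch(parents);
        CountDownLatch allDecided = new CountDownLatch(parents);
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (int i = 0; i < parents; i++) {
                results.add(pool.submit(() -> {
                    allStarted.countDown();
                    allStarted.await(); // every parent now holds a slot
                    Future<String> child = pool.submit(() -> "nested result");
                    boolean starved;
                    try {
                        child.get(timeoutMillis, TimeUnit.MILLISECONDS);
                        starved = false; // a spare slot ran the nested task
                    } catch (TimeoutException e) {
                        child.cancel(false);
                        starved = true;  // no free slot: the nested task never started
                    }
                    allDecided.countDown();
                    allDecided.await(); // keep the slot until all parents have decided
                    return starved;
                }));
            }
            int starvedCount = 0;
            for (Future<Boolean> r : results) {
                if (r.get()) {
                    starvedCount++;
                }
            }
            return starvedCount;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("2 parents, 2 slots, starved: " + runNested(2, 2, 200));
        System.out.println("2 parents, 3 slots, starved: " + runNested(2, 3, 2000));
    }
}
```

With as many parents as slots, every nested task waits behind the parents
that are waiting for it. With one spare slot the nested tasks get through,
which is why running the second job from the driver after the first job
finishes (a chain) avoids the problem: no slot is held while waiting.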
