hadoop-common-user mailing list archives

From "Theodore Van Rooy" <munkey...@gmail.com>
Subject Re: Hadoop: Multiple map reduce or some better way
Date Wed, 26 Mar 2008 18:40:48 GMT
In my experience the advice above is good... the less reading and writing
you have to do at each step, the better.

While you could do map | reduce | map | reduce as you are proposing,
perhaps you could try several maps in a row instead, i.e.

map -> <word, doc> | no reduce -> map -> <scrubbed word, doc> |
reduce -> <scrubbed word, list(docs)>
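
To make that concrete, here is a minimal sketch of the two chained map
passes as Hadoop Streaming scripts. The input layout ("docid<TAB>text"),
the stop list, and the scrubbing rule are placeholder assumptions, not
anything from this thread; the final reduce is sketched further below.

#!/usr/bin/env python
# map1.py -- first pass, run as a map-only job (e.g. with
# mapred.reduce.tasks set to 0): emit <word, doc> for every token.
# Assumes each input line looks like "docid<TAB>text" (placeholder).
import sys
for line in sys.stdin:
    doc, sep, text = line.rstrip("\n").partition("\t")
    for word in text.split():
        sys.stdout.write(word + "\t" + doc + "\n")

#!/usr/bin/env python
# map2.py -- map of the second job: scrub the word key and drop stop
# words; the job's reduce then groups docs per word. STOP and the
# strip rule are made-up placeholders.
import sys
STOP = set(["a", "an", "the", "of", "and"])
for line in sys.stdin:
    word, sep, doc = line.rstrip("\n").partition("\t")
    word = word.strip(".,;:!?\"'()").lower()
    if word and word not in STOP:
        sys.stdout.write(word + "\t" + doc + "\n")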

Also, if you consider how Hadoop Streaming works, you might just write a
script in Python (or whatever) that does

stdin | map (everything you want to do in one script) | reduce (aggregate
the results of the previous map script)
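
That combined script plus an aggregating reduce might look like the
following sketch (same placeholder input layout, stop list, and scrubbing
rule as above; the file names are made up):

#!/usr/bin/env python
# mapper.py -- everything in one map pass: tokenize, scrub, drop stop
# words, and emit <scrubbed word, doc>.
import sys
STOP = set(["a", "an", "the", "of", "and"])
for line in sys.stdin:
    doc, sep, text = line.rstrip("\n").partition("\t")
    for word in text.split():
        word = word.strip(".,;:!?\"'()").lower()
        if word and word not in STOP:
            sys.stdout.write(word + "\t" + doc + "\n")

#!/usr/bin/env python
# reducer.py -- aggregate to <word, list(docs)>. Streaming hands the
# reducer its input sorted by key, so equal words arrive consecutively.
import sys
prev, docs = None, []
for line in sys.stdin:
    word, sep, doc = line.rstrip("\n").partition("\t")
    if prev is not None and word != prev:
        sys.stdout.write(prev + "\t" + ",".join(docs) + "\n")
        docs = []
    prev = word
    docs.append(doc)
if prev is not None:
    sys.stdout.write(prev + "\t" + ",".join(docs) + "\n")

You can sanity-check the pair locally before handing it to the streaming
jar, e.g.  cat docs.txt | ./mapper.py | sort | ./reducer.py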

Because your data set is spread out across x number of blocks, you may be
able to gain more parallelization speedup by simply doing everything you
want in one step and then aggregating it with a reduce. Though this
sidesteps the MapReduce paradigm of <key, value>, it achieves the benefit
of using Hadoop to handle the distribution of tasks and pieces of the file.



On Wed, Mar 26, 2008 at 12:19 PM, Arun C Murthy <arunc@yahoo-inc.com> wrote:

>
> On Mar 26, 2008, at 11:05 AM, Arun C Murthy wrote:
>
> >
> > On Mar 26, 2008, at 9:39 AM, Aayush Garg wrote:
> >
> >> Hi,
> >> I am developing a simple inverted index program with Hadoop. My map
> >> function has the output:
> >> <word, doc>
> >> and the reducer has:
> >> <word, list(docs)>
> >>
> >> Now I want to use one more MapReduce pass to remove stop and scrub
> >> words from this output. Also, in the next stage I would like to have
> >> a short summary associated with every word. How should I design my
> >> program from this stage? I mean, how would I apply multiple
> >> MapReduce passes to this? What would be the better way to perform
> >> this?
> >>
> >
> > In general you are better off with a smaller number of Map-Reduce
> > jobs ... less I/O works better.
> >
>
> I forgot to add that you can use the APIs in JobClient and JobControl
> to chain jobs together ...
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Job+Control
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#JobControl
>
> Arun
>
> > Use the DistributedCache if you can and fix your first Map to not
> > emit the stop words at all. Use the combiner to crunch down the
> > amount of intermediate map outputs etc.
> >
> > Something useful to look at:
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
> >
> > Arun
> >
> >> Thanks,
> >>
> >> Regards,
> >> -
> >> Aayush Garg,
> >> Phone: +41 76 482 240
> >
>
>


-- 
Theodore Van Rooy
http://greentheo.scroggles.com
