Thank you for the link, Edward. I'll take a look at HAMA.
Does anyone knwo if there is a way to limit the upper bound of maps being
produced? I see now that mapred.tasktracker.tasks.maximum really does not
limit the number of maps, as the number of maps is determined by
InputFormat. Aside from modifying the number of lines that InputSplit()
splits on, is there no way to limit the number of maps spawned?
Thanks again to everyone for all their help. I'm very grateful.
SM
On Fri, Sep 19, 2008 at 8:29 PM, Edward J. Yoon <edwardyoon@apache.org>wrote:
> > Has anyone implemented a matrix in parallel using MapReduce?
>
> See this project : http://wiki.apache.org/hama
>
> On Sat, Sep 20, 2008 at 5:00 AM, Sandy <snickerdoodle08@gmail.com> wrote:
> > Hi,
> >
> > I have a serious problem that I'm not sure how to fix. I have two M/R
> phases
> > that calculates a matrix in parallel. It works... but it's slower than
> the
> > serial version (by about 100 times).
> >
> > Here is a toy program that works similarly to my application. In this
> > example I'm having different random numbers being generated, per given
> line
> > of input, and then creating a n x n matrix that counts how many of these
> > random numbers were shared.
> >
> > 
> > first map phase() {
> > input: key = offset, value = line of text (embedded line number), ln
> > generate k random numbers, k1 .. kn
> > emit: <ki, ln >
> > }
> >
> > first reduce phase() {
> > input: key = ki, value = list(ln)
> > if list size is greater than one:
> > for every 2permutation p:
> > emit: <p, 1>
> > //example: list = 1 2 3
> > //emit: <(1,2), 1>
> > //emit: <(2,3), 1>
> > //emit: <(1,3), 1>
> > }
> >
> > second map phase() {
> > input: key = offset, value = (i, j) 1
> > //dummy function. acts as a transition to reduce
> > parse value into two tokens [(i,j)] and [1]
> > emit: <(i,j), 1>
> > }
> >
> > second reduce() {
> > input: key = (i,j) value = list(1)
> > //wordcount
> > sum up the list of ones
> > emit: <(i,j), sum(list(1))>
> > }
> > 
> >
> > Now here's the problem:
> >
> > Let's suppose the file is 27MB.
> > The time it takes for the first map phase is about 3 minutes.
> > The time it takes for the first reduce phase is about 1 hour.
> > The size of the intermediary files that are produced by this first M/R
> phase
> > is 48GB.
> >
> > The time it takes for the second map phase is 9 hours (and this function
> is
> > just a dummy funtion!!)
> > The time it takes for the second reduce phase is 12 hours
> >
> > I have been trying to change the number of maps and reduce tasks, but
> that
> > doesn't seem to really chip away at the massive number of 2permutations
> > that need to be taken care of in the second M/R phase. At least not on my
> > current machine.
> >
> >
> > Has anyone implemented a matrix in parallel using MapReduce? If so, Is
> this
> > normal or expected behavior? I do realize that I have a small input file,
> > and that this may impact speedup. The most powerful machine I have to run
> > this M/R implementation is a MacPro that has two processors, each with
> four
> > cores, and 4 different hard disks of 1 TB each.
> >
> > Does anyone have any suggestions on what I can change (either with the
> > approach or the cluster setup  do I need more machines?) in order to
> make
> > this faster? I am current running 8 map tasks and 4 reduce tasks. I am
> going
> > to change it 10 map tasks and 9 reduce tasks and see if that helps any,
> but
> > I'm seriously wondering if this is not going to give me much of a change
> > since I only have one machine.
> >
> >
> > Any insight is greatly appreciated.
> >
> > Thanks!
> >
> > SM
> >
>
>
>
> 
> Best regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>
