Hi,
I have a serious problem that I'm not sure how to fix. I have two M/R phases
that calculates a matrix in parallel. It works... but it's slower than the
serial version (by about 100 times).
Here is a toy program that works similarly to my application. In this
example I'm having different random numbers being generated, per given line
of input, and then creating a n x n matrix that counts how many of these
random numbers were shared.

first map phase() {
input: key = offset, value = line of text (embedded line number), ln
generate k random numbers, k1 .. kn
emit: <ki, ln >
}
first reduce phase() {
input: key = ki, value = list(ln)
if list size is greater than one:
for every 2permutation p:
emit: <p, 1>
//example: list = 1 2 3
//emit: <(1,2), 1>
//emit: <(2,3), 1>
//emit: <(1,3), 1>
}
second map phase() {
input: key = offset, value = (i, j) 1
//dummy function. acts as a transition to reduce
parse value into two tokens [(i,j)] and [1]
emit: <(i,j), 1>
}
second reduce() {
input: key = (i,j) value = list(1)
//wordcount
sum up the list of ones
emit: <(i,j), sum(list(1))>
}

Now here's the problem:
Let's suppose the file is 27MB.
The time it takes for the first map phase is about 3 minutes.
The time it takes for the first reduce phase is about 1 hour.
The size of the intermediary files that are produced by this first M/R phase
is 48GB.
The time it takes for the second map phase is 9 hours (and this function is
just a dummy funtion!!)
The time it takes for the second reduce phase is 12 hours
I have been trying to change the number of maps and reduce tasks, but that
doesn't seem to really chip away at the massive number of 2permutations
that need to be taken care of in the second M/R phase. At least not on my
current machine.
Has anyone implemented a matrix in parallel using MapReduce? If so, Is this
normal or expected behavior? I do realize that I have a small input file,
and that this may impact speedup. The most powerful machine I have to run
this M/R implementation is a MacPro that has two processors, each with four
cores, and 4 different hard disks of 1 TB each.
Does anyone have any suggestions on what I can change (either with the
approach or the cluster setup  do I need more machines?) in order to make
this faster? I am current running 8 map tasks and 4 reduce tasks. I am going
to change it 10 map tasks and 9 reduce tasks and see if that helps any, but
I'm seriously wondering if this is not going to give me much of a change
since I only have one machine.
Any insight is greatly appreciated.
Thanks!
SM
