hadoop-common-user mailing list archives

From Sandy <snickerdoodl...@gmail.com>
Subject no speed-up with parallel matrix calculation
Date Fri, 19 Sep 2008 20:00:17 GMT
Hi,

I have a serious problem that I'm not sure how to fix. I have two M/R phases
that calculate a matrix in parallel. It works... but it's about 100 times
slower than the serial version.

Here is a toy program that works similarly to my application. In this
example, random numbers are generated for each line of input, and an n x n
matrix is then built that counts how many of these random numbers are shared
between lines.

-------------------
first map phase() {
input: key = offset, value = line of text (with embedded line number ln)
generate k random numbers r1 .. rk
for each ri, emit: <ri, ln>
}

first reduce phase() {
input: key = ri, value = list(ln)
if the list has more than one line number:
   for every unordered pair p of line numbers:
      emit: <p, 1>
   //example: list = 1 2 3
   //emit: <(1,2), 1>
   //emit: <(2,3), 1>
   //emit: <(1,3), 1>
}

second map phase() {
input: key = offset, value = line of text "(i,j) 1"
//dummy function; just passes the data through to the reduce
parse value into two tokens, [(i,j)] and [1]
emit: <(i,j), 1>
}

second reduce phase() {
input: key = (i,j), value = list(1)
//wordcount-style
sum up the list of ones
emit: <(i,j), sum(list(1))>
}
------------------
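
For concreteness, here is roughly what the first phase looks like in Java
against the old org.apache.hadoop.mapred API. This is just a sketch, not my
exact code; the class names, the fixed k, and the value range are made up for
illustration, and both classes are shown in one file for brevity.

-------------------
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// First map phase: for each input line, emit <random number, line number>.
public class RandomNumberMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, LongWritable> {

  private static final int K = 5;            // random numbers per line (made up)
  private static final int RANGE = 1000000;  // range they are drawn from (made up)
  private final Random random = new Random();

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    // My real code parses the line number ln embedded in the text;
    // the byte offset stands in for it here.
    long ln = offset.get();
    for (int i = 0; i < K; i++) {
      output.collect(new IntWritable(random.nextInt(RANGE)), new LongWritable(ln));
    }
  }
}

// First reduce phase: for every random number seen on 2 or more lines,
// emit <"(i,j)", 1> for each unordered pair of those lines.
class SharedPairReducer extends MapReduceBase
    implements Reducer<IntWritable, LongWritable, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  public void reduce(IntWritable randomNumber, Iterator<LongWritable> lines,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    List<Long> lns = new ArrayList<Long>();
    while (lines.hasNext()) {
      lns.add(lines.next().get());
    }
    // This double loop over all pairs is where the data blows up.
    for (int a = 0; a < lns.size(); a++) {
      for (int b = a + 1; b < lns.size(); b++) {
        output.collect(new Text("(" + lns.get(a) + "," + lns.get(b) + ")"), ONE);
      }
    }
  }
}
-------------------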

Now here's the problem:

Let's suppose the input file is 27MB.
The first map phase takes about 3 minutes.
The first reduce phase takes about 1 hour.
The intermediate files produced by this first M/R phase total 48GB.

The second map phase takes 9 hours (and this function is just a dummy
function!!).
The second reduce phase takes 12 hours.
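
To put that blow-up in perspective (assuming each emitted <(i,j), 1> record is
on the order of 20 bytes on disk, which is just a guess):

-------------------
48 GB / 27 MB            ~= 1,800x growth from input to intermediate data
48 GB / ~20 B per record ~= 2-3 billion <(i,j), 1> records
                            that the dummy second map phase has to read,
                            parse, and re-emit
-------------------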

I have been trying to change the number of map and reduce tasks, but that
doesn't seem to really chip away at the massive number of pairs that have to
be handled in the second M/R phase. At least not on my current machine.


Has anyone implemented a matrix computation in parallel using MapReduce? If
so, is this normal or expected behavior? I do realize that I have a small
input file, and that this may limit the speedup. The most powerful machine I
have to run this M/R implementation is a Mac Pro with two quad-core
processors and four 1TB hard disks.

Does anyone have any suggestions on what I can change (either the approach or
the cluster setup -- do I need more machines?) to make this faster? I am
currently running 8 map tasks and 4 reduce tasks. I am going to change that
to 10 map tasks and 9 reduce tasks and see if it helps, but I'm seriously
wondering whether this will make much of a difference, since I only have one
machine.
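
For reference, the task counts are set in the job driver, roughly like this.
Again a minimal sketch: the job name and paths are placeholders, and
setNumMapTasks is only a hint, since the input splits ultimately decide how
many maps actually run.

-------------------
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Phase1Driver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Phase1Driver.class);
    conf.setJobName("matrix-phase1");  // placeholder name

    conf.setMapperClass(RandomNumberMapper.class);
    conf.setReducerClass(SharedPairReducer.class);

    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(LongWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // This is what I have been varying (currently 8 maps, 4 reduces).
    conf.setNumMapTasks(8);
    conf.setNumReduceTasks(4);

    FileInputFormat.setInputPaths(conf, new Path("input"));        // placeholder path
    FileOutputFormat.setOutputPath(conf, new Path("phase1-out"));  // placeholder path

    JobClient.runJob(conf);
  }
}
-------------------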


Any insight is greatly appreciated.

Thanks!

-SM
