Message-ID: <257c70550809191300p4485de9bxc4936283772c8120@mail.gmail.com>
Date: Fri, 19 Sep 2008 15:00:17 -0500
From: Sandy <snickerdoodle08@gmail.com>
To: core-user@hadoop.apache.org
Subject: no speed-up with parallel matrix calculation

Hi,

I have a serious problem that I'm not sure how to fix. I have two M/R phases that calculate a matrix in parallel. It works... but it's slower than the serial version (by about 100 times).

Here is a toy program that works similarly to my application. In this example, different random numbers are generated per line of input, and then an n x n matrix is built that counts how many of these random numbers were shared between lines.

-------------------
first map phase() {
  input: key = offset, value = line of text with embedded line number, ln
  generate k random numbers, k1 .. kn
  emit: <ki, ln> for each generated number
}

first reduce phase() {
  input: key = ki, value = list(ln)
  if list size is greater than one:
    for every 2-permutation p:
      emit: <p, 1>
  //example: list = 1 2 3
  //emit: <(1,2), 1>
  //emit: <(2,3), 1>
  //emit: <(1,3), 1>
}

second map phase() {
  input: key = offset, value = (i, j) 1
  //dummy function: acts as a transition to reduce
  parse value into two tokens, [(i,j)] and [1]
  emit: <(i,j), 1>
}

second reduce() {
  input: key = (i,j), value = list(1)
  //wordcount: sum up the list of ones
  emit: <(i,j), sum(list(1))>
}
------------------

Now here's the problem. Suppose the input file is 27 MB:

- The first map phase takes about 3 minutes.
- The first reduce phase takes about 1 hour.
- The intermediate files produced by this first M/R phase total 48 GB.
- The second map phase takes 9 hours (and this function is just a dummy function!).
- The second reduce phase takes 12 hours.

I have been trying to change the number of map and reduce tasks, but that doesn't seem to chip away at the massive number of 2-permutations that need to be handled in the second M/R phase, at least not on my current machine.

Has anyone implemented a matrix calculation in parallel using MapReduce? If so, is this normal or expected behavior? I do realize that I have a small input file, and that this may impact speedup.

The most powerful machine I have to run this M/R implementation is a MacPro with two processors, each with four cores, and 4 hard disks of 1 TB each.

Does anyone have suggestions on what I can change (either the approach or the cluster setup -- do I need more machines?) in order to make this faster? I am currently running 8 map tasks and 4 reduce tasks. I am going to change it to 10 map tasks and 9 reduce tasks to see if that helps, but I'm seriously wondering whether this will make much difference, since I only have one machine.

Any insight is greatly appreciated. Thanks!

-SM
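P.S. For anyone who wants to try it, here is a minimal in-memory sketch of the toy program in plain Python (not actual Hadoop code). It assumes "2-permutation" means each unordered pair emitted once, as the worked example (1,2)/(2,3)/(1,3) shows; all names (first_map, first_reduce, etc.) are made up for illustration, and duplicate line numbers under one key are collapsed for simplicity.

```python
import itertools
import random
from collections import defaultdict

def first_map(line_number, k, max_value, rng):
    # emit one <random_number, line_number> pair per generated number
    return [(rng.randint(1, max_value), line_number) for _ in range(k)]

def first_reduce(line_numbers):
    # every unordered pair of lines that shared this number gets a <(i,j), 1>
    if len(line_numbers) < 2:
        return []
    return [(pair, 1) for pair in itertools.combinations(sorted(set(line_numbers)), 2)]

def second_phase(pairs):
    # identity map + word-count reduce: sum the ones per (i, j) key
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

# driver: simulate the shuffle/sort between the first map and first reduce
rng = random.Random(0)
num_lines, k, max_value = 4, 5, 10
shuffled = defaultdict(list)
for ln in range(num_lines):
    for number, line in first_map(ln, k, max_value, rng):
        shuffled[number].append(line)

intermediate = []
for lines in shuffled.values():
    intermediate.extend(first_reduce(lines))

matrix = second_phase(intermediate)  # sparse n x n co-occurrence counts
```

Note the blow-up this makes visible: a list of m lines sharing one number makes the first reduce emit m*(m-1)/2 pairs, which is consistent with a 27 MB input producing 48 GB of intermediate data.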