Date: Sat, 20 Sep 2008 10:29:04 +0900
From: "Edward J. Yoon"
To: core-user@hadoop.apache.org
Subject: Re: no speed-up with parallel matrix calculation

> Has anyone implemented a matrix in parallel using MapReduce?

See this project: http://wiki.apache.org/hama

On Sat, Sep 20, 2008 at 5:00 AM, Sandy wrote:
> Hi,
>
> I have a serious problem that I'm not sure how to fix. I have two M/R phases
> that calculate a matrix in parallel. It works... but it's slower than the
> serial version (by about 100 times).
>
> Here is a toy program that works similarly to my application. In this
> example, different random numbers are generated per line of input, and an
> n x n matrix is then built that counts how many of these random numbers
> were shared.
>
> -------------------
> first map phase() {
>     input: key = offset, value = line of text (embedded line number), ln
>     generate k random numbers, k1 .. kn
>     emit: <ki, ln>
> }
>
> first reduce phase() {
>     input: key = ki, value = list(ln)
>     if list size is greater than one:
>         for every 2-permutation p:
>             emit: <p, 1>
>     //example: list = 1 2 3
>     //emit: <(1,2), 1>
>     //emit: <(2,3), 1>
>     //emit: <(1,3), 1>
> }
>
> second map phase() {
>     input: key = offset, value = (i, j) 1
>     //dummy function. acts as a transition to reduce
>     parse value into two tokens [(i,j)] and [1]
>     emit: <(i,j), 1>
> }
>
> second reduce() {
>     input: key = (i,j), value = list(1)
>     //wordcount
>     sum up the list of ones
>     emit: <(i,j), sum(list(1))>
> }
> ------------------
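For reference, the second phase above is essentially word count over the (i, j)
pairs. A minimal sketch using the org.apache.hadoop.mapred API might look like
the following; it assumes the first job's output arrives as tab-separated lines
of the form "(i,j)<TAB>1", the class and job names are illustrative, and the
combiner is an addition that is not in the original pseudocode:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the second phase: count how often each (i,j) pair was emitted.
public class PairCount {

  // Near-identity map: keep the "(i,j)" token from the line, emit a 1.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Assumed input line: "(i,j)\t1"; keep only the pair token.
      String[] tokens = value.toString().split("\t");
      pair.set(tokens[0]);
      output.collect(pair, ONE);
    }
  }

  // Word-count style reduce: sum the ones for each pair.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PairCount.class);
    conf.setJobName("paircount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(PairMapper.class);
    // Not in the original sketch: a combiner sums counts map-side,
    // which cuts down the data shuffled to the reducers.
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}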
> Now here's the problem:
>
> Let's suppose the file is 27MB.
> The time it takes for the first map phase is about 3 minutes.
> The time it takes for the first reduce phase is about 1 hour.
> The size of the intermediate files produced by this first M/R phase is 48GB.
>
> The time it takes for the second map phase is 9 hours (and this function is
> just a dummy function!!)
> The time it takes for the second reduce phase is 12 hours.
>
> I have been trying to change the number of map and reduce tasks, but that
> doesn't seem to really chip away at the massive number of 2-permutations
> that need to be taken care of in the second M/R phase. At least not on my
> current machine.
>
> Has anyone implemented a matrix in parallel using MapReduce? If so, is this
> normal or expected behavior? I do realize that I have a small input file,
> and that this may impact speedup. The most powerful machine I have to run
> this M/R implementation is a MacPro that has two processors, each with four
> cores, and 4 different hard disks of 1 TB each.
>
> Does anyone have any suggestions on what I can change (either with the
> approach or the cluster setup -- do I need more machines?) in order to make
> this faster? I am currently running 8 map tasks and 4 reduce tasks. I am
> going to change it to 10 map tasks and 9 reduce tasks and see if that helps
> any, but I'm seriously wondering whether this will make much of a difference
> since I only have one machine.
>
> Any insight is greatly appreciated.
>
> Thanks!
>
> -SM

--
Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org