> Has anyone implemented a matrix in parallel using MapReduce?
See this project : http://wiki.apache.org/hama
On Sat, Sep 20, 2008 at 5:00 AM, Sandy <snickerdoodle08@gmail.com> wrote:
> Hi,
>
> I have a serious problem that I'm not sure how to fix. I have two M/R phases
> that calculates a matrix in parallel. It works... but it's slower than the
> serial version (by about 100 times).
>
> Here is a toy program that works similarly to my application. In this
> example I'm having different random numbers being generated, per given line
> of input, and then creating a n x n matrix that counts how many of these
> random numbers were shared.
>
> 
> first map phase() {
> input: key = offset, value = line of text (embedded line number), ln
> generate k random numbers, k1 .. kn
> emit: <ki, ln >
> }
>
> first reduce phase() {
> input: key = ki, value = list(ln)
> if list size is greater than one:
> for every 2permutation p:
> emit: <p, 1>
> //example: list = 1 2 3
> //emit: <(1,2), 1>
> //emit: <(2,3), 1>
> //emit: <(1,3), 1>
> }
>
> second map phase() {
> input: key = offset, value = (i, j) 1
> //dummy function. acts as a transition to reduce
> parse value into two tokens [(i,j)] and [1]
> emit: <(i,j), 1>
> }
>
> second reduce() {
> input: key = (i,j) value = list(1)
> //wordcount
> sum up the list of ones
> emit: <(i,j), sum(list(1))>
> }
> 
>
> Now here's the problem:
>
> Let's suppose the file is 27MB.
> The time it takes for the first map phase is about 3 minutes.
> The time it takes for the first reduce phase is about 1 hour.
> The size of the intermediary files that are produced by this first M/R phase
> is 48GB.
>
> The time it takes for the second map phase is 9 hours (and this function is
> just a dummy funtion!!)
> The time it takes for the second reduce phase is 12 hours
>
> I have been trying to change the number of maps and reduce tasks, but that
> doesn't seem to really chip away at the massive number of 2permutations
> that need to be taken care of in the second M/R phase. At least not on my
> current machine.
>
>
> Has anyone implemented a matrix in parallel using MapReduce? If so, Is this
> normal or expected behavior? I do realize that I have a small input file,
> and that this may impact speedup. The most powerful machine I have to run
> this M/R implementation is a MacPro that has two processors, each with four
> cores, and 4 different hard disks of 1 TB each.
>
> Does anyone have any suggestions on what I can change (either with the
> approach or the cluster setup  do I need more machines?) in order to make
> this faster? I am current running 8 map tasks and 4 reduce tasks. I am going
> to change it 10 map tasks and 9 reduce tasks and see if that helps any, but
> I'm seriously wondering if this is not going to give me much of a change
> since I only have one machine.
>
>
> Any insight is greatly appreciated.
>
> Thanks!
>
> SM
>

Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org
