Date: Sat, 20 Sep 2008 10:29:04 +0900
From: "Edward J. Yoon"
To: core-user@hadoop.apache.org
Subject: Re: no speed-up with parallel matrix calculation

> Has anyone implemented a matrix in parallel using MapReduce?

See this project: http://wiki.apache.org/hama

On Sat, Sep 20, 2008 at 5:00 AM, Sandy wrote:
> Hi,
>
> I have a serious problem that I'm not sure how to fix. I have two M/R phases
> that calculate a matrix in parallel. It works... but it's slower than the
> serial version (by about 100 times).
>
> Here is a toy program that works similarly to my application. In this
> example, different random numbers are generated per line of input, and an
> n x n matrix is then built that counts how many of these random numbers
> were shared.
>
> -------------------
> first map phase() {
>     input: key = offset, value = line of text (embedded line number), ln
>     generate k random numbers, k1 .. kn
>     emit: <ki, ln>
> }
>
> first reduce phase() {
>     input: key = ki, value = list(ln)
>     if list size is greater than one:
>         for every 2-permutation p:
>             emit: <p, 1>
>     //example: list = 1 2 3
>     //emit: <(1,2), 1>
>     //emit: <(2,3), 1>
>     //emit: <(1,3), 1>
> }
>
> second map phase() {
>     input: key = offset, value = (i, j) 1
>     //dummy function. acts as a transition to reduce
>     parse value into two tokens [(i,j)] and [1]
>     emit: <(i,j), 1>
> }
>
> second reduce() {
>     input: key = (i,j), value = list(1)
>     //wordcount
>     sum up the list of ones
>     emit: <(i,j), sum(list(1))>
> }
> ------------------
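For reference, the second phase above is essentially word count over the (i, j)
pairs. A minimal sketch using the org.apache.hadoop.mapred API might look like
the following; it assumes the first job's output arrives as tab-separated lines
of the form "(i,j)<TAB>1", the class and job names are illustrative, and the
combiner is an addition that is not in the original pseudocode:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the second phase: count how often each (i,j) pair was emitted.
public class PairCount {

  // Near-identity map: keep the "(i,j)" token from the line, emit a 1.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Assumed input line: "(i,j)\t1"; keep only the pair token.
      String[] tokens = value.toString().split("\t");
      pair.set(tokens[0]);
      output.collect(pair, ONE);
    }
  }

  // Word-count style reduce: sum the ones for each pair.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PairCount.class);
    conf.setJobName("paircount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(PairMapper.class);
    // Not in the original sketch: a combiner sums counts map-side,
    // which cuts down the data shuffled to the reducers.
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}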
> Now here's the problem:
>
> Let's suppose the file is 27MB.
> The time it takes for the first map phase is about 3 minutes.
> The time it takes for the first reduce phase is about 1 hour.
> The size of the intermediate files produced by this first M/R phase is 48GB.
>
> The time it takes for the second map phase is 9 hours (and this function is
> just a dummy function!!)
> The time it takes for the second reduce phase is 12 hours.
>
> I have been trying to change the number of map and reduce tasks, but that
> doesn't seem to really chip away at the massive number of 2-permutations
> that need to be taken care of in the second M/R phase. At least not on my
> current machine.
>
> Has anyone implemented a matrix in parallel using MapReduce? If so, is this
> normal or expected behavior? I do realize that I have a small input file,
> and that this may impact speedup. The most powerful machine I have to run
> this M/R implementation is a MacPro that has two processors, each with four
> cores, and 4 different hard disks of 1 TB each.
>
> Does anyone have any suggestions on what I can change (either with the
> approach or the cluster setup -- do I need more machines?) in order to make
> this faster? I am currently running 8 map tasks and 4 reduce tasks. I am
> going to change it to 10 map tasks and 9 reduce tasks and see if that helps
> any, but I'm seriously wondering whether this will make much of a difference
> since I only have one machine.
>
> Any insight is greatly appreciated.
>
> Thanks!
>
> -SM

--
Best regards, Edward J. Yoon
edwardyoon@apache.org
http://blog.udanax.org