Message-ID: <257c70550809191300p4485de9bxc4936283772c8120@mail.gmail.com>
Date: Fri, 19 Sep 2008 15:00:17 -0500
From: Sandy <snickerdoodle08@gmail.com>
To: core-user@hadoop.apache.org
Subject: no speed-up with parallel matrix calculation

Hi,

I have a serious problem that I'm not sure how to fix. I have two M/R phases that calculate a matrix in parallel. It works... but it's slower than the serial version (by about 100 times).

Here is a toy program that works similarly to my application. In this example, different random numbers are generated per line of input, and then an n x n matrix is built that counts how many of these random numbers were shared between lines.

-------------------
first map phase() {
  input: key = offset, value = line of text with embedded line number, ln
  generate k random numbers, k1 .. kn
  emit: <ki, ln> for each generated number
}

first reduce phase() {
  input: key = ki, value = list(ln)
  if list size is greater than one:
    for every 2-permutation p:
      emit: <p, 1>
  //example: list = 1 2 3
  //emit: <(1,2), 1>
  //emit: <(2,3), 1>
  //emit: <(1,3), 1>
}

second map phase() {
  input: key = offset, value = (i, j) 1
  //dummy function: acts as a transition to reduce
  parse value into two tokens, [(i,j)] and [1]
  emit: <(i,j), 1>
}

second reduce() {
  input: key = (i,j), value = list(1)
  //wordcount: sum up the list of ones
  emit: <(i,j), sum(list(1))>
}
------------------

Now here's the problem. Suppose the input file is 27 MB:

- The first map phase takes about 3 minutes.
- The first reduce phase takes about 1 hour.
- The intermediate files produced by this first M/R phase total 48 GB.
- The second map phase takes 9 hours (and this function is just a dummy function!).
- The second reduce phase takes 12 hours.

I have been trying to change the number of map and reduce tasks, but that doesn't seem to chip away at the massive number of 2-permutations that need to be handled in the second M/R phase, at least not on my current machine.

Has anyone implemented a matrix calculation in parallel using MapReduce? If so, is this normal or expected behavior? I do realize that I have a small input file, and that this may impact speedup.

The most powerful machine I have to run this M/R implementation is a MacPro with two processors, each with four cores, and 4 hard disks of 1 TB each.

Does anyone have suggestions on what I can change (either the approach or the cluster setup -- do I need more machines?) in order to make this faster? I am currently running 8 map tasks and 4 reduce tasks. I am going to change it to 10 map tasks and 9 reduce tasks to see if that helps, but I'm seriously wondering whether this will make much difference, since I only have one machine.

Any insight is greatly appreciated. Thanks!

-SM
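P.S. For anyone who wants to try it, here is a minimal in-memory sketch of the toy program in plain Python (not actual Hadoop code). It assumes "2-permutation" means each unordered pair emitted once, as the worked example (1,2)/(2,3)/(1,3) shows; all names (first_map, first_reduce, etc.) are made up for illustration, and duplicate line numbers under one key are collapsed for simplicity.

```python
import itertools
import random
from collections import defaultdict

def first_map(line_number, k, max_value, rng):
    # emit one <random_number, line_number> pair per generated number
    return [(rng.randint(1, max_value), line_number) for _ in range(k)]

def first_reduce(line_numbers):
    # every unordered pair of lines that shared this number gets a <(i,j), 1>
    if len(line_numbers) < 2:
        return []
    return [(pair, 1) for pair in itertools.combinations(sorted(set(line_numbers)), 2)]

def second_phase(pairs):
    # identity map + word-count reduce: sum the ones per (i, j) key
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return dict(counts)

# driver: simulate the shuffle/sort between the first map and first reduce
rng = random.Random(0)
num_lines, k, max_value = 4, 5, 10
shuffled = defaultdict(list)
for ln in range(num_lines):
    for number, line in first_map(ln, k, max_value, rng):
        shuffled[number].append(line)

intermediate = []
for lines in shuffled.values():
    intermediate.extend(first_reduce(lines))

matrix = second_phase(intermediate)  # sparse n x n co-occurrence counts
```

Note the blow-up this makes visible: a list of m lines sharing one number makes the first reduce emit m*(m-1)/2 pairs, which is consistent with a 27 MB input producing 48 GB of intermediate data.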