From: Dmitriy Lyubimov <dlieu.7@gmail.com>
To: dev@mahout.apache.org
Date: Sun, 2 Jan 2011 23:52:37 -0800
Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

On another note, Sean is absolutely correct: Amazon Elastic MapReduce does
indeed seem to be stuck with 0.20 (or rather, stuck with a particular Hadoop
setup without much flexibility there). I guess moving ahead with the APIs in
Mahout would indeed create problems for whoever is using EMR (I don't).

On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov wrote:

> I would think blockwise multiplication (which, by the way, has a standard
> algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
> with Mahout, since there is no blockwise matrix format presently, and even
> if there were, no existing algorithms would support it. All the prep utils
> only produce the row-wise format. We could write a routine to "block" the
> input, but that would seem to be an exercise in futility.
>
> The second remark is that blockwise multiplication is also pointless for
> sufficiently sparse matrices. A sum of outer products of columns and rows,
> with intermediate reduction in combiners, is by far the most promising
> approach in terms of shuffle-and-sort I/O: the outer products, when further
> split into columns or rows, are also quite sparse and hence small, while
> the reduction in keyset cardinality is gigantic compared to blockwise
> multiplication. (That said, I never ran a comparison benchmark of the two.
> A small sketch of the outer-product formulation follows after the quoted
> message.)
>
> Note that what the authors are essentially suggesting (even in strategy 4)
> implies explosive growth of the shuffle-and-sort keyset I/O, and what's
> more, they say they never tried it in distributed mode(!). Imagine hundreds
> of machines each sending a copy of their input to a lot of other machines
> in the cluster. Summing outer products avoids broadcasting the input to
> multiple reducers.
>
> On another note, if the input is similarly partitioned (not always the
> case), then map-side multiplication will always be superior to reduce-side
> multiplication in I/O terms, since less data, and especially a much smaller
> keyset cardinality, goes through the sorters. The power of map-side
> operations comes from the notion that yes, we require a lot from the input,
> but no, it's not a lot if the input is already part of a bigger MR
> pipeline.
>
> Finally, back to the 0.20/0.21 issue... I said before in this thread that
> migrating to 0.21 would render Mahout incompatible with the majority of
> production frameworks out there. But after working on the ssvd code, I came
> to think of a compromise: since most production environments run the
> Cloudera distribution, many 0.21 things are supported there, and there is a
> lot of code around written for the new API, which Cloudera has backported.
> It's difficult for me to judge how much of what is in 0.21 Cloudera's
> implementation covers (in fact, I did come across a couple of 0.21 things
> still missing in CDH), but in terms of Hadoop compatibility, I think the
> Mahout project would be best served if it moved on to the new API (i.e.
> 0.21) but did not get ahead of what is supported in CDH3. That would keep
> it on the edge of what is currently practical and deployed. Sitting on the
> old API is, IMO, definitely a drag. My stochastic SVD code uses the new API
> in CDH3, and I would very much not want to backport it to the old API; it
> would not be practical, as everyone out there is on CDH, more so than on
> 0.20.2. (A sketch contrasting the two APIs appears at the end of this
> message.)
>
> -Dmitriy
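To make the outer-product formulation above concrete: for C = A' * B, we have
C = sum over i of a_i * b_i', where a_i and b_i are the i-th rows of A and B,
so each row pair contributes one sparse outer product and nothing else. Below
is a minimal, single-process sketch of that accumulation; the class and method
names are made up for illustration (this is not Mahout's DistributedRowMatrix
code), and plain HashMap-based sparse rows stand in for Mahout's vector
classes.

import java.util.HashMap;
import java.util.Map;

// Sketch: C = A' * B accumulated as a sum of outer products of the i-th
// row of A with the i-th row of B. A matrix is a map from row index to a
// sparse row (column index -> value).
class OuterProductMultiply {

  static Map<Integer, Map<Integer, Double>> transposeTimes(
      Map<Integer, Map<Integer, Double>> a,
      Map<Integer, Map<Integer, Double>> b) {
    Map<Integer, Map<Integer, Double>> c = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> ar : a.entrySet()) {
      Map<Integer, Double> bi = b.get(ar.getKey()); // matching row i of B
      if (bi == null) {
        continue; // an all-zero row of B contributes nothing
      }
      // Accumulate the outer product a_i * b_i' into C. In the MapReduce
      // version a mapper emits these partial rows keyed by row of C, and
      // a combiner performs this same summation map-side.
      for (Map.Entry<Integer, Double> ae : ar.getValue().entrySet()) {
        Map<Integer, Double> cRow =
            c.computeIfAbsent(ae.getKey(), k -> new HashMap<>());
        for (Map.Entry<Integer, Double> be : bi.entrySet()) {
          cRow.merge(be.getKey(), ae.getValue() * be.getValue(), Double::sum);
        }
      }
    }
    return c;
  }
}

For example, with A = [[1,0,3],[0,2,0]] and B = [[4,0],[1,5]], the sketch
yields C = A'B = [[4,0],[2,10],[12,0]]; only nonzero entries are ever touched,
which is exactly why the approach pays off on sufficiently sparse input.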
>>> Some more general remarks: I think the matrix multiplication can be
>>> implemented more efficiently. I've done a multiplication of a sparse
>>> 500k x 15k matrix with around 35 million elements on a quite powerful
>>> cluster of 10 nodes, and it took around 30 minutes. I have no idea of
>>> the performance of the implementation described at
>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>> really compare. But IMHO this can be improved (though it's possible that
>>> the poor performance was due to mistakes made by me).
>>
>> I will definitely investigate these methods over the coming days; these
>> look fantastic.
>>
>> Shannon
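For reference on the old-vs-new API point above, here is a minimal contrast
between the 0.20 "stable" API (org.apache.hadoop.mapred) and the new API
(org.apache.hadoop.mapreduce) that CDH backports. The mapper classes are
invented for illustration; the Hadoop types are the actual ones from the two
API packages, and each class would live in its own file.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): interface-based mappers; output goes
// through an OutputCollector and progress reporting through a Reporter.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value,
                  org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> out,
                  org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    out.collect(value, new IntWritable(1));
  }
}

// New API (org.apache.hadoop.mapreduce): an abstract Mapper class where a
// single Context object replaces OutputCollector/Reporter and also carries
// the job configuration; this is the style the ssvd code is written against.
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}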