Return-Path: X-Original-To: apmail-mahout-dev-archive@www.apache.org Delivered-To: apmail-mahout-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 960101170F for ; Wed, 4 Jun 2014 14:52:19 +0000 (UTC) Received: (qmail 30687 invoked by uid 500); 4 Jun 2014 14:52:19 -0000 Delivered-To: apmail-mahout-dev-archive@mahout.apache.org Received: (qmail 30615 invoked by uid 500); 4 Jun 2014 14:52:19 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 30604 invoked by uid 99); 4 Jun 2014 14:52:19 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Jun 2014 14:52:19 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ted.dunning@gmail.com designates 209.85.223.171 as permitted sender) Received: from [209.85.223.171] (HELO mail-ie0-f171.google.com) (209.85.223.171) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Jun 2014 14:52:13 +0000 Received: by mail-ie0-f171.google.com with SMTP id to1so7316178ieb.2 for ; Wed, 04 Jun 2014 07:51:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=HhDSpkclN5U8lqGtm1XWqa7X8O6PpvySXOqxZtpctSE=; b=ACdNgwbpwCgpdSf4Won6aQXs5drm5bM9XFF7EvYsgXDm9hroJ3FrMTNpwAmcIXLSdn gnPU+SwAyoPYerZq4DrkCBLsmOmLsStlNaQ1It3Ybh7sK828uBln23TYlHKGBrw9VLSz 1TI5W/j0NuU9l80ecZBEtAYZAXk9IwUOEM3G0LIJV/f709zeqZt1PLtQ6hf8G2vlKTwG CX27NHqoDqFz3lr3cqfJWaONZK4Gn5AKUXyIqNk5WZUQavjx7p8uKLF1R5jlriXDhlaD mSbmwgxFJFp4vyrJbh3DVGrau9V1DOm4MDSZcdcNzzy/fMoTqormZ0mYCajSpxVRDDRR lzwQ== X-Received: by 10.50.128.162 with SMTP id np2mr8186074igb.22.1401893513125; Wed, 04 Jun 2014 07:51:53 -0700 (PDT) MIME-Version: 1.0 Received: by 10.64.13.113 with HTTP; Wed, 4 Jun 2014 07:51:23 -0700 (PDT) In-Reply-To: <538ED1F2.5070006@apache.org> References: <538ED1F2.5070006@apache.org> From: Ted Dunning Date: Wed, 4 Jun 2014 07:51:23 -0700 Message-ID: Subject: Re: SparkBindings on a real cluster To: Mahout Dev List , Sebastian Schelter Content-Type: multipart/alternative; boundary=047d7b10ce8fdce6da04fb03c49c X-Virus-Checked: Checked by ClamAV on apache.org --047d7b10ce8fdce6da04fb03c49c Content-Type: text/plain; charset=UTF-8 Great list of issues. On Wed, Jun 4, 2014 at 12:59 AM, Sebastian Schelter wrote: > Hi, > > I did some experimentation with the spark bindings on a real cluster > yesterday, as I had to run some experiments for a paper (unrelated to > Mahout) that I'm currently writing. The experiment basically consists of > multiplying a sparse data matrix by a super-sparse permutation-like matrix > from the left. It took me the whole day to get it working, up to matrices > with 500M entries. > > I ran into lots of issues that we have to fix asap, unfortunately I don't > have much time in the next weeks, so I'm just sharing a list of the issues > that I ran into (maybe I'll find some time to create issues for these > things on the weekend). > > I think the major challenge for us will be to get choice of dense/sparse > correct and put lots of work into memory efficiency. This could be a great > hook for collaborating with the h20 folks, as they know how to make > vector-like data small and computations fast. > > Here's the list: > > * our matrix serialization in MatrixWritable is seriously flawed, I ran > into the following errors > > - the type information is stored with every vector although a matrix > always only contains vectors of the same type > - all entries of a TransposeView (and possibly other views) of a sparse > matrix are serialized, resulting in OOM > - for sparse row matrices, the vectors are set using assign instead of > via constructor injection, this results in huge memory consumption and long > creation times, as in some implementations, binary search is used for > assignment > > * a dense matrix is converted into a SparseRowMatrix with dense row > vectors by blockify(), after serialization this becomes a dense matrix in > sparse format (triggering OOMs)! > > * drmFromHDFS does not have an option to set the number of desired > partitions > > * SparseRowMatrix with sequential vectors times SparseRowMatrix with > sequential vectors is totally broken, it uses three nested loops and uses > get(row, col) on the matrices, which internally uses binary search... > > * At operator adds the column vectors it creates, this is unnecessary as > we don't need the addition, we can just merge the vectors > > * we need a dedicated operator for inCoreA %*% drmB, currently this gets > rewritten to (drmB.t %*%* inCoreA.t).t which is highly inefficient (I have > a prototype of that operator) > > Best, > Sebastian > > > --047d7b10ce8fdce6da04fb03c49c--