From: Dmitriy Lyubimov <dlieu.7@gmail.com>
To: dev@mahout.apache.org
Date: Sun, 2 Jan 2011 23:52:37 -0800
Subject: Re: [jira] Commented: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

On another note, Sean is absolutely correct: Amazon Elastic MapReduce does
indeed seem to be stuck with 0.20 (or rather, stuck with a particular Hadoop
setup without much flexibility there). I guess moving ahead with the APIs in
Mahout would indeed create problems for whoever is using EMR (I don't).

On Sun, Jan 2, 2011 at 9:32 PM, Dmitriy Lyubimov wrote:

> I would think blockwise multiplication (which, by the way, has a standard
> algorithm in Matrix Computations by Golub and Van Loan) is pretty pointless
> with Mahout, since there is no blockwise matrix format presently, and even
> if there were, no existing algorithms would support it. All the prep utils
> only produce the row-wise format. We could write a routine to "block" the
> input, but that would seem to be an exercise in futility.
>
> The second remark is that blockwise multiplication is also pointless for
> sufficiently sparse matrices. A sum of outer products of columns and rows,
> with intermediate reduction in combiners, is by far the most promising
> approach in terms of shuffle-and-sort I/O: the outer products, when further
> split into columns or rows, are also quite sparse and hence small, while
> the reduction in keyset cardinality is gigantic compared to blockwise
> multiplication. (That said, I never ran a comparison benchmark of the two.
> A small sketch of the outer-product formulation follows after the quoted
> message.)
>
> Note that what the authors are essentially suggesting (even in strategy 4)
> implies explosive growth of the shuffle-and-sort keyset I/O, and what's
> more, they say they never tried it in distributed mode(!). Imagine hundreds
> of machines each sending a copy of their input to a lot of other machines
> in the cluster. Summing outer products avoids broadcasting the input to
> multiple reducers.
>
> On another note, if the input is similarly partitioned (not always the
> case), then map-side multiplication will always be superior to reduce-side
> multiplication in I/O terms, since less data, and especially a much smaller
> keyset cardinality, goes through the sorters. The power of map-side
> operations comes from the notion that yes, we require a lot from the input,
> but no, it's not a lot if the input is already part of a bigger MR
> pipeline.
>
> Finally, back to the 0.20/0.21 issue... I said before in this thread that
> migrating to 0.21 would render Mahout incompatible with the majority of
> production frameworks out there. But after working on the ssvd code, I came
> to think of a compromise: since most production environments run the
> Cloudera distribution, many 0.21 things are supported there, and there is a
> lot of code around written for the new API, which Cloudera has backported.
> It's difficult for me to judge how much of what is in 0.21 Cloudera's
> implementation covers (in fact, I did come across a couple of 0.21 things
> still missing in CDH), but in terms of Hadoop compatibility, I think the
> Mahout project would be best served if it moved on to the new API (i.e.
> 0.21) but did not get ahead of what is supported in CDH3. That would keep
> it on the edge of what is currently practical and deployed. Sitting on the
> old API is, IMO, definitely a drag. My stochastic SVD code uses the new API
> in CDH3, and I would very much not want to backport it to the old API; it
> would not be practical, as everyone out there is on CDH, more so than on
> 0.20.2. (A sketch contrasting the two APIs appears at the end of this
> message.)
>
> -Dmitriy
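To make the outer-product formulation above concrete: for C = A' * B, we have
C = sum over i of a_i * b_i', where a_i and b_i are the i-th rows of A and B,
so each row pair contributes one sparse outer product and nothing else. Below
is a minimal, single-process sketch of that accumulation; the class and method
names are made up for illustration (this is not Mahout's DistributedRowMatrix
code), and plain HashMap-based sparse rows stand in for Mahout's vector
classes.

import java.util.HashMap;
import java.util.Map;

// Sketch: C = A' * B accumulated as a sum of outer products of the i-th
// row of A with the i-th row of B. A matrix is a map from row index to a
// sparse row (column index -> value).
class OuterProductMultiply {

  static Map<Integer, Map<Integer, Double>> transposeTimes(
      Map<Integer, Map<Integer, Double>> a,
      Map<Integer, Map<Integer, Double>> b) {
    Map<Integer, Map<Integer, Double>> c = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> ar : a.entrySet()) {
      Map<Integer, Double> bi = b.get(ar.getKey()); // matching row i of B
      if (bi == null) {
        continue; // an all-zero row of B contributes nothing
      }
      // Accumulate the outer product a_i * b_i' into C. In the MapReduce
      // version a mapper emits these partial rows keyed by row of C, and
      // a combiner performs this same summation map-side.
      for (Map.Entry<Integer, Double> ae : ar.getValue().entrySet()) {
        Map<Integer, Double> cRow =
            c.computeIfAbsent(ae.getKey(), k -> new HashMap<>());
        for (Map.Entry<Integer, Double> be : bi.entrySet()) {
          cRow.merge(be.getKey(), ae.getValue() * be.getValue(), Double::sum);
        }
      }
    }
    return c;
  }
}

For example, with A = [[1,0,3],[0,2,0]] and B = [[4,0],[1,5]], the sketch
yields C = A'B = [[4,0],[2,10],[12,0]]; only nonzero entries are ever touched,
which is exactly why the approach pays off on sufficiently sparse input.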
>>> Some more general remarks: I think the matrix multiplication can be
>>> implemented more efficiently. I've done a multiplication of a sparse
>>> 500k x 15k matrix with around 35 million elements on a quite powerful
>>> cluster of 10 nodes, and it took around 30 minutes. I have no idea of
>>> the performance of the implementation described at
>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html, so I can't
>>> really compare. But IMHO this can be improved (though it's possible that
>>> the poor performance was due to mistakes made by me).
>>
>> I will definitely investigate these methods over the coming days; these
>> look fantastic.
>>
>> Shannon
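For reference on the old-vs-new API point above, here is a minimal contrast
between the 0.20 "stable" API (org.apache.hadoop.mapred) and the new API
(org.apache.hadoop.mapreduce) that CDH backports. The mapper classes are
invented for illustration; the Hadoop types are the actual ones from the two
API packages, and each class would live in its own file.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): interface-based mappers; output goes
// through an OutputCollector and progress reporting through a Reporter.
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value,
                  org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> out,
                  org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    out.collect(value, new IntWritable(1));
  }
}

// New API (org.apache.hadoop.mapreduce): an abstract Mapper class where a
// single Context object replaces OutputCollector/Reporter and also carries
// the job configuration; this is the style the ssvd code is written against.
class NewApiMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}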