mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: GPU, lapack Matrix adaptations
Date Fri, 15 Aug 2014 16:55:24 GMT
Sorry, this should say:

it is true that sparse algebra is by far more compelling than the dense one;


On Fri, Aug 15, 2014 at 9:54 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> As I indicated, I think it is a worthy move. As I said before (including on
> the Spark list), it is true that dense algebra is by far more compelling than
> the dense one; however, there are some considerations that make this work
> very much worthwhile. To sum up my motivations:
>
> (1) Even in the methods already committed, dense multiplications and
> decompositions do happen, and faster dense kernels may actually speed things
> up in certain cases.
>
> (2) Since the main idea is ease of customization, how this helps what's
> already inside should be a fairly minor consideration; potential use matters
> more. I have internally developed methods using that algebra which outnumber
> those present in Mahout. Assuming other power users will do the same (which
> is still largely just a hope at this point), we'd look like cavemen if we
> did not provide jCuda and jBlas bindings.
>
> So that sums up the motivation.
>
> Re: the pull request. That's a good start.
>
> As was mentioned in previous discussions, we lack a cost-based optimizer for
> binary matrix operators of the kind that already exists for vectors.
>
> E.g. we need some sort of generic entry point into matrix-matrix operations
> that selects a specific algorithm based on operand types. For sparse types,
> some algorithms were already added by Ted, but they were never properly
> connected to this decision tree. For dense types, we will probably need to
> run some empirical cost-calibration analysis (i.e. if argument A has native
> multiplication for some type T and argument B does not, will it be faster to
> convert B to native type T and proceed natively, or the other way around,
> given the geometry and number of elements, etc.). IMO this stuff has pretty
> unique architectural opportunities for matrix-centric operations.
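>
> Concretely, such an entry point might look roughly like this (just a sketch;
> CostBasedTimes, the aNative/bNative flags and the calibration constants are
> made-up names, not existing Mahout API):
>
> import org.apache.mahout.math.DenseMatrix;
> import org.apache.mahout.math.Matrix;
>
> public final class CostBasedTimes {
>
>   // These would come from the empirical calibration run suggested above:
>   // roughly how much faster the native kernel is, and the per-element cost
>   // of converting an operand, relative to one multiply-add.
>   private static final double NATIVE_SPEEDUP = 50.0;
>   private static final double CONVERT_COST = 4.0;
>
>   /** Generic entry point for A %*% B that picks an algorithm by operand type. */
>   public static Matrix times(Matrix a, Matrix b, boolean aNative, boolean bNative) {
>     if (aNative && bNative) {
>       return a.times(b);                    // both native already: just go
>     }
>     // Work estimate from operand geometry: m * k * n multiply-adds.
>     double mulWork = (double) a.rowSize() * a.columnSize() * b.columnSize();
>     if (aNative || bNative) {
>       // One operand is native; converting the other one pays off only when
>       // the time saved on the multiplication exceeds the conversion cost.
>       Matrix other = aNative ? b : a;
>       double convertWork = CONVERT_COST * other.rowSize() * other.columnSize();
>       double savedWork = mulWork * (1.0 - 1.0 / NATIVE_SPEEDUP);
>       if (savedWork > convertWork) {
>         // Stand-in for "convert to the native type": copy into a dense matrix.
>         Matrix converted =
>             new DenseMatrix(other.rowSize(), other.columnSize()).assign(other);
>         return aNative ? a.times(converted) : converted.times(b);
>       }
>     }
>     // Fallback: the stock Java path, where the sparse-aware kernels mentioned
>     // above would also hang off this decision tree.
>     return a.times(b);
>   }
> }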
>
> On another note, I don't think it is worthwhile to support LAPACK/CUDA
> operations for vectors.
>
>
>
>
> On Wed, Aug 13, 2014 at 3:39 PM, Anand Avati <avati@gluster.org> wrote:
>
>> On Fri, Jul 18, 2014 at 12:01 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> wrote:
>>
>> > On Fri, Jul 18, 2014 at 11:54 AM, Anand Avati <avati@gluster.org> wrote:
>> >
>> > > On Fri, Jul 18, 2014 at 11:42 AM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>> > > wrote:
>> > >
>> > >
>> > > Coincidentally, I had been imagining/exploring integration with the
>> > > Fortran BLAS behind the in-core DSL using JNI. I had not come across
>> > > these BIDData projects. I'm happy to reorient that effort towards
>> > > exploring them.
>> > >
>> >
>> > Well, it's both: jBlas & jCublas. Shouldn't be too expensive.
>> >
>> > If I had to choose, I'd say integrate jCublas first; it seems to have a
>> > bit of an edge here. We already know from Sebastien's work with jblas
>> > that its integration for sparse methods is not that interesting.
>> >
>> > However, even vector-vector operations over views of GPU-stored data
>> > become somewhat interesting in the context of DSL operators.
>> >
>>
>> FYI, I have been toying with a jBLAS backend for Matrix / Vector (at
>> https://github.com/apache/mahout/pull/44). I started with jBLAS only
>> because I found better documentation. Testing a 1024x1024 matrix
>> multiplication of random numbers on my laptop, I found a solid 56x faster
>> runtime:
>>
>> Run starting. Expected test count is: 1
>> DiscoverySuite:
>> JBlasSuite:
>> Normal multiplication (ms) = 15900
>> jBLAS multiplication (ms) = 284
>> - matrix multiplication
>> Run completed in 16 seconds, 793 milliseconds.
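>>
>> The comparison is essentially of this shape (a rough sketch, not the actual
>> test code from the PR; the class name is made up and timings will vary):
>>
>> import org.apache.mahout.math.DenseMatrix;
>> import org.apache.mahout.math.Matrix;
>> import org.jblas.DoubleMatrix;
>>
>> import java.util.Random;
>>
>> public class MultiplyBench {
>>   public static void main(String[] args) {
>>     int n = 1024;
>>     Random rnd = new Random();
>>
>>     // Stock Mahout dense matrices filled with random values.
>>     Matrix a = new DenseMatrix(n, n);
>>     Matrix b = new DenseMatrix(n, n);
>>     for (int i = 0; i < n; i++) {
>>       for (int j = 0; j < n; j++) {
>>         a.setQuick(i, j, rnd.nextDouble());
>>         b.setQuick(i, j, rnd.nextDouble());
>>       }
>>     }
>>
>>     long t0 = System.currentTimeMillis();
>>     a.times(b);                               // pure-Java multiplication
>>     System.out.println("Normal multiplication (ms) = "
>>         + (System.currentTimeMillis() - t0));
>>
>>     // The same multiplication through jBLAS (native BLAS underneath).
>>     DoubleMatrix ja = DoubleMatrix.rand(n, n);
>>     DoubleMatrix jb = DoubleMatrix.rand(n, n);
>>     t0 = System.currentTimeMillis();
>>     ja.mmul(jb);
>>     System.out.println("jBLAS multiplication (ms) = "
>>         + (System.currentTimeMillis() - t0));
>>   }
>> }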
>>
>>
>> This is a very trivial implementation with only matrix multiplication
>> optimized. Better vector integration is possible along the same lines.
>> However, for deeper integration (e.g. transparent offloading of
>> decompositions into jblas), some restructuring of the API would make things
>> simpler and easier for consumers. What I mean is: instead of the public
>> CholeskyDecomposition(Matrix A) constructor, have a public
>> CholeskyDecomposition choleskyDecompose() method in the Matrix interface.
>> That way JBlasMatrix can transparently substitute its own optimized
>> decomposition code and return it as a subclass of CholeskyDecomposition,
>> roughly as sketched below.
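>>
>> In code, the shape I have in mind is roughly this (a sketch only;
>> Decomposable and DecomposableDenseMatrix are placeholder names, not what is
>> in the PR):
>>
>> import org.apache.mahout.math.CholeskyDecomposition;
>> import org.apache.mahout.math.DenseMatrix;
>>
>> // The method to be added to the Matrix contract; shown on a helper
>> // interface here so the sketch stands alone.
>> interface Decomposable {
>>   CholeskyDecomposition choleskyDecompose();
>> }
>>
>> // Stock dense matrix: delegates to the existing Mahout decomposition.
>> class DecomposableDenseMatrix extends DenseMatrix implements Decomposable {
>>   DecomposableDenseMatrix(int rows, int cols) {
>>     super(rows, cols);
>>   }
>>
>>   @Override
>>   public CholeskyDecomposition choleskyDecompose() {
>>     return new CholeskyDecomposition(this);
>>   }
>> }
>>
>> // A jBLAS-backed matrix would override the same method, run the native
>> // kernel (e.g. org.jblas.Decompose.cholesky) and wrap the result in a
>> // CholeskyDecomposition subclass. Callers never notice; they only write:
>> //
>> //   CholeskyDecomposition chol = m.choleskyDecompose();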
>>
>> Comments/feedback welcome.
>>
>> I also discovered other common-code refactorings that can be done (the
>> iterator and non-zero iterator code, among others, is repeated in many
>> places); I'll send separate PRs for those.
>>
>
>
