Hi Nakul & all the committers,
Till now I am half way through the literature. But, for now a couple of
things to mention, in SVD there are three stages
1. Bidiagonal reduction step
2. Computation of the singular values
3. Computation of the singular vectors
of these three, The* Bidiagonal reduction* step is very expensive, so is
our focus on this( when considering GPU, at times where handling with CPU
is infeasible).
About literature:
- I took some time to go through " A Stable and Fast Algorithm for
Updating the Singular Value Decomposition" by "Gu & Stanley", to understand
the numerical stability and round-off errors when we are partitioning the
matrix in this distributed algorithm. The author has assured that each
component computed will be of high absolute accuracy. And also, the
properties that the resultant matrix support do not have any conflicts with
parent matrix. [pdf
]
- "High performance bidiagonal reduction using the tile algorithms on
homogeneous multicore clusters ", by "Ltaief et. al", this paper has
focused on the first stage mainly and has discussed a good about tile
algorithms and their runtime implementations.(although off-topic, I read
this just to understand.) [pdf
]
- "A distributed and incremental svd algorithm for agglomerative data
analysis on large networks", by "Iwen & Ong", *Please go through* the
(a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
EXPERIMENTS. [pdf ]
Thanks,
Janardhan
On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal wrote:
> Hi Janardhan,
>
> The images you've used as attachments haven't reached my inbox.
> Could you please send them to me directly, rather than through the dev
> mailing list.
> (Or upload it to a image hosting site like imgur and paste the links in the
> email)
>
> I would like to point out that my knowledge of machine learning is limited.
> Still, how would you want to test the algorithm?
>
>
> Sparse matrices in SystemML (in Spark Execution Mode)
> Sparse matrix support in SystemML is implemented at a block level. Each
> (potentially giant) matrix is stored as blocks (in Spark execution mode).
> The matrix itself doesn't store knowledge of whether it is sparse or not.
> Each block does. Each block of this giant matrix can be stored in dense or
> spare format, depending on how many non-zeroes are in that block.
> The sparsity of a block is recomputed every so often, and the block
> representation can be converted from dense to sparse and vice-versa.
> When operations are performed on blocks, internally, a sparse aware
> operation maybe performed, depending on how the blocks are stored.
>
> (One of the other contributors can clarify, if I've missed something or
> have said something wrong).
>
> Given this context, can you please think about whether missing sparse
> matrix support is still a problem?
>
>
> -Nakul
>
>
>
>
>
>
>
> On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
> janardhan.pulivarthi@gmail.com> wrote:
>
> > Hi Nakul,
> >
> > Thanks for explaining me about pros and cons of the two approaches. For
> > now, I have gone through the paper carefully over a couple of days and
> > found the following interesting things.
> >
> > 1. This is the algorithm we implemented.
> >
> >
> > 2. In the above algorithm the input matrix A is approximated to another
> > matrix B with the following relation with the error of chi(p, i) [ as
> shown
> > in (3) ] which the author argues will be in an acceptable limit. So, we
> can
> > go with this algorithm.
> >
> >
> > But, one bad thing is that author is not sure about whether the algorithm
> > supports the sparse matrices or not. So, we may need to test it here.
> >
> > For the time being we need to test the present algorithm implemented by
> > Imran Younus again. Can you help me with the testing?
> >
> > Thanks,
> > Janardhan
> >
> >
> > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal wrote:
> >
> >> Hi Janardhan,
> >>
> >>
> >>
> >>
> >> How will GPU implementation help scale distributed SVD:
> >>
> >> Imran implemented an algorithm he found out about in the paper "A
> >> Distributed and Incremental SVD Algorithm for Agglomerative Data
> Analysis
> >> on Large Networks" (
> >> https://github.com/apache/systemml/pull/273/files#diff-488f0
> >> 6e290f7a54db2e125f7bc608971R27
> >> ).
> >> The idea there was to build up a distributed SVD using invocations of
> svd
> >> on your local machine. He tried to achieve the multi-level parallelism
> >> through the parfor construct.
> >> Each local invocation of svd was done using the Apache Commons Math
> >> library.
> >> If each invocation of this local svd can instead be done on a GPU, the
> >> overall wall time for the distributed version would be decreased.
> >>
> >> Users may not always have a GPU. In that case, the svd falls back to the
> >> Apache Comons Math implementation. But if they do and if we have a "svd"
> >> builtin function, then it would be easier to take advantage of the GPU.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Problem with scalable svd in dml is due to spark backed issues,
> otherwise
> >> there is not problem scaling w/o a local svd():
> >>
> >> There maybe spark backend issues and more may come to light and more
> >> workloads are executed on SystemML.
> >> For any given operation - we can implement it as a DML bodied function
> or
> >> a
> >> builtin function.
> >> For DML Bodied functions:
> >> Pros:
> >> - The SystemML optimizer can be applied to it
> >> - Distribution of SVD is then taken care of by SystemML. It will
> generate
> >> and run the spark primitives needed.
> >>
> >> Cons:
> >> - Implementing SVD, whether in DML or C, is a fair amount of work
> >> - There would not be a straightforward call to the svd gpu library. In
> >> fact, each of the linear algebra primitives would be accelerated on the
> >> GPU, but not the entire operation itself. This would involve many more
> JNI
> >> calls.
> >>
> >> For builtin functions:
> >> Pros:
> >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons
> Math)
> >> can be made, these are already optimized (in case of the GPU)
> >> - If a better SVD implementation is available via a library, that can
> >> easily be plugged in.
> >>
> >> Cons:
> >> - Would have to come up with an algorithm to implement distributed SVD
> >> with
> >> spark primitives
> >>
> >> Pick your battle.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Maybe we could try another algorithm for scalable svd() :
> >>
> >> Sure. But before you do that, it may be worth our while to understand
> what
> >> is exactly misleading about the paper that Imran talks about.
> >>
> >>
> >>
> >>
> >>
> >> -Nakul
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> >> janardhan.pulivarthi@gmail.com> wrote:
> >>
> >> > Hi Nakul,
> >> >
> >> > Can you help me understand how gpu implementations help scale it.
> >> Whether
> >> > the user always have GPUs in use when using this fn or is it an
> optional
> >> > feature.
> >>
> >>
> >> > The problem with implementing the scalable svd() in dml is due to the
> >> spark
> >> > backend issues, otherwise there is no problem scaling w/o a local
> svd()
> >> > function.
> >> >
> >> > May be we could try another algorithm for the scalable svd( ), if the
> >> > present algorithm is misleading as Imran Younus pointed out.
> >> >
> >> > Thank you,
> >> > Janardhan
> >> >
> >>
> >
> >
>