systemml-dev mailing list archives

From Matthias Boehm <mboe...@gmail.com>
Subject Re: distributed cholesky on systemml
Date Tue, 24 Apr 2018 03:58:15 GMT
you're welcome. Let me just make some additional remarks for other
people who might come across this thread. First, please don't
extrapolate characteristics from an experimental script and
unoptimized operations such as cholesky to SystemML in general.
Second, for compute-intensive algorithms, it's usually a good idea for
performance to leverage native BLAS libraries and our GPU backend
which are still disabled by default. Third, while data-parallel
operations indeed translate to individual RDD operations, there are
also language constructs like parfor (parallel for loops), where the
entire loop with all its iterations, nested loops and function calls
may be mapped to a single Spark job.
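
To make the parfor point concrete, here is a minimal DML sketch (the
data, sizes, and per-block computation are made up for illustration);
since the iterations are independent, the optimizer may compile the
entire loop into a single Spark job:

  X = rand(rows=1000000, cols=100)
  R = matrix(0, rows=10, cols=1)
  parfor (i in 1:10) {
    Xi = X[((i-1)*100000+1):(i*100000), ]  # independent row partitions
    R[i,1] = sum(t(Xi) %*% Xi)             # per-partition compute
  }
  print(sum(R))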

Regards,
Matthias


On Mon, Apr 23, 2018 at 12:42 AM, Qifan Pu <qifan.pu@gmail.com> wrote:
> You are right. Switching to Spark mode makes it even slower...
> It seems that each operator generates a separate Spark task, converting
> things into RDD operations.
> Thanks so much for the patience and detailed instructions. I have a much
> better understanding of the system now.
>
> On Sun, Apr 22, 2018 at 7:47 PM, Matthias Boehm <mboehm7@gmail.com> wrote:
>>
>> well, SystemML decides the execution type of each operation based on
>> its worst-case memory estimate and the available driver memory budget
>> to avoid unnecessary overheads for distributed operations. Maybe the
>> size of the matrix that is fed into cholesky is smaller than the
>> intermediates used for data generation? Btw, this is a characteristic
>> we often see for other scripts (e.g., t(X)%*%X in LinearRegDS
>> potentially runs over large matrices whereas the subsequent solve only
>> works over a features-by-features matrix) - this was the reason why
>> distributed operations for builtin functions like solve and cholesky
>> had low priority in the past.
>>
>> If you're interested in understanding the execution plan of your
>> scenarios better, you can use '-explain' or '-explain hops' to see the
>> memory budgets, estimates, and selected execution types. Also, if you
>> want to force all operations to spark, you can do so via '-exec
>> spark'. However, note that this should be done only for debugging
>> because forcing all operations to spark can be very counterproductive
>> for performance.
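>>
>> For example, a hypothetical invocation (the script name is just a
>> placeholder) could look like: 'spark-submit SystemML.jar -f
>> cholesky_test.dml -explain hops -exec spark'.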
>>
>> Regards,
>> Matthias
>>
>> On Sun, Apr 22, 2018 at 5:07 PM, Qifan Pu <qifan.pu@gmail.com> wrote:
>> > and everything before (e.g., I generate the matrix using another DML
>> > script) was indeed run by Spark and shows up on the UI.
>> >
>> > On Sun, Apr 22, 2018 at 5:05 PM, Qifan Pu <qifan.pu@gmail.com> wrote:
>> >>
>> >> Thanks Jeremy and Matthias. When I run the script, the cholesky or the
>> >> inv is executed completely on the driver, and nothing shows up on the
>> >> Spark UI. Is that the expected behavior?
>> >>
>> >> On Sun, Apr 22, 2018 at 3:34 PM, Jeremy Nilmeier <nilmeier@us.ibm.com>
>> >> wrote:
>> >>>
>> >>> Yes, I also spoke with Sasha about this some time last year.  Thanks
>> >>> for
>> >>> following up.
>> >>>
>> >>> Cheers, J
>> >>>
>> >>>
>> >>> Jerome Nilmeier, PhD
>> >>> Data Scientist and Engineer
>> >>> IBM Spark Technology Center
>> >>> http://www.spark.tc/
>> >>>
>> >>>
>> >>>
>> >>> ----- Original message -----
>> >>> From: Matthias Boehm <mboehm7@gmail.com>
>> >>> To: dev@systemml.apache.org
>> >>> Cc: Qifan Pu <qifan.pu@gmail.com>, Jeremy Nilmeier
>> >>> <nilmeier@us.ibm.com>
>> >>> Subject: Re: distributed cholesky on systemml
>> >>> Date: Sun, Apr 22, 2018 2:41 PM
>> >>>
>> >>> thanks for the context Jeremy - that helps. I also had an offline
>> >>> conversation with Sasha and he pointed me to a script that does exactly
>> >>> that (iterative invert_lower_triangular) combined with a parfor over
>> >>> independent blocks. We'll merge these scripts soon and I'll reach out
>> >>> individually as necessary. Thanks everybody for now.
>> >>>
>> >>> Regards,
>> >>> Matthias
>> >>>
>> >>> On Sun, Apr 22, 2018 at 12:40 PM, Jeremy Nilmeier
>> >>> <nilmeier@us.ibm.com>
>> >>> wrote:
>> >>> > This may be a duplicate...it was bounced from the dev list.
>> >>> >
>> >>> > I think that scalable triangular inverse will also have similar
>> >>> > properties,
>> >>> > in that there is a sequential approach if it uses back substitution.
>> >>> >
>> >>> > For most of these algorithms (LU, Cholesky, QR), they are inherently
>> >>> > sequential, and the focus of the work is on minimizing interprocess
>> >>> > communication during the operations, which may explain why there was
>> >>> > only limited interest in pursuing this further.
>> >>> >
>> >>> > I had originally recommended that the recursive algorithms be
>> >>> > rewritten as iterative algorithms (and in fact provided an example of
>> >>> > the LU in iterative form), which would make the counting of operations
>> >>> > more transparent, as well as revealing possible parallelization points.
>> >>> >
>> >>> > Cheers, J
>> >>> > Jerome Nilmeier, PhD
>> >>> > Data Scientist and Engineer
>> >>> > IBM Spark Technology Center
>> >>> >
>> >>> >
>> >>> > http://www.spark.tc/
>> >>>
>> >>> >
>> >>> >
>> >>> >
>> >>> > ----- Original message -----
>> >>> > From: Matthias Boehm <mboehm7@gmail.com>
>> >>> > To: dev@systemml.apache.org
>> >>> > Cc: Qifan Pu <qifan.pu@gmail.com>
>> >>> > Subject: Re: distributed cholesky on systemml
>> >>> > Date: Sun, Apr 22, 2018 1:21 AM
>> >>> >
>> >>> > sure no problem - thanks again for catching this issue that was
>> >>> > hidden
>> >>> > for a while.
>> >>> >
>> >>> > Yes, the same depth-first characteristic applies to the Cholesky
>> >>> > function as well. In contrast to U_triangular_inv, however, there
>> >>> > are
>> >>> > data dependencies between the blocks per level (at least in the
>> >>> > current algorithm formulation), which means we cannot use the
>> >>> > approach
>> >>> > I described for U_triangular_inv.
>> >>> >
>> >>> > L11 = Cholesky(A11, nb)             # factorize top-left block
>> >>> > A22 = ... U_triangular_inv(t(L11))  # update of A22 depends on L11
>> >>> > L22 = Cholesky(A22, nb)             # factorizing A22 depends on the update
>> >>> >
>> >>> > However, note that there are far fewer calls to Cholesky due to the
>> >>> > switch to the builtin cholesky according to the given min block size.
>> >>> > For example, in our new test for dimensions 1362 x 1362 and min size
>> >>> > of 200, we call Cholesky 15 times but U_triangular_inv 2539 times
>> >>> > (15 is consistent with a binary recursion tree that halves 1362 until
>> >>> > blocks fall below 200: 1 + 2 + 4 + 8 = 15 calls).
>> >>> >
>> >>> > For a sufficiently large min block size this might be ok for Cholesky,
>> >>> > because each level also does a number of matrix multiplies that will
>> >>> > exploit the available parallelism of your cluster. In that regard, you
>> >>> > might want to experiment with different block sizes and driver memory
>> >>> > budgets. If I get a chance, I will also run a number of experiments
>> >>> > and see if we can rewrite these scripts.
>> >>> >
>> >>> > Regards,
>> >>> > Matthias
>> >>> >
>> >>> > On Sun, Apr 22, 2018 at 12:48 AM, Qifan Pu <qifan.pu@gmail.com>
>> >>> > wrote:
>> >>> >> Matthias,
>> >>> >>
>> >>> >> Thanks so much for taking the time to fix this. Really appreciate it.
>> >>> >> Does the same reasoning apply to the cholesky script? The recursive
>> >>> >> approach
>> >>> >> also looks inherently sequential.
>> >>> >>
>> >>> >> Best,
>> >>> >> Qifan
>> >>> >>
>> >>> >> On Sat, Apr 21, 2018 at 11:39 PM, Matthias Boehm
>> >>> >> <mboehm7@gmail.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> just as a quick update: this issue has now been fixed in SystemML
>> >>> >>> master - it was essentially a missing guard for recursive functions
>> >>> >>> when checking for unary size-preserving functions during
>> >>> >>> inter-procedural analysis (IPA).
>> >>> >>>
>> >>> >>> However, while working with this recursive cholesky function I came
>> >>> >>> to the conclusion that it may need some rework. The current
>> >>> >>> top-down, depth-first approach is inherently sequential. This is
>> >>> >>> partially unnecessary because, for the recursive function
>> >>> >>> U_triangular_inv (which is called many more times than cholesky),
>> >>> >>> the blocks per level are independent. Therefore, we should look into
>> >>> >>> a bottom-up, breadth-first approach to parallelize over the blocks
>> >>> >>> in each level, which could be done via parfor at script level, as
>> >>> >>> sketched below.
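>> >>> >>>
>> >>> >>> As a rough sketch of that idea (assuming, for illustration only,
>> >>> >>> nBlocks diagonal blocks of size blksz in an upper-triangular U):
>> >>> >>>
>> >>> >>>   # invert the independent diagonal blocks of one level in parallel
>> >>> >>>   parfor (i in 1:nBlocks) {
>> >>> >>>     lo = (i-1) * blksz + 1;
>> >>> >>>     up = i * blksz;
>> >>> >>>     Uinv[lo:up, lo:up] = U_triangular_inv(U[lo:up, lo:up]);
>> >>> >>>   }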
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Matthias
>> >>> >>>
>> >>> >>> On Sat, Apr 21, 2018 at 6:59 PM, Matthias Boehm
>> >>> >>> <mboehm7@gmail.com>
>> >>> >>> wrote:
>> >>> >>> > thanks for catching this - I just ran a toy example and this seems
>> >>> >>> > to be a rewrite issue (there are specific right indexing rewrites
>> >>> >>> > that collapse U[1:k,1:k] and U[1:k,k+1:n] into a single access to U,
>> >>> >>> > which helps for large distributed matrices). As a workaround, you
>> >>> >>> > can set "sysml.optlevel" to 1 (instead of the default 2; level 1
>> >>> >>> > disables all rewrites), which worked fine for me. I'll fix this
>> >>> >>> > later today. Also, I'll fix the naming from "Choleskey" to
>> >>> >>> > "Cholesky". Thanks again.
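>> >>> >>> >
>> >>> >>> > (For reference: if I remember correctly, the optimization level can
>> >>> >>> > be set via an entry like <sysml.optlevel>1</sysml.optlevel> in
>> >>> >>> > SystemML-config.xml, passed to SystemML with the -config option.)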
>> >>> >>> >
>> >>> >>> > Regards,
>> >>> >>> > Matthias
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > On Sat, Apr 21, 2018 at 6:28 PM, Qifan Pu <qifan.pu@gmail.com>
>> >>> >>> > wrote:
>> >>> >>> >> Hi Matthias,
>> >>> >>> >>
>> >>> >>> >> Thanks for the fast response and detailed information. This is
>> >>> >>> >> really helpful.
>> >>> >>> >>
>> >>> >>> >> I just tried to run it, and was tracing down an indexing bug that
>> >>> >>> >> can be reproduced by simply running the test script for triangular
>> >>> >>> >> solve [1]:
>> >>> >>> >> Caused by: org.apache.sysml.runtime.DMLRuntimeException: Invalid
>> >>> >>> >> values for matrix indexing: [1667:3333,1:1666] must be within
>> >>> >>> >> matrix dimensions [1000,1000]
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Am I missing some configuration here?
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> [1]
>> >>> >>> >> https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/test/test_triangular_inv.dml
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Best,
>> >>> >>> >> Qifan
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> On Sat, Apr 21, 2018 at 4:06 PM, Matthias Boehm
>> >>> >>> >> <mboehm7@gmail.com>
>> >>> >>> >> wrote:
>> >>> >>> >>>
>> >>> >>> >>> Hi Qifan,
>> >>> >>> >>>
>> >>> >>> >>> thanks for your feedback. You're right, the builtin functions
>> >>> >>> >>> cholesky, inverse, eigen, solve, svd, qr, and lu are currently
>> >>> >>> >>> only supported as single-node operations because they're still
>> >>> >>> >>> implemented via Apache commons.math.
>> >>> >>> >>>
>> >>> >>> >>> However, there is an experimental script for distributed
>> >>> >>> >>> cholesky [1] which uses a recursive approach (with operations
>> >>> >>> >>> that allow for automatic distributed computation) for matrices
>> >>> >>> >>> larger than a user-defined block size. Once blocks become small
>> >>> >>> >>> enough, we again use the builtin cholesky. Graduating this
>> >>> >>> >>> script would require a broader set of experiments (and potential
>> >>> >>> >>> improvements), but it simply did not have the highest priority
>> >>> >>> >>> so far. You might want to give it a try though.
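>> >>> >>> >>>
>> >>> >>> >>> A minimal sketch of how to try it (the path, input matrix, and
>> >>> >>> >>> block size are placeholders, not a tested setup):
>> >>> >>> >>>
>> >>> >>> >>>   source("scripts/staging/scalable_linalg/cholesky.dml") as chol
>> >>> >>> >>>   X = rand(rows=10000, cols=10000)
>> >>> >>> >>>   # make the input symmetric positive definite
>> >>> >>> >>>   A = X %*% t(X) + diag(matrix(10000, rows=10000, cols=1))
>> >>> >>> >>>   L = chol::Cholesky(A, 1000)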
>> >>> >>> >>>
>> >>> >>> >>> Thanks again for your feedback - we'll consider a higher
>> >>> >>> >>> priority for these distributed operations when discussing the
>> >>> >>> >>> roadmap for the next releases.
>> >>> >>> >>>
>> >>> >>> >>> [1]
>> >>> >>> >>> https://github.com/apache/systemml/blob/master/scripts/staging/scalable_linalg/cholesky.dml
>> >>> >>> >>>
>> >>> >>> >>> Regards,
>> >>> >>> >>> Matthias
>> >>> >>> >>>
>> >>> >>> >>> On Sat, Apr 21, 2018 at 2:15 PM, Qifan Pu <qifan.pu@gmail.com>
>> >>> >>> >>> wrote:
>> >>> >>> >>> > Hi,
>> >>> >>> >>> >
>> >>> >>> >>> > I would love to do distributed cholesky on a large matrix with
>> >>> >>> >>> > SystemML. I found two related JIRAs (SYSTEMML-1213,
>> >>> >>> >>> > SYSTEMML-1163), but AFAIK this is currently not implemented?
>> >>> >>> >>> > I just wanted to check.
>> >>> >>> >>> >
>> >>> >>> >>> > Best,
>> >>> >>> >>> > Qifan
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>
>
