Well, these are all very interesting questions - let's discuss them
separately:
1) Average/worst-case formulas in SystemML: The referenced formulas in our
code base follow fairly simple and intuitive patterns.
* For the average-case estimate, we assume nonzeros are uniformly
distributed and hence compute the output sparsity as the probability that
an output cell is nonzero, that is, 1 minus the probability that it is
zero. For a single multiply-add, the probability of a zero is (1 - sp1*sp2),
and hence (1 - sp1*sp2)^k for the entire dot product over the common
dimension of length k (assuming the output is zero if and only if all
products are zero), which gives an output sparsity of 1 - (1 - sp1*sp2)^k.
Due to the assumption of uniformity, the output dimensions are irrelevant
here.
* For the worst-case estimates as actually used within SystemML's memory
estimates (in order to guarantee that we do not run out of memory), we
however assume a worst-case structure, which is a column vector of
nonzeros in the left-hand side aligned with a row vector of nonzeros in the
right-hand side. This gives us min(1, nnz1/m) * min(1, nnz2/n) for an
m x n output, which is obviously very conservative.
We introduced these formulas in our optimizer paper [1], and other groups
have either used (and extended) or at least referenced exactly the same
formulas [2, 3].
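In code, the two estimates boil down to something like the following (a
minimal sketch; the method names are illustrative, not the actual
OptimizerUtils signatures):

  // average case: uniformly distributed nonzeros; an output cell is nonzero
  // unless all k products along the dot product are zero
  static double estimAvgCaseSparsity(double sp1, double sp2, long k) {
    return 1 - Math.pow(1 - sp1 * sp2, k);
  }

  // worst case: a column vector of nonzeros in the lhs aligned with a row
  // vector of nonzeros in the rhs, for an m x n output
  static double estimWorstCaseSparsity(long nnz1, long nnz2, long m, long n) {
    return Math.min(1, (double) nnz1 / m) * Math.min(1, (double) nnz2 / n);
  }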
2) Relevance in practice: There are three reasons why we still rely on such
conservative estimates. First, they allow us to guarantee memory
requirements, which also includes potential transitions from dense to
sparse. Second, all our matrix multiplications, except ultra-sparse ones,
use dense outputs anyway to allow efficient cache-conscious implementations.
However, note that the output estimate of an operation implicitly affects
the memory estimate of any data-dependent operations as it becomes part of
their input memory estimate. Third, whenever we cannot estimate the output
sparsity and it matters because the dimensions are large too, subsequent
operations are flagged for dynamic recompilation. During dynamic
recompilation, the exact number of nonzeros is known and used to
recompile runtime plans. There are even a couple of rewrites that
explicitly split HOP DAGs after permutation matrix multiplies, to create
recompilation hooks before any subsequent operations. In practice, dynamic
recompilation is able to mitigate compiler shortcomings such as these
conservative estimates very well.
3) Sketches: Overall, sketching is indeed a very common approach to remove
the assumption of uniformly distributed nonzeros and thus arrive at better
estimates for skewed real-world matrices. The hashing technique from the
paper Dylan mentioned is a very compelling example of that (when I first
read this paper a while ago, it reminded me of KMV synopses and the
propagation of such). [2] and [3] also used forms of sketches in terms of
systematic density maps and sampling of row/column pairs for chains of
matrix multiplications. A related challenge is the estimation of sparsity
for intermediates of an entire matrix multiplication chain. In order to
compare alternative execution plans, we need to recursively compute
sketches for intermediates that cannot be sampled directly. If anybody is
interested in that, let me know and we can have a more detailed discussion.
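As a rough illustration of the synopsis idea (this is NOT the algorithm from
the paper Dylan referenced, which avoids enumerating all candidate pairs;
the class and method names below are made up for the example), a toy
KMV-style estimator of nnz(A %*% B) under the zero-product and zero-sum-free
assumptions could look as follows, exploiting that the output nnz then
equals the number of distinct (i,k) pairs produced by joining the nonzeros
of A and B on the common index j:

  import java.util.TreeSet;

  public class NnzKmvExample {
    // aColIdx[j] = row indices of nonzeros in column j of A
    // bRowIdx[j] = column indices of nonzeros in row j of B
    public static double estimateNnz(int[][] aColIdx, int[][] bRowIdx, int sketchSize) {
      TreeSet<Double> smallest = new TreeSet<>();      // k smallest distinct hash values
      for (int j = 0; j < aColIdx.length; j++)
        for (int i : aColIdx[j])
          for (int k : bRowIdx[j]) {
            smallest.add(hash(i, k));                  // duplicate (i,k) pairs collapse
            if (smallest.size() > sketchSize)
              smallest.pollLast();
          }
      if (smallest.size() < sketchSize)
        return smallest.size();                        // fewer distinct pairs than sketch size
      return (sketchSize - 1) / smallest.last();       // standard KMV distinct-count estimate
    }

    private static double hash(int i, int k) {         // map the pair (i,k) to [0,1)
      long x = i * 0x9E3779B97F4A7C15L ^ k * 0xC2B2AE3D27D4EB4FL;
      x ^= (x >>> 31); x *= 0x7FB5D329728EA185L; x ^= (x >>> 27);
      return (x >>> 11) / (double) (1L << 53);
    }
  }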
Regards,
Matthias
[1] Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold
Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish Tatikonda, Yuanyuan
Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine
Learning Programs. IEEE Data Eng. Bull. 37(3): 52-62 (2014),
http://sites.computer.org/debull/A14sept/p52.pdf, page 7
[2] David Kernert, Frank Köhler, Wolfgang Lehner: SpMacho - Optimizing
Sparse Linear Algebra Expressions with Probabilistic Density Estimation.
EDBT 2015: 289-300, http://openproceedings.org/2015/conf/edbt/paper117.pdf,
page 6
[3] Yongyang Yu, Ming-Jie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa
M. Abbas, Mourad Ouzzani: In-Memory Distributed Matrix Computation
Processing and Optimization. ICDE 2017: 1047-1058, page 3
On Tue, Jun 20, 2017 at 2:46 PM, Dylan Hutchison <dhutchis@cs.washington.edu>
wrote:
> Hi Nakul,
>
> The paper I referenced uses the term "Boolean matrix", but really it
> applies to any matrix product in which two properties hold:
>
> 1. Zero product property. a * b = 0 ==> a = 0 or b = 0
> 2. Few, if any, "sums to zero". There should be few cases of nonzero
> partial products summing to zero.
>
> Property #1 holds for standard arithmetic * over the reals. If we multiply
> matrices with positive entries, then there are no sums to zero. Otherwise,
> if there are many positive and negative entries, the number of nonzero
> entries in the result may be fewer than the paper's algorithm would
> predict.
>
> Specifically, under these two properties, nnz( A * B ) = nnz( A_0 *
> B_0 ), where A_0 is the "zero norm", which maps nonzero entries to
> 1 and zero entries to 0.
>
> I don't know about the origin of the current heuristic for SystemML.
>
> Cheers, Dylan
>
>
> On Tue, Jun 20, 2017 at 2:00 PM, Nakul Jindal <nakul02@gmail.com> wrote:
>
> > Thank you Dylan. The paper seems interesting. The abstract reads as
> > follows: "We consider the problem of doing fast and reliable estimation
> > of the number z of nonzero entries in a sparse boolean matrix product"...
> > I think they may be doing this for boolean matrices. The code that I
> > pointed to in SystemML is used for all matrices.
> > I am not sure if and how their technique translates to non-boolean
> > matrices.
> >
> > Also, I am asking for an explanation of what is implemented as opposed to
> > what could be.
> > I was wondering if the neat-looking formula was taken from some
> > literature somewhere?
> >
> > Nakul
> >
> >
> >
> >
> > On Tue, Jun 20, 2017 at 1:42 PM, Dylan Hutchison <dhutchis@cs.washington.edu>
> > wrote:
> >
> > > In case it is helpful, this 2010 paper
> > > <https://www.semanticscholar.org/paper/Better-Size-Estimation-for-Sparse-Matrix-Products-Amossen-Campagna/52f70406bb4d887e1b79af81a746184c5723848a>
> > > is my favorite for estimating the nnz of a matrix product with high
> > > confidence. The algorithm is much more involved than the simple SystemML
> > > calculation, because it looks at samples from the actual data.
> > >
> > > Here are some notes I have on the paper:
> > >
> > > There is a sketch algorithm [2] that gives a (1 + ε) approximation z̃
> > > to z for any ε > 4 / (z^.25) in O(m) time and with a bound on the
> > > number of I/Os. In relational algebra, this is estimating
> > > π_{ik}( A(i, j) ⋈ B(j, k) ). The idea behind the algorithm is finding,
> > > for some tuning parameter k that varies with z, the k-smallest value of
> > > a hash function h(i; k). The larger this value is, the more i's and k's
> > > that match for a given j. Repeat this for all j's. Different (i, k)'s
> > > contribute to different entries in the result matrix and have different
> > > hash values. A sketch for this algorithm can be incrementally maintained
> > > via independent samples on i's and k's. Computing the z̃ estimate from
> > > the sample takes o(n) time for large enough z. The larger z is, the
> > > smaller a sample size we need. (Need large samples for small z.)
> > > (Everything above assumes the zero product property and zero-sum-free.)
> > >
> > >
> > >
> > > On Tue, Jun 20, 2017 at 12:06 PM, Nakul Jindal <nakul02@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > There is code in systemml to “estimate” the output sparsity of a
> > > > matrix multiplication operation between two sparse matrices.
> > > > Is there a citation for this? If not, have we done any tests to
> > > > figure out how good these estimates are?
> > > > https://github.com/nakul02/systemml/blob/df8d4a63d8d09cae94b6ca2634e31da554302c72/src/main/java/org/apache/sysml/hops/OptimizerUtils.java#L944-L953
> > > >
> > > > Thanks,
> > > > Nakul
> > > >
> > >
> >
>
