spark-issues mailing list archives

From "Hao Ren (JIRA)" <>
Subject [jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible
Date Sat, 03 Dec 2016 20:39:58 GMT


Hao Ren commented on SPARK-18581:

I checked several (mu, sigma) pairs in R.
The package I used is: mvtnorm
The numerical difference in the pdf between MLlib and R is negligible, whether sigma
is invertible or (near-)singular.
Hence, there is no problem here.

Here is my code, which can generate R code for cross-checking:

> MultivariateGaussian does not check if covariance matrix is invertible
> ----------------------------------------------------------------------
>                 Key: SPARK-18581
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.2, 2.0.2
>            Reporter: Hao Ren
> When training a GaussianMixtureModel, I found some probabilities much larger than 1. That
> led me to the fact that the value returned by MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the determinant
> of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive definite matrix.

> In my case, one feature is 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 0
> => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which makes logpdf
> much bigger; pdf will be even bigger (> 1).
> As stated in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance based on
> machine precision.
> """
> But if a singular value is considered to be zero, the covariance matrix is not
> invertible, which contradicts the assumption that it should be invertible.
> So we should check whether any singular value is smaller than the tolerance before computing
> the pseudo-determinant.
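
To make the mechanism described in the report concrete, here is a minimal Python sketch, not the MLlib implementation. It assumes a diagonal covariance matrix, so the variances are the singular values. The function name log_pdf_diagonal and the tolerance rule (machine epsilon times the largest variance times the dimension) are assumptions for illustration only. It shows that a variance of exactly 0 falls below the tolerance and is dropped, while a tiny variance that still exceeds the tolerance is kept and pushes the density far above 1.

```python
import math
import sys


def log_pdf_diagonal(x, mu, variances):
    """Log-density of a Gaussian with diagonal covariance, using a pseudo-determinant.

    Returns (log_pdf, dropped), where dropped counts the variances treated as zero.
    Hypothetical helper for illustration; not part of Spark.
    """
    k = len(variances)
    # Assumed tolerance rule: machine epsilon scaled by the largest variance and by k.
    tol = sys.float_info.epsilon * max(variances) * k
    log_pseudo_det = 0.0
    quad = 0.0
    dropped = 0
    for xi, mi, v in zip(x, mu, variances):
        if v > tol:
            # Retained direction: contributes to the pseudo-determinant and the quadratic form.
            log_pseudo_det += math.log(v)
            quad += (xi - mi) ** 2 / v
        else:
            # Variance considered zero: the direction is ignored entirely.
            dropped += 1
    log_pdf = -0.5 * (k * math.log(2.0 * math.pi) + log_pseudo_det + quad)
    return log_pdf, dropped


if __name__ == "__main__":
    origin = [0.0, 0.0]
    # One feature is constant (variance exactly 0): it is dropped.
    lp_singular, n_dropped = log_pdf_diagonal(origin, origin, [1.0, 0.0])
    print(math.exp(lp_singular), n_dropped)
    # One feature has a tiny but retained variance: the density at the mean is huge.
    lp_near, n_dropped = log_pdf_diagonal(origin, origin, [1.0, 1e-12])
    print(math.exp(lp_near), n_dropped)
```

In this sketch the second case gives a density on the order of 10^5 at the mean, which matches the magnitude mentioned in the report. A density above 1 is legitimate for a continuous distribution concentrated in a very small region, which is consistent with the R cross-check finding no discrepancy.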
