[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718706#comment-15718706 ]

Hao Ren commented on SPARK-18581:

I checked several (mu, sigma) pairs in R, using the mvtnorm package.
The numerical difference in pdf values between MLlib and R is negligible, whether sigma
is invertible or (near-)singular.
Hence, there is no problem here.
Here is my code, which generates R code for cross-checking:
https://gist.github.com/invkrh/2a5422c01a3c3a063f504f1f099cbdae

> MultivariateGaussian does not check if covariance matrix is invertible
> 
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.6.2, 2.0.2
> Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger than 1. That
> led me to the fact that the value returned by MultivariateGaussian.pdf can be as large as 10^5.
> After reviewing the code, I found that the problem lies in the computation of the determinant
> of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive definite matrix.
> In my case, one feature is 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 0
> => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which makes logpdf
> very large, and pdf even larger (> 1).
> As the comments in MultivariateGaussian.scala say,
> """
> Singular values are considered to be nonzero only if they exceed a tolerance based on
> machine precision.
> """
> But if a singular value is considered to be zero, the covariance matrix is
> non-invertible, which contradicts the assumption that it should be invertible.
> So we should check whether any singular value is smaller than the tolerance before computing
> the pseudo-determinant.
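
The mechanism described in the quoted report can be illustrated with a small sketch. This is not MLlib's code: it assumes a diagonal covariance (so the eigenvalues are simply the per-feature variances), and the function name `log_pdf_diag`, the tolerance formula, and the choice to keep the full dimension k in the normalizing constant are all assumptions made for illustration. It shows that a small but nonzero variance pushes the density well above 1, while a variance at or below the tolerance is dropped from the pseudo-determinant.

```python
import math
import sys


def log_pdf_diag(x, mu, variances, tol=None):
    """Log-density of a Gaussian with diagonal covariance, using a pseudo-determinant.

    Hypothetical helper for illustration only; not the MLlib implementation.
    """
    k = len(mu)
    if tol is None:
        # Assumed tolerance: machine epsilon scaled by the largest variance and the dimension.
        tol = sys.float_info.epsilon * max(variances) * k
    log_pdet = 0.0  # log of the product of the variances treated as nonzero
    quad = 0.0      # quadratic form restricted to those same directions
    for xi, mi, v in zip(x, mu, variances):
        if v > tol:
            log_pdet += math.log(v)
            quad += (xi - mi) ** 2 / v
        # Variances at or below tol are treated as zero and skipped.
    return -0.5 * (k * math.log(2.0 * math.pi) + log_pdet + quad)


mu = [0.0, 0.0]
at_mean = [0.0, 0.0]
# Shrink the second variance towards zero and evaluate the density at the mean.
for v2 in (1.0, 1e-4, 1e-10, 0.0):
    lp = log_pdf_diag(at_mean, mu, [1.0, v2])
    print(v2, lp, math.exp(lp))
```

With these inputs the density at the mean grows as the second variance shrinks, and falls back to the unit-variance value once that variance is exactly zero and therefore excluded. A density value above 1 is not by itself an error, since a pdf only has to integrate to 1.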

