spark-dev mailing list archives

From 王磊(安全部) <wangleikidd...@didichuxing.com>
Subject Re: Could we expose log likelihood of EM algorithm in MLLIB?
Date Sat, 08 Oct 2016 03:43:52 GMT
https://issues.apache.org/jira/browse/SPARK-17825

Actually, I have already created a JIRA. Could you let me know your progress so that we avoid duplicated work?

Thanks!

From: didi <wangleikidding@didichuxing.com>
Date: Saturday, October 8, 2016, 12:21 AM
To: Yanbo Liang <ybliang8@gmail.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

Thanks for replying.
When could you send out the PR?

From: Yanbo Liang <ybliang8@gmail.com>
Date: Friday, October 7, 2016, 11:35 PM
To: didi <wangleikidding@didichuxing.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I had a similar requirement in my work. I'm currently copying the
implementation from mllib to ml and then exposing the maximum log likelihood. I will send this
PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wangleikidding@didichuxing.com> wrote:

Hi,

Do you guys sometimes need to get the log likelihood of the EM algorithm in MLlib?

I mean the value in this line https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

I am copying the code here:


        val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

        // Create new distributions based on the partial assignments
        // (often referred to as the "M" step in literature)
        val sumWeights = sums.weights.sum

        if (shouldDistributeGaussians) {
          val numPartitions = math.min(k, 1024)
          val tuples =
            Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
          val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
            updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
          }.collect().unzip
          Array.copy(ws.toArray, 0, weights, 0, ws.length)
          Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
        } else {
          var i = 0
          while (i < k) {
            val (weight, gaussian) =
              updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
            weights(i) = weight
            gaussians(i) = gaussian
            i = i + 1
          }
        }

        llhp = llh // current becomes previous
        llh = sums.logLikelihood // this is the freshly computed log-likelihood
        iter += 1
        compute.destroy(blocking = false)

In my application, I need the log likelihood to compare models fitted with different numbers
of clusters. I then use the number of clusters that gives the maximum log likelihood.

Is it a good idea to expose this value?
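
To make the selection procedure described above concrete, here is a minimal sketch in plain Scala with no Spark dependency. It recomputes the log likelihood of a one-dimensional Gaussian mixture from its weights and components, then picks the cluster count with the largest value. The names `Gaussian1D`, `MixtureLogLikelihood`, `logLikelihood`, and `bestK` are hypothetical and are not part of MLlib; the one-dimensional density is a simplifying assumption for illustration.

```scala
// Hypothetical helper type for illustration; not an MLlib class.
final case class Gaussian1D(mean: Double, variance: Double) {
  require(variance > 0.0, "variance must be positive")

  // Density of N(mean, variance) at x.
  def pdf(x: Double): Double = {
    val d = x - mean
    math.exp(-d * d / (2.0 * variance)) / math.sqrt(2.0 * math.Pi * variance)
  }
}

object MixtureLogLikelihood {
  // Sum over all points of log(sum_i w_i * N(x | mean_i, variance_i)).
  // Note: a point far from every component can underflow to log(0) = -Infinity.
  def logLikelihood(data: Seq[Double],
                    weights: Seq[Double],
                    comps: Seq[Gaussian1D]): Double = {
    require(weights.length == comps.length, "one weight per component")
    data.map { x =>
      val p = weights.zip(comps).map { case (w, g) => w * g.pdf(x) }.sum
      math.log(p)
    }.sum
  }

  // Given the log likelihood obtained for each candidate k,
  // return the k with the largest value.
  def bestK(llhByK: Map[Int, Double]): Int = llhByK.maxBy(_._2)._1
}
```

With a fitted MLlib `GaussianMixtureModel`, the same sum could presumably be formed from `model.weights` and `model.gaussians(i).pdf(x)` over the input RDD. That wiring is an assumption about a possible workaround, not the change proposed for the JIRA.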



