spark-user mailing list archives

From Charles Earl <charles.ce...@gmail.com>
Subject Re: How to speed up MLlib LDA?
Date Tue, 22 Sep 2015 17:57:30 GMT
It seems that the Vowpal Wabbit version is most similar to what is in

https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala

although the Intel code seems to implement the Hierarchical Dirichlet
Process (topics and subtopics), whereas the VW implementation is based on

   https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf

As opposed to Monte Carlo methods, both the HDP code and VW use iterative
optimization of the model parameters with respect to the predicted tokens
(my best shot at a one-sentence summary).
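For anyone following along: MLlib itself already ships an online variational
optimizer based on that same Hoffman/Blei/Bach paper. A minimal sketch of
wiring it up (the K and mini-batch values here are placeholders I made up,
not tuned):

import org.apache.spark.mllib.clustering.{LDA, LDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: (docId, termCountVector) pairs, the same shape topicDistributions consumes
def trainOnlineLDA(corpus: RDD[(Long, Vector)]): LDAModel =
  new LDA()
    .setK(20) // placeholder topic count
    .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05)) // placeholder fraction
    .run(corpus)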
The VW code is *highly* optimized.

https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/lda_core.cc
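If I recall correctly, a big chunk of that speedup comes from approximating
the special functions (digamma, lgamma, pow) rather than computing them
exactly. A rough, untested Scala sketch of the standard digamma trick
(shift-up recurrence plus the asymptotic series), just to illustrate the idea:

// psi(x) = psi(x + 1) - 1/x; for large x,
// psi(x) ~ ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
def fastDigamma(x0: Double): Double = {
  var x = x0
  var acc = 0.0
  while (x < 6.0) { // shift x up to where the series converges fast
    acc -= 1.0 / x
    x += 1.0
  }
  val inv  = 1.0 / x
  val inv2 = inv * inv
  acc + math.log(x) - 0.5 * inv -
    inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
}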
A fast inferencer for Spark LDA would be of great value.
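In the meantime, one cheap thing Marko might try below: batch all four inputs
into a single RDD and call topicDistributions once, so the per-job overhead is
paid once instead of four times. A sketch reusing his own helpers (I'm
assuming an "inputs" Seq holding the four documents):

// assumes model, ctx, vocabularySize, Transformers from Marko's snippet below
val samples = Transformers.toSparseVectors(vocabularySize, ctx.parallelize(inputs))
model.topicDistributions(samples.zipWithIndex.map(_.swap)) // one job for all four inputs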
C

On Tue, Sep 22, 2015 at 1:30 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
wrote:

> I helped some with the LDA and worked quite a bit on a Gibbs version. I
> don't know if the Gibbs version might help, but since it is not (yet) in
> MLlib, Intel Analytics kindly created a spark package with their adapted
> version plus a couple other LDA algorithms:
> http://spark-packages.org/package/intel-analytics/TopicModeling
> https://github.com/intel-analytics/TopicModeling
>
> It might be worth trying out. Do you know what LDA algorithm VW uses?
>
> Pedro
>
>
> On Tue, Sep 22, 2015 at 1:54 AM, Marko Asplund <marko.asplund@gmail.com>
> wrote:
>
>> Hi,
>>
>> I did some profiling for my LDA prototype code that requests topic
>> distributions from a model.
>> According to Java Mission Control, more than 80% of the execution time
>> during the sample interval is spent in the following methods:
>>
>> org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
>> org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
>> org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
>> java.lang.Double.valueOf(double); count: 31; 4.33%
>>
>> Is there any way of using the API more optimally?
>> Are there any opportunities for optimising the "topicDistributions" code
>> path in MLlib?
>>
>> My code looks like this:
>>
>> // executed once
>> val model = LocalLDAModel.load(ctx, ModelFileName)
>>
>> // executed four times
>> val samples = Transformers.toSparseVectors(vocabularySize,
>>   ctx.parallelize(Seq(input))) // fast
>> model.topicDistributions(samples.zipWithIndex.map(_.swap))
>> // <== this seems to take about 4 seconds to execute
>>
>>
>> marko
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 208-340-1703
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
- Charles
