From Ian Upright <ian-pub...@upright.net>
Subject Re: Yahoo's LDA code
Date Wed, 29 Jun 2011 15:20:58 GMT
I also wonder what memory limitations it has compared to the Mahout
implementation, with regard to the number of terms, topics, and documents.

Ian

>So I tried Yahoo LDA on 52M documents with 1000 topics.
>Yahoo LDA with a dictionary of 100k terms does 1 iteration every 30 minutes
>on a single machine using 4 cores.
>Mahout LDA using 20 nodes and a dictionary of 30k terms takes more than a day
>per iteration and didn't complete (something about an output error during the
>reduce step; this may be a CDHbeta3 issue, I'm not sure, since Reuters
>clusters fine).
>Hopefully the ideas from the Yahoo version can be incorporated into the
>Mahout LDA.
>On Fri, Jun 10, 2011 at 6:49 AM, Federico Castanedo <castanedofede@gmail.com> wrote:
>> Hi all,
>>
>> I went through the referenced paper, and it seems that, besides all the
>> distributed tasks, the way inference for \alpha and \beta
>> is performed was the key element in improving LDA training performance.
>> They use SGD for the hyperparameter adjustment of \alpha.
>> Best,
>> Federico
>> 2011/6/10 Jake Mannix <jake.mannix@gmail.com>
>>
>> > It's all C++: custom distributed processing, custom distributed
>> > coordination and storage.
>> > We can certainly try to port over the algorithmic ideas, but the
>> > distributed systems stuff would be a significant departure from our
>> > current setup. It's not a web service, it's not Hadoop, and it's not a
>> > command line utility; it's a cluster of long-running processes all
>> > intercommunicating.  Sounds awesome, but that's a ways off from where
>> > we are now.
>> >
>> >  -jake
>> >
>> > On Thu, Jun 9, 2011 at 7:52 PM, Stanley Xu <wenhao.xu@gmail.com> wrote:
>> >
>> > > Awesome! Guess it would be much faster than the current version in
>> > > Mahout.
>> > > Is it possible to just use this version in Mahout?
>> > >
>> > > On Fri, Jun 10, 2011 at 8:12 AM, <jeremy@lewi.us> wrote:
>> > >
>> > > > Yahoo released its hadoop code for LDA
>> > > >
>> > > > http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation
