mahout-user mailing list archives

From: Jake Mannix
Subject: Re: LDA related enhancements
Date: Thu, 21 Apr 2011 04:27:35 GMT

  This sounds great!

On Wed, Apr 20, 2011 at 9:08 PM, Vasil Vasilev wrote:

> Hi Mahouters,
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors and not
> the tf-idf ones which result from seq2sparse. Due to this fact, words like
> "and", "where", etc. also get included in the resulting topics. To prevent
> that I run seq2sparse with the whole tf-idf sequence and then run the
> "pruner". It first calculates the standard deviation of the document
> frequencies of the words and then prunes all entries in the tf vectors
> whose document frequency is greater than 3 times the calculated standard
> deviation. This keeps most of the word population while still pruning
> the unnecessary ones.

If you could add this to the seq2sparse functionality in general, it
would be better than the minDf / maxDf approach we currently use.
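
To make the pruning rule concrete, here is a minimal sketch in plain Python
(not Mahout code). It assumes each tf vector is a dict mapping term to count;
the function names and the num_std parameter are hypothetical. It follows the
description literally: the cutoff is 3 times the standard deviation of the
document frequencies, not the mean plus 3 standard deviations.

```python
import math
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for every term, the number of documents it occurs in."""
    df = Counter()
    for vec in tf_vectors:
        df.update(term for term, count in vec.items() if count > 0)
    return df


def prune_tf_vectors(tf_vectors, num_std=3.0):
    """Drop terms whose document frequency exceeds num_std * std(df).

    tf_vectors: list of dicts {term: term_frequency} (assumed layout).
    Returns new dicts; the input is left unmodified.
    """
    df = document_frequencies(tf_vectors)
    values = list(df.values())
    if not values:
        return [dict(vec) for vec in tf_vectors]

    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        # All terms have the same document frequency: nothing stands out,
        # so keep everything rather than pruning the whole vocabulary.
        return [dict(vec) for vec in tf_vectors]

    threshold = num_std * std
    keep = {term for term, n in df.items() if n <= threshold}
    return [{t: c for t, c in vec.items() if t in keep} for vec in tf_vectors]


if __name__ == "__main__":
    # Toy corpus: "and" occurs in every document, every other word in one.
    corpus = [{"and": 5, "w%d" % i: 2} for i in range(20)]
    pruned = prune_tf_vectors(corpus)
    print(pruned[0])
```

On this toy corpus "and" is the only term whose document frequency exceeds
the cutoff, so it is the only one removed.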

> 2. Implemented the alpha-estimation part of the LDA algorithm as described
> in the Blei, Ng, Jordan paper. This leads to better results in maximizing
> the log-likelihood for the same number of iterations. Just an example - for
> 20 iterations on the reuters data set the enhanced algorithm reaches a value
> of -6975124.693072233, compared to -7304552.275676554 with the original
> implementation.


> 3. Created an LDA Vectorizer. It executes only the inference part of the LDA
> algorithm based on the last LDA state and the input document vectors, and
> for each vector produces a vector of the gammas that are the result of the
> inference. The idea is that the vectors produced in this way can be used
> for clustering with any of the existing algorithms (like canopy, kmeans, etc.)

Yeah, I've got code which does this too, and keep meaning to clean it up
for submission, but if yours is ready to go, file a JIRA, submit a patch! :)
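
As a rough illustration of how the per-document gamma vectors described above
could be used as features, here is a small sketch in plain Python (not Mahout
code). It assumes each gamma vector is a list of positive floats, one entry
per topic; the function names are hypothetical. Normalizing gamma gives topic
proportions, which matters for Euclidean-distance clustering such as kmeans;
cosine similarity is scale-invariant, so it gives the same ranking either way.

```python
import math


def topic_proportions(gamma):
    """Scale a gamma vector so that its entries sum to 1."""
    total = sum(gamma)
    return [g / total for g in gamma]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_gamma, doc_gammas):
    """Return (similarity, doc_index) pairs, most similar document first."""
    query = topic_proportions(query_gamma)
    scored = [
        (cosine_similarity(query, topic_proportions(g)), i)
        for i, g in enumerate(doc_gammas)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Three toy documents, each dominated by a different topic.
    docs = [[9.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 9.0]]
    for score, index in rank_by_similarity([8.0, 2.0, 1.0], docs):
        print(index, round(score, 3))
```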

The gamma vector is totally helpful, it lets you do LSI-style search, as

