Vasil,
This sounds great!
On Wed, Apr 20, 2011 at 9:08 PM, Vasil Vasilev <vavasilev@gmail.com> wrote:
> Hi Mahouters,
>
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors and not
> the tfidf ones that result from seq2sparse. Due to this fact, words like
> "and", "where", etc. also get included in the resulting topics. To prevent
> that, I run seq2sparse with the whole tfidf sequence and then run the
> "pruner". It first calculates the standard deviation of the document
> frequencies of the words and then prunes all entries in the tf vectors
> whose document frequency is bigger than 3 times the calculated standard
> deviation. This ensures including most of the word population, while still
> pruning the unnecessary ones.
>
If you could (optionally) add this to the seq2sparse functionality in
general, it would be better than the minDf / maxDf way we currently do
this.
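For anyone following along, here's a rough standalone sketch of the pruning
rule as I understand it, in plain Java. Class and method names are
illustrative only, not Mahout APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Mahout code): drop every term whose document
// frequency exceeds 3x the standard deviation of all document frequencies,
// as described in the message above.
public class DfPruner {

    // Standard deviation of the document-frequency counts.
    static double stdDev(Map<String, Integer> df) {
        double mean = df.values().stream()
                .mapToInt(Integer::intValue).average().orElse(0.0);
        double var = df.values().stream()
                .mapToDouble(c -> (c - mean) * (c - mean)).average().orElse(0.0);
        return Math.sqrt(var);
    }

    // Keep only the terms whose df is at most 3x the standard deviation.
    static Map<String, Integer> prune(Map<String, Integer> df) {
        double threshold = 3.0 * stdDev(df);
        Map<String, Integer> kept = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() <= threshold) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        for (int i = 0; i < 20; i++) {
            df.put("topicword" + i, 30);  // rare, topic-bearing terms
        }
        df.put("and", 950);  // near-ubiquitous, stop-word-like terms
        df.put("the", 900);
        System.out.println(prune(df).size());  // prints 20
    }
}
```

With a vocabulary dominated by low-df terms, the two near-ubiquitous terms
land well above the 3-sigma threshold and get dropped.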
> 2. Implemented the alpha-estimation part of the LDA algorithm as described
> in the Blei, Ng, Jordan paper. This leads to better results in maximizing
> the log-likelihood for the same number of iterations. Just as an example:
> for 20 iterations on the Reuters data set, the enhanced algorithm reaches
> a value of 6975124.693072233, compared to 7304552.275676554 with the
> original implementation.
>
Awesome.
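For the archives, and if I'm reading the paper's appendix right, the
Newton-Raphson step for a symmetric (exchangeable) Dirichlet alpha, with M
documents, k topics, digamma \Psi and trigamma \Psi', is:

```latex
\frac{\partial L}{\partial \alpha}
  = M\left(k\,\Psi(k\alpha) - k\,\Psi(\alpha)\right)
  + \sum_{d=1}^{M}\sum_{i=1}^{k}
      \left(\Psi(\gamma_{di})
        - \Psi\!\Big(\textstyle\sum_{j=1}^{k}\gamma_{dj}\Big)\right)

\frac{\partial^2 L}{\partial \alpha^2}
  = M\left(k^{2}\,\Psi'(k\alpha) - k\,\Psi'(\alpha)\right)

\alpha \leftarrow \alpha
  - \frac{\partial L/\partial \alpha}{\partial^{2} L/\partial \alpha^{2}}
```

The sums over the \Psi(\gamma) terms come straight out of the E-step, so the
update is cheap once inference has run.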
> 3. Created an LDA vectorizer. It executes only the inference part of the
> LDA algorithm, based on the last LDA state and the input document vectors,
> and for each vector produces a vector of the gammas that result from the
> inference. The idea is that the vectors produced in this way can be used
> for clustering with any of the existing algorithms (like canopy, k-means,
> etc.)
>
Yeah, I've got code which does this too, and I keep meaning to clean it up
for submission, but if yours is ready to go, file a JIRA and submit a patch! :)
The gamma vector is totally helpful; it lets you do LSI-style search, as
well.
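To make the LSI-style angle concrete, here's a toy sketch: treat each
document's gamma (its topic-space vector from LDA inference) as a dense
vector and rank documents against a query by cosine similarity. The gammas
below are made up, and none of this is Mahout API:

```java
// Illustrative sketch: ranking documents by cosine similarity between
// their LDA gamma vectors and a query's gamma vector.
public class GammaSearch {

    // Standard cosine similarity between two dense vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Made-up 3-topic gammas for three documents.
        double[][] docGammas = {
            {9.1, 1.2, 1.0},   // mostly topic 0
            {1.1, 8.7, 1.3},   // mostly topic 1
            {1.0, 1.1, 9.4},   // mostly topic 2
        };
        // Query whose inference put most mass on topic 0.
        double[] queryGamma = {8.5, 1.4, 1.1};
        int best = 0;
        for (int d = 1; d < docGammas.length; d++) {
            if (cosine(docGammas[d], queryGamma)
                    > cosine(docGammas[best], queryGamma)) {
                best = d;
            }
        }
        System.out.println("best match: doc " + best);  // doc 0
    }
}
```

The same vectors feed straight into canopy or k-means, since they're just
dense k-dimensional points.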
jake
