Hi all,
The LDA Vectorization patch is ready. You can take a look at:
https://issues.apache.org/jira/browse/MAHOUT-683
Regards, Vasil
On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev wrote:
OK. I am going to try out 1) as suggested by Jake, then write a couple of tests, and then I will file the Jiras.
On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll wrote:
On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>> > Hi Mahouters,
>> > I was experimenting with the LDA clustering algorithm on the Reuters
>> > data set and made several enhancements which, if you find them
>> > interesting, I could contribute to the project:
>> >
>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors and
>> > not the tf-idf ones that result from seq2sparse. Because of this, words
>> > like "and", "where", etc. also get included in the resulting topics. To
>> > prevent that, I run seq2sparse with the whole tf-idf sequence and then
>> > run the "pruner". It first calculates the standard deviation of the
>> > document frequencies of the words and then prunes all entries in the tf
>> > vectors whose document frequency is greater than 3 times the calculated
>> > standard deviation. This keeps most of the word population while still
>> > pruning the unnecessary words.
>> >
>> > 2. Implemented the alpha-estimation part of the LDA algorithm as
>> > described in the Blei, Ng, Jordan paper. This leads to better results in
>> > maximizing the log-likelihood for the same number of iterations. As an
>> > example, after 20 iterations on the Reuters data set the enhanced
>> > algorithm reaches a value of 6975124.693072233, compared to
>> > 7304552.275676554 with the original implementation.
>> >
>> > 3. Created an LDA Vectorizer. It executes only the inference part of the
>> > LDA algorithm, based on the last LDA state and the input document
>> > vectors, and for each input vector produces a vector of the gammas that
>> > result from the inference. The idea is that the vectors produced in this
>> > way can be used for clustering with any of the existing algorithms
>> > (like canopy, kmeans, etc.).
>> >
>> As Jake says, this all sounds great. Please see:
>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>
>>
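For readers following the thread, the pruning rule from item 1 of the quoted message could be sketched roughly as below. This is only an illustration in plain Python, not the code from the patch: the function names (`document_frequencies`, `prune_tf_vectors`), the dict-based vector representation, the `max_sigmas` parameter, and the choice of the population (rather than sample) standard deviation are all assumptions made for the example.

```python
import math
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for vec in tf_vectors:
        df.update(term for term, count in vec.items() if count > 0)
    return df


def prune_tf_vectors(tf_vectors, max_sigmas=3.0):
    """Drop entries whose document frequency exceeds max_sigmas * std.

    tf_vectors: list of {term: term_frequency} dicts (hypothetical layout).
    The standard deviation is taken over the document frequencies of all
    distinct terms; the population form is an assumption of this sketch.
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]

    values = list(df.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    threshold = max_sigmas * std

    # Note: if every term has the same document frequency, std is 0 and
    # this literal reading of the rule prunes everything.
    return [
        {term: count for term, count in vec.items() if df[term] <= threshold}
        for vec in tf_vectors
    ]
```

As a usage example, with 20 documents that each contain "and" plus five terms unique to that document, "and" has a document frequency of 20 while every other term has 1, so only "and" exceeds the threshold and is removed. The input vectors are left unmodified; new dicts are returned.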
