Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
MIME-Version: 1.0
In-Reply-To: <BANLkTi=79-e0hy_k2hdLZdiT=drrt=HsFA@mail.gmail.com>
References: <BANLkTimmEouQz4peA-qRg0Lwf4VGzAfaVg@mail.gmail.com>
	<68907B9D-D0F9-4692-9BCF-F67BF8A8D8D0@apache.org>
	<BANLkTi=79-e0hy_k2hdLZdiT=drrt=HsFA@mail.gmail.com>
Date: Thu, 28 Apr 2011 10:53:59 +0300
Message-ID: <BANLkTinCovze2iO1Bvg7dkFFwcKMbvW8KQ@mail.gmail.com>
Subject: Re: LDA related enhancements
From: Vasil Vasilev <vavasilev@gmail.com>
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=0015174bdf5e2311af04a1f5dc4b

--0015174bdf5e2311af04a1f5dc4b
Content-Type: text/plain; charset=ISO-8859-1

Hi all,

The LDA Vectorization patch is ready. You can take a look at:
https://issues.apache.org/jira/browse/MAHOUT-683

Regards, Vasil

On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <vavasilev@gmail.com> wrote:

> Ok. I am going to try out 1) as suggested by Jake, then write a couple of
> tests, and then I will file the JIRA issues.
>
>
> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>>
>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>
>> > Hi Mahouters,
>> >
>> > I was experimenting with the LDA clustering algorithm on the Reuters
>> > data set and I made several enhancements which, if you find them
>> > interesting, I could contribute to the project:
>> >
>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors, not
>> > the tf-idf ones that result from seq2sparse. Because of this, words like
>> > "and", "where", etc. also get included in the resulting topics. To
>> > prevent that, I run seq2sparse with the whole tf-idf sequence and then
>> > run the "pruner". It first calculates the standard deviation of the
>> > document frequencies of the words and then prunes all entries in the tf
>> > vectors whose document frequency is bigger than 3 times the calculated
>> > standard deviation. This keeps most of the word population while still
>> > pruning the unnecessary words.
>> >
>> > 2. Implemented the alpha-estimation part of the LDA algorithm as
>> > described in the Blei, Ng, Jordan paper. This leads to better results in
>> > maximizing the log-likelihood for the same number of iterations. Just an
>> > example: for 20 iterations on the Reuters data set the enhanced
>> > algorithm reaches a value of -6975124.693072233, compared to
>> > -7304552.275676554 with the original implementation.
>> >
>> > 3. Created an LDA Vectorizer. It executes only the inference part of the
>> > LDA algorithm, based on the last LDA state and the input document
>> > vectors, and for each vector produces a vector of the gammas that result
>> > from the inference. The idea is that the vectors produced in this way
>> > can be used for clustering with any of the existing algorithms (like
>> > canopy, kmeans, etc.).
>> >
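
To make item 3 above a bit more concrete, here is a rough sketch of what
"inference only" could look like for a single document. It is not the code
from the patch and does not use Mahout's classes. It assumes the standard
variational updates from the Blei, Ng, Jordan paper, a symmetric alpha, and
a topic-word probability table taken from the last LDA state. The names
digamma, infer_gamma and topic_word_probs are made up for illustration.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply the asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += (math.log(x) - 0.5 * inv
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result


def infer_gamma(doc_tf, topic_word_probs, alpha, iterations=50, tol=1e-6):
    """Return the gamma vector (one entry per topic) for one document.

    doc_tf           -- dict: term index -> term count (the tf vector)
    topic_word_probs -- list of K lists, topic_word_probs[k][w] = p(w | topic k)
                        (hypothetical layout, assumed to come from the LDA state)
    alpha            -- symmetric Dirichlet prior on the topic mixture
    """
    k = len(topic_word_probs)
    total = sum(doc_tf.values())
    gamma = [alpha + total / k] * k  # usual initialisation
    for _ in range(iterations):
        # exp(psi(gamma_k)); the psi(sum(gamma)) term cancels when
        # phi is normalised, so it is left out.
        weight = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * k
        for w, count in doc_tf.items():
            phi = [topic_word_probs[t][w] * weight[t] for t in range(k)]
            norm = sum(phi)
            if norm == 0.0:
                continue  # term has zero probability under every topic
            for t in range(k):
                new_gamma[t] += count * phi[t] / norm
        change = max(abs(a - b) for a, b in zip(gamma, new_gamma))
        gamma = new_gamma
        if change < tol:
            break
    return gamma
```

The returned gamma list is the per-document vector that could then be fed
to canopy, kmeans and so on. Only the topic model is read here; it is never
updated, which is the point of running inference alone.
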
>>
>> As Jake says, this all sounds great.  Please see:
>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>
>>
>
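
The pruning rule from item 1 in the quoted proposal can be sketched in a
few lines. This is only an illustration under my own assumptions, not the
patch itself: tf vectors are plain dicts mapping a term to its count, the
population standard deviation is used, and the function names and the
sigma_multiplier parameter are invented for the example.

```python
import math


def document_frequencies(tf_vectors):
    """Count, for each term, how many documents contain it."""
    df = {}
    for vec in tf_vectors:
        for term, count in vec.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1
    return df


def prune_tf_vectors(tf_vectors, sigma_multiplier=3.0):
    """Drop entries whose document frequency exceeds
    sigma_multiplier * stddev(document frequencies).

    Returns new dicts; the input vectors are not modified.
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]
    values = list(df.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    # Literal reading of the description: the cutoff is 3 * std,
    # not mean + 3 * std.
    threshold = sigma_multiplier * std
    return [
        {term: count for term, count in vec.items()
         if df.get(term, 0) <= threshold}
        for vec in tf_vectors
    ]
```

Note that with this literal cutoff a corpus in which every term has the
same document frequency has a standard deviation of zero, so everything
would be pruned. A real implementation would probably want to guard
against that case.
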

--0015174bdf5e2311af04a1f5dc4b--