Date: Thu, 28 Apr 2011 10:53:59 +0300
Subject: Re: LDA related enhancements
From: Vasil Vasilev
To: user@mahout.apache.org

Hi all,

The LDA Vectorization patch is ready. You can take a look at:
https://issues.apache.org/jira/browse/MAHOUT-683

Regards,
Vasil

On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev wrote:

> OK. I am going to try out 1) suggested by Jake, then write a couple of
> tests, and then I will file the JIRAs.
>
> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll wrote:
>
>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>
>> > Hi Mahouters,
>> >
>> > I was experimenting with the LDA clustering algorithm on the Reuters
>> > data set and I made several enhancements which, if you find them
>> > interesting, I could contribute to the project:
>> >
>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors and
>> > not the tf-idf ones that result from seq2sparse. Because of this, words
>> > like "and", "where", etc. also get included in the resulting topics. To
>> > prevent that, I run seq2sparse with the whole tf-idf sequence and then
>> > run the "pruner". It first calculates the standard deviation of the
>> > document frequencies of the words and then prunes all entries in the tf
>> > vectors whose document frequency is greater than 3 times the calculated
>> > standard deviation. This keeps most of the vocabulary while still
>> > pruning the unnecessary words.
>> >
>> > 2. Implemented the alpha-estimation part of the LDA algorithm as
>> > described in the Blei, Ng, Jordan paper. This leads to better results
>> > in maximizing the log-likelihood for the same number of iterations.
>> > Just an example: for 20 iterations on the Reuters data set, the
>> > enhanced algorithm reaches a value of -6975124.693072233, compared to
>> > -7304552.275676554 with the original implementation.
>> >
>> > 3. Created an LDA Vectorizer. It executes only the inference part of
>> > the LDA algorithm, based on the last LDA state and the input document
>> > vectors, and for each vector produces a vector of the gammas that
>> > result from the inference. The idea is that the vectors produced in
>> > this way can be used for clustering with any of the existing
>> > algorithms (like canopy, kmeans, etc.).
>>
>> As Jake says, this all sounds great. Please see:
>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
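
The pruning rule from item 1 of the quoted message can be sketched roughly as follows. This is a minimal standalone illustration in plain Python, not the code from the patch and not Mahout's API: the function names, the dict-based vector representation, and the use of the population standard deviation are all assumptions made for the example. It reads the rule literally, dropping every term whose document frequency exceeds 3 times the standard deviation of all document frequencies.

```python
import math
from collections import Counter


def document_frequencies(tf_vectors):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for vec in tf_vectors:
        df.update(term for term, count in vec.items() if count > 0)
    return df


def prune_tf_vectors(tf_vectors, sigma_multiplier=3.0):
    """Drop terms whose document frequency exceeds sigma_multiplier * std.

    tf_vectors is a list of {term: term_frequency} dicts (hypothetical
    representation chosen for this sketch). The standard deviation is the
    population one; the original message does not say which was used.
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]

    mean = sum(df.values()) / len(df)
    variance = sum((f - mean) ** 2 for f in df.values()) / len(df)
    threshold = sigma_multiplier * math.sqrt(variance)

    # Keep only the entries whose term is not "too common".
    return [
        {term: count for term, count in vec.items() if df[term] <= threshold}
        for vec in tf_vectors
    ]


# Ten toy documents: "and" occurs in all of them, every other word in one.
docs = [{"and": 2, f"w{i}": 1} for i in range(10)]
pruned = prune_tf_vectors(docs)
print(pruned[0])
```

In this toy corpus "and" has a document frequency of 10 against a cutoff of about 7.8, so it is removed while the rare words are kept. Under this literal reading, a corpus in which every term has the same document frequency has zero deviation and would lose all of its terms; a cutoff of mean plus 3 standard deviations is another plausible reading, and the message does not say which one the patch uses.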