From: Ted Dunning <ted.dunning@gmail.com>
Date: Mon, 5 Jul 2010 23:27:04 -0700
Subject: Re: SVD and Clustering
To: user@mahout.apache.org

Related to normalization, the original LSA team claimed better results with tf.idf weighting. I would tend to use log(1 + tf) · idf instead. Term weighting of this sort is quite common; document-level normalization is a bit less common. It is common practice, however, not to normalize documents but instead to drop the first eigenvector, on the theory that that is where the document norm winds up anyway. I would imagine that normalizing documents to some degree would make the numerics of computing the SVD a bit better and save the extra work of computing and then throwing away that eigenvector. The first eigenvector also takes on the load of centering the documents. I do know that I have forgotten to toss that first eigenvector on several occasions and been mystified for a time at how my results weren't as good.

On Mon, Jul 5, 2010 at 11:16 PM, Jake Mannix wrote:

> In my own experience, things like graphs (including bipartite graphs like
> ratings matrices) I normalize before *and* after, but text I don't (unit)
> normalize before, but do normalize after.
>
> The reasoning I use is that normalizing the rows of graphs has
> a meaning in the context of the graph (you're doing the PageRank-like
> thing of normalizing outflowing probability when looking at random
> walks, for example, or for ratings matrices, you're saying that
> everyone gets "one vote" to distribute amongst the things they've
> rated [these apply for doing L_1 normalization, which isn't always
> appropriate]), while I don't know if I buy the similar description of
> what pre-normalizing the rows of a text corpus would mean.
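Ted's suggestion, taken together, is: weight the term-count matrix with log(1 + tf) · idf rather than raw tf.idf, run the SVD, and discard the first singular component because it mostly absorbs document norm and centering. A minimal NumPy sketch of that pipeline (not Mahout code; the function names, the tiny example matrix, and the unsmoothed idf variant are illustrative assumptions):

```python
import numpy as np

def log_tf_idf(counts):
    """Weight a docs-x-terms count matrix with log(1 + tf) * idf,
    the variant Ted suggests in place of plain tf.idf."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)      # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))   # guard against df == 0
    return np.log1p(counts) * idf

def reduced_svd_drop_first(weighted, k):
    """Rank-k document embeddings from the SVD, skipping the first
    singular component, which tends to soak up document norm/centering."""
    U, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return U[:, 1:k + 1] * s[1:k + 1]          # components 2 .. k+1

# Toy 3-document, 3-term corpus (illustrative data only).
counts = np.array([[3.0, 0.0, 1.0],
                   [0.0, 2.0, 2.0],
                   [1.0, 1.0, 0.0]])
docs = reduced_svd_drop_first(log_tf_idf(counts), k=2)
```

The resulting `docs` rows are the reduced representations one would then hand to a clustering step; forgetting the `1:` offset reproduces exactly the "mystifyingly worse results" Ted describes.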
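Jake's "one vote" pre-normalization for graphs and ratings matrices is L_1 row normalization: each user's row is scaled to sum to one, as in a random-walk transition matrix. A small sketch of that step (the function name and example matrix are assumptions, not Mahout API):

```python
import numpy as np

def l1_normalize_rows(ratings):
    """Scale each row to sum to 1, so every user distributes one
    'vote' across rated items; all-zero rows are passed through."""
    sums = ratings.sum(axis=1, keepdims=True)
    return np.divide(ratings, sums,
                     out=np.zeros_like(ratings), where=sums != 0)

# Toy users-x-items ratings matrix (row 1 is a user with no ratings).
R = np.array([[4.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],
              [2.0, 2.0, 0.0]])
P = l1_normalize_rows(R)
```

As the quoted message notes, this L_1 scheme is meaningful for random walks and ratings but isn't obviously the right pre-normalization for text rows, where unit (L_2) normalization after the decomposition is the more common choice.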