Date: Mon, 22 Nov 2010 22:08:56 +0100
Subject: Re: Interpreting the output of SVD
From: Fernando Fernández
To: user@mahout.apache.org

Lance,

The columns of U are in some contexts called "latent factors". For
example, if we apply SVD to a document(user)-term(item) matrix, each
column of U corresponds to a factor that can be interpreted as a group of
terms (words that have similar meanings or tend to appear together in
documents of the same kind), so in this case the "latent" factors are, in
some way, "topics".

Another example is the SVD factorization applied to the famous movie
recommendation problem. The "latent" factors (columns of the U matrix)
represent something like "movie topics" (drama, horror, comedy, and
possible combinations of these...). Note that when we make movie
recommendations, we recommend movies that have a similar topic, i.e. we
are probably recommending a whole topic, not a specific movie... but SVD
helps us find which movies fall into that topic. Note also that this
"topic" could in fact be something more abstract than "drama" or
"comedy".

The interpretation of V is more or less the "transpose" of this. In the
movie example, the columns of V could be seen as a representation of
groups of users that have seen (or rated) the same movies.
So if two movies have a similar topic, they have probably been rated or
seen by the same people, so both movies will have similar values in the V
column representing that group of people...

In practice, rows of U can be used to find distances between users
(according to what they have rated), and rows of V (that is, columns of
Vt) can be used to find distances between movies (according to who has
rated them).

Lastly, the values of S, as some other users pointed out, can be seen as
"weights" of the importance of these "latent" factors when I'm trying to
see the differences between movies or between users.

Hope this helps. Please, any other user correct me if you see something
wrong in my examples.

Best,
Fernando.

2010/11/22 Ted Dunning

> Commonly the square root of S is applied to both U and V. S is a set of
> importance weightings for the otherwise normalized columns of U and V.
>
> On Mon, Nov 22, 2010 at 10:10 AM, Sean Owen wrote:
>
> > Hmm. I think I need to fix the second half of my analogy.
> >
> > It's really U x S that could be said to be users' preferences for
> > pseudo-items, and S x VT could be said to be pseudo-users' preferences
> > for real items. S itself is a diagonal matrix of course and those
> > values are kind of like "scaling factors" ... but I actually struggle
> > to come up with a good intuitive explanation of what S itself is (or
> > really, U and V by themselves).
> >
> > Anyone smarter have a nice pithy analogy?
> >
> > On Mon, Nov 22, 2010 at 11:06 AM, Sean Owen wrote:
> > >
> > > In more CF-oriented terms, S is an expression of pseudo-users'
> > > preferences for pseudo-items. And then U expresses how much each real
> > > user corresponds to each pseudo-user, and likewise for V and items.
> > >
> > > To put out a speculative analogy -- let's say we're looking at users'
> > > preferences for songs. The "pseudo-items" that the SVD comes up with
> > > might correspond to something like genres, or logical groupings of
> > > songs.
> > > "Pseudo-users" are something like types of listeners, perhaps
> > > corresponding to demographics.
> > >
> > > Whereas an entry in the original matrix makes a statement like "Tommy
> > > likes the band Filter", an entry in S makes a statement like "Teenage
> > > boys in moderately affluent households like industrial metal". And U
> > > says how much Tommy is part of this demographic, and V tells how much
> > > Filter is industrial metal.
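
To make the thread's points concrete, here is a minimal NumPy sketch. It is not from any of the messages above: the ratings matrix, the choice k = 2, and all variable names are made up for illustration, and it uses numpy.linalg.svd rather than Mahout's solver. Under those assumptions it factors a tiny user-by-movie matrix, applies the square root of S to both U and V as Ted describes, and compares users via rows of U and movies via rows of V.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies.
# Users 0-1 only rated movies 0-1; users 2-3 only rated movies 2-3.
A = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 5.0],
    [0.0, 0.0, 3.0, 3.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest latent factors (k = 2 is an arbitrary choice here).
k = 2
root_s = np.sqrt(s[:k])

# Split the weights in S evenly between the two sides.
user_factors = U[:, :k] * root_s       # one row per user
movie_factors = Vt[:k, :].T * root_s   # one row per movie (rows of V)

# Their product is the rank-k approximation of A.
approx = user_factors @ movie_factors.T


def distance(x, y):
    """Euclidean distance between two points in latent-factor space."""
    return float(np.linalg.norm(x - y))


print("singular values:", s[:k])
print("user 0 vs user 1:", distance(user_factors[0], user_factors[1]))
print("user 0 vs user 2:", distance(user_factors[0], user_factors[2]))
print("movie 0 vs movie 1:", distance(movie_factors[0], movie_factors[1]))
print("movie 0 vs movie 2:", distance(movie_factors[0], movie_factors[2]))
```

With this toy matrix, users who rated the same movies end up close together in factor space, and so do movies rated by the same users. Real rating data is noisier, so the factors will rarely line up this neatly with recognizable genres or demographics.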