Date: Fri, 29 Jun 2012 15:42:50 -0700
Subject: Re: LSI using Mahout ssvd - folding a new doc into the space
From: Dmitriy Lyubimov
To: user@mahout.apache.org

PS: Of course, folding in a considerable amount of new data is not
recommended, since when you fold in you are not learning any new
semantic space. You are only able to project new documents into the
previously learned semantic space and keep measuring similarities to
them in that space (which is sometimes good, if you want learning to
happen in a quite strict semantic space and want to consider all new
data just from the point of view of that space).

On Fri, Jun 29, 2012 at 3:39 PM, Dmitriy Lyubimov wrote:
> Yes. The fold-in formula is given in the link you mentioned, formulas
> (2) and (3), of which you probably need only one, depending on which
> way you are going. Usually you are folding in new documents (rows of
> U), so you need formula (2) to add new folded-in rows.
>
> Also, as the comment implies, your new observation vector for a
> document is very sparse (a document is unlikely to contain all the
> tokens you observed in the corpus), so the actual computation of (2)
> may be optimized quite a bit if V is indexed row-wise and specific
> rows of V (which are essentially dictionary vectors) can be yanked
> out very quickly.
>
> -d
>
> On Fri, Jun 29, 2012 at 3:13 PM, Chris Hokamp wrote:
>> Thanks for the quick response. So I will create a new diagonal matrix with
>> the reciprocals of the eigenvalues, and multiply by that.
>> I took a look at the slides (very nice presentation!), but it seems
>> that I won't even need to go this far, as I should be able to take
>> E^(-1) x U^(T) x docvector, and U is available from the output of
>> ssvd. I'm basing this assumption on pages 2/3 of [1].
>>
>> Thanks again for the help,
>> Chris
>>
>> [1]
>> https://cwiki.apache.org/MAHOUT/stochastic-singular-value-decomposition.data/SSVD-CLI.pdf
>>
>> On Fri, Jun 29, 2012 at 4:31 PM, Sean Owen wrote:
>>
>>> Well the inverse of a diagonal matrix like that is just going to be a
>>> diagonal matrix holding the reciprocals (1/x) of the values. That much
>>> is easy. But you need to invert more than that to fold in.
>>>
>>> I admit even I don't know the details of the Mahout implementation
>>> you're using, but I imagine the overall principle is the same as the
>>> fold-in described in ... oh wait, look at that, in a preso I posted a
>>> while ago: http://www.slideshare.net/srowen/matrix-factorization
>>> Look at the last few slides; I think it's kind of a useful / simple
>>> way to think of it.
>>>
>>> Sean
>>>
>>> On Fri, Jun 29, 2012 at 10:27 PM, Chris Hokamp wrote:
>>> > Hi all,
>>> >
>>> > I'm trying to implement Latent Semantic Indexing using the mahout
>>> > ssvd tool, and I'm having trouble understanding how I can use the
>>> > output of ssvd Mahout to 'fold' new queries (documents) into the
>>> > LSI space. Specifically, I can't find a way to multiply a vector
>>> > representing a query by the inverse of the matrix of singular
>>> > values - I can't find a way to solve for the inverse of the
>>> > diagonal matrix of singular values.
>>> >
>>> > I can generate the output matrices using ssvd, and compare
>>> > document/term vectors using cosine similarity, but I'm stumped
>>> > when it comes to folding a new document into the space.
>>> >
>>> > Any thoughts or guidance would be appreciated.
>>> >
>>> > Cheers,
>>> > Chris
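
For reference, here is a rough sketch of the document fold-in discussed
in this thread. It is an illustration only, not Mahout code. It assumes
the convention A ~ U * Sigma * V^T with documents as rows of A, so a new
document row d folds in as u = d * V * Sigma^(-1). The names fold_in and
cosine, the dict-based layout of V, and the toy numbers are all made up
for the example; in practice V and the singular values would be read
from the ssvd output. Because d is sparse, only the rows of V for the
terms that actually occur in the document are touched, and the "inverse"
of the diagonal matrix is just a division by each singular value.

```python
import math


def fold_in(doc, v_rows, sigma):
    """Project a sparse document into the LSI space: u = d * V * Sigma^-1.

    doc    : dict term_index -> weight (non-zero terms only)
    v_rows : dict term_index -> list of k floats (row of V for that term)
    sigma  : list of k singular values
    """
    k = len(sigma)
    acc = [0.0] * k
    for term, weight in doc.items():
        row = v_rows.get(term)
        if row is None:
            # Term was not in the training dictionary, so it has no row in V.
            continue
        for j in range(k):
            acc[j] += weight * row[j]
    # Multiplying by the inverse of a diagonal matrix is an element-wise
    # division by the singular values.
    return [acc[j] / sigma[j] for j in range(k)]


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy values for illustration only (not a real decomposition).
toy_v = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.6, 0.8]}
toy_sigma = [2.0, 1.0]
new_doc = {0: 2.0, 2: 1.0, 7: 5.0}  # term 7 is unknown and gets skipped

folded = fold_in(new_doc, toy_v, toy_sigma)
```

The folded vector can then be compared against existing rows of U with
cosine(), the same way document vectors are already being compared.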