From: Dmitriy Lyubimov
To: user@mahout.apache.org
Date: Fri, 10 Aug 2012 11:07:28 -0700
Subject: Re: SSVD for dimensional reduction + Kmeans

It happens because of internal constraints stemming from blocking. It happens when a split of A (the input) has fewer than (k+p) rows, at which point the blocks are too small (or rather, too short) to successfully perform a QR on. This also means, among other things, that k+p cannot be more than the total number of rows in the input.

It is also possible that input A is way too wide, or k+p is way too big, so that an arbitrary split does not fetch at least k+p rows of A, but I haven't seen such cases in practice yet. If that happens, there's an option to increase minSplitSize (which would undermine MR mapper efficiency somewhat). But I am pretty sure it is not your case.

But if your input is shorter than k+p, then it is a case too small for SSVD. In fact, it probably means you can solve the test directly in memory with any solver. You can still use SSVD with k=m and p=0 (I think) in this case and get an exact (non-reduced rank) decomposition equivalent with no stochastic effects, but that is not really what it is for.

Assuming your input is m x n, can you tell me please what your m, n, k and p are?

Thanks.
-D

On Fri, Aug 10, 2012 at 9:21 AM, Pat Ferrel wrote:
> There seems to be some internal constraint on k and/or p, which is making a test difficult.
> The test has a very small input doc set and choosing the wrong k it is very easy to get a failure with this message:
>
> java.lang.IllegalArgumentException: new m can't be less than n
>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>
> I have a working test but I had to add some docs to the test data and have tried to reverse engineer the value for k (desiredRank). I came up with the following but I think it is only an accident that it works.
>
>     int p = 15; // default value for CLI
>     int desiredRank = sampleData.size() - p - 1; // number of docs - p - 1, ?????? not sure why this works
>
> This seems likely to be an issue only because of the very small data set and the relationship of rows to columns to p to k. But for the purposes of creating a test if someone (Dmitriy?) could tell me how to calculate a reasonable p and k from the dimensions of the tiny data set it would help.
>
> This test is derived from a non-active SVD test but I'd be up for cleaning it up and including it as an example in the working but non-active tests. I also fixed a couple trivial bugs in the non-active Lanczos tests for what it's worth.
>
>
> On Aug 9, 2012, at 4:47 PM, Dmitriy Lyubimov wrote:
>
> Reading "overview and usage" doc linked on that page
> https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
> should help to clarify outputs and usage.
>
>
> On Thu, Aug 9, 2012 at 4:44 PM, Dmitriy Lyubimov wrote:
>> On Thu, Aug 9, 2012 at 4:34 PM, Pat Ferrel wrote:
>>> Quoth Grant Ingersoll:
>>>> To put this in bin/mahout speak, this would look like, munging some names and taking liberties with the actual argument to be passed in:
>>>>
>>>> bin/mahout svd (original -> svdOut)
>>>> bin/mahout cleansvd ...
>>>> bin/mahout transpose svdOut -> svdT
>>>> bin/mahout transpose original -> originalT
>>>> bin/mahout matrixmult originalT svdT -> newMatrix
>>>> bin/mahout kmeans newMatrix
>>>
>>> I'm trying to create a test case from testKmeansDSVD2 to use SSVDSolver. Does SSVD require the EigenVerificationJob to clean the eigen vectors?
>>
>> No
>>
>>> if so where does SSVD put the equivalent of DistributedLanczosSolver.RAW_EIGENVECTORS? Seems like they should be in V* but SSVD creates V so should I transpose V* then run it through the EigenVerificationJob?
>> no
>>
>> SSVD is SVD, meaning it produces U and V with no further need to clean that
>>
>>> I get errors when I do so trying to figure out if I'm on the wrong track.
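
The row constraint described in the reply at the top of this thread (every split of A needs at least k+p rows, so k+p cannot exceed the total row count m) can be sketched as a small helper for picking k in a tiny test. This is only an illustration under stated assumptions: SsvdTestParams, fits and maxRank are hypothetical names and not Mahout APIs, the whole test input is assumed to land in a single split, and the k+p <= rows boundary is taken from the wording of the message rather than checked against the Mahout source. It does not model any constraint on the column count n.

```java
// Hypothetical helper, NOT part of Mahout: checks the "k + p <= rows in a split"
// rule described in the thread and derives the largest usable k for a given p.
final class SsvdTestParams {

    private SsvdTestParams() {
    }

    /** True when a split holding rowsInSplit rows has at least k + p rows. */
    static boolean fits(int rowsInSplit, int k, int p) {
        return k >= 1 && p >= 0 && k + p <= rowsInSplit;
    }

    /** Largest k that satisfies k + p <= rowsInSplit for the given p. */
    static int maxRank(int rowsInSplit, int p) {
        if (p < 0) {
            throw new IllegalArgumentException("p must be non-negative");
        }
        int k = rowsInSplit - p;
        if (k < 1) {
            // Matches the "input shorter than k+p" case from the thread:
            // the input is too small for this p.
            throw new IllegalArgumentException(
                "input too short: " + rowsInSplit + " rows leave no room for k with p = " + p);
        }
        return k;
    }

    public static void main(String[] args) {
        int rows = 20; // assumed: number of docs in the tiny test set, all in one split
        int p = 15;    // the CLI default quoted in the thread
        int k = maxRank(rows, p);
        // prints "k=5, p=15, fits=true"
        System.out.println("k=" + k + ", p=" + p + ", fits=" + fits(rows, k, p));
    }
}
```

Under these assumptions, the formula quoted in the thread (sampleData.size() - p - 1) gives k+p = m-1, which stays inside the bound with one row to spare. That would be consistent with it working, though the thread itself does not confirm that this is the reason.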