Return-Path: X-Original-To: apmail-mahout-dev-archive@www.apache.org Delivered-To: apmail-mahout-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E869D7A71 for ; Mon, 5 Dec 2011 22:05:58 +0000 (UTC) Received: (qmail 59745 invoked by uid 500); 5 Dec 2011 22:05:58 -0000 Delivered-To: apmail-mahout-dev-archive@mahout.apache.org Received: (qmail 59704 invoked by uid 500); 5 Dec 2011 22:05:58 -0000 Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list dev@mahout.apache.org Received: (qmail 59696 invoked by uid 99); 5 Dec 2011 22:05:58 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Dec 2011 22:05:58 +0000 X-ASF-Spam-Status: No, hits=-0.6 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of dlieu.7@gmail.com designates 209.85.220.170 as permitted sender) Received: from [209.85.220.170] (HELO mail-vx0-f170.google.com) (209.85.220.170) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 05 Dec 2011 22:05:50 +0000 Received: by vcbgb30 with SMTP id gb30so9319983vcb.1 for ; Mon, 05 Dec 2011 14:05:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; bh=K+1SeHd3D33JPJin9ab5JKtKiTllkpbdbvu8/QiusMI=; b=NpWmProYET/jZ0aVeaIw6Yvomj07fC+5sOZwfdJ3xskpkzVsxUsaRLa5yesExa3FYt Zpoh4pNFe/Lewo97GGMHgsQslJBJuCJXt/22VD+X9B+BnrMeb7CUa62NFoyncgC3LkeP d/7IaykF4T7VOXpOTtNFO/NSGXZw7EkKBk6ug= MIME-Version: 1.0 Received: by 10.220.150.212 with SMTP id z20mr1314903vcv.58.1323122729216; Mon, 05 Dec 2011 14:05:29 -0800 (PST) Received: by 10.52.100.228 with HTTP; Mon, 5 Dec 2011 14:05:29 -0800 (PST) In-Reply-To: <812FA692-53B6-4669-9EE2-7F9C648BC7FB@gmail.com> References: <812FA692-53B6-4669-9EE2-7F9C648BC7FB@gmail.com> Date: Mon, 5 Dec 2011 14:05:29 -0800 Message-ID: Subject: Re: When is PCA expected to be fully implemented into Mahout? From: Dmitriy Lyubimov To: dev@mahout.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org what i usually do for small products is to accumulate them in the map without combiners and then send them to reducers to combine there. Using combiners creates local sort for something you don't really want to do unless you can't keep it in memory . you also have a choice to force one reducer to ensure one product. Or (what i do) build a loader that just goes all vectors and reduce them on the fly whenever it is needed. MR is fundamentally too often not so cool when it comes to linear algebra operations. This is a major deficiency as far as i see it. Because it forces to sort output which already have inner structure that one can capitalize on. Simple example: suppose you want to deinterlace a TV signal. So you have two fields, one for odd lines, and another for even lines. So naturally, what you do is just create a simple mering algorithm that reads lines from each of inputs intermittently and outputs one single deinterlaced frame. O(n). now, mapreduce way to deal with it is to shuffle and sort all lines by their line No and then output total order (O(n*logN) so dealing with large blocks is like deinterlacing. you sort it but you absolutely don't have to because you already have structure in the output (even if blocks are not fitting in the memory). so hence. On Mon, Dec 5, 2011 at 1:02 PM, Raphael Cendrillon wrote: > Thanks for clarifying. I think we're all on the same page on this, althou= gh using different terms. I'll package up the job I currently have for this= and submit a patch. > > By the way, currently I have the rows being added at the combiner, and th= en the results of the combiners added in a single reducer. Do you think thi= s is sufficient, or should multiple reducers be used (per column) to furthe= r spread the load? > > On Dec 5, 2011, at 11:38 AM, Dmitriy Lyubimov wrote: > >> ok column-wise mean. (the mean of all rows). >> >> On Mon, Dec 5, 2011 at 11:00 AM, Ted Dunning wro= te: >>> Row-wise mean usually means that a mean of each row is computed. >>> >>> I think that most PCA users would want column-wise means for subtractio= n. >>> >>> On Mon, Dec 5, 2011 at 10:58 AM, Dmitriy Lyubimov w= rote: >>> >>>> We probably need =A0row wise mean computation job anyway as a separate= mr >>>> step. Wanna take a stab? >>>> On Dec 5, 2011 10:34 AM, "Raphael Cendrillon" >>>> wrote: >>>> >>>>> Given that this request seems to come up frequently, would it be wort= h >>>>> putting this approach under mahout-examples? =A0Initially it could us= e the >>>>> brute force approach together with SSVD, and updated later once suppo= rt >>>> is >>>>> ready for mean-subtraction within SSVD. >>>>> >>>>> I could put something together if there's interest. >>>>> >>>>> On Mon, Dec 5, 2011 at 9:40 AM, Dmitriy Lyubimov >>>>> wrote: >>>>> >>>>>> I am working on the addtions to ssvd algorithms and the mods to curr= ent >>>>>> solver will probably emerge in a matter of a month, my schedule >>>>> permitting. >>>>>> >>>>>> However, a brute force approach is already possible. If your input i= s >>>> of >>>>>> moderate size, or if it is already dense, you could compute median a= nd >>>>>> substract it yourself very easily and then shove it into ssvd solver >>>>> while >>>>>> requesting to produce either u or v depending if subtract column wis= e >>>> or >>>>>> row wise mean. >>>>>> >>>>>> The only problem with brute force approach is that it would densify >>>>>> originally sparse input. Depending on your problem and # of machine >>>> nodes >>>>>> you can spare, it may or may not be a problem. >>>>>> On Dec 4, 2011 7:59 PM, "magicalo" wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> Is there an expected release date for the PCA algorithm as part of >>>>>> Mahout? >>>>>>> Tx! >>>>>>> >>>>>>> >>>>>> >>>>> >>>>