From: Dmitriy Lyubimov
To: user@mahout.apache.org
Cc: Suneel Marthi
Date: Mon, 10 Mar 2014 07:48:17 -0700
Subject: Re: PCA to improve classification performances

PCA and SSVD propagate the exact row keys given in the input. If you give
them text keys, U and Usigma will have text keys. That doesn't change.

On Mar 10, 2014 3:39 AM, "Kevin Moulart" wrote:

> Hi and thanks, I'll try that, but I'd like to do so using a MapReduce job
> to improve performance.
>
> I'm using PCA as a way to reduce the dimension of the dataset, both to
> improve its relevance (with 1600+ variables, many of them are correlated)
> and to improve the performance of the classification algorithm used.
>
> Kévin Moulart
>
>
> 2014-03-10 9:45 GMT+01:00 Suneel Marthi :
>
> > On Monday, March 10, 2014 4:21 AM, Kevin Moulart <kevinmoulart@gmail.com>
> > wrote:
> >
> > It's not clear to me from ur description as to the exact sequence of steps
> > u r running thru, but an SSVD job requires a matrix as input (not a
> > sequencefile of .
> > When u try running a seqdumper on ur SSVD output do u see anything?
> >
> >
> > I see a Sequence File Text/VectorWritable with my original keys, and 99
> > values for each element in my original dataset.
> >
> > The next step after u create ur sequencefiles of Vectors would be to run
> > the rowId job to generate a matrix and docIndex.
> >
> > This matrix needs to be the input to SSVD (for dimensional reduction),
> >
> >
> > OK, so I tried that and indeed the SSVD accepts the matrix as input and
> > gives me a Sequence File IntWritable/VectorWritable.
> >
> >
> > followed by train Naive Bayes and test Naive Bayes.
> >
> >
> > Here it doesn't work anymore: the NB wants a Sequence File
> > Text/VectorWritable, and it won't take the one created above.
> > Is there a counterpart to rowId that takes a matrix and docIndex and
> > outputs the SequenceFile?
> >
> > >> Hmm... not that I know of. You are gonna have to write a utility that
> > reads docIndex and as inputs.
> >   a) Create a dictionary of documentId, documentName from docIndex
> >   b)
> >      (i) Read the Pair from the sequencefile,
> >      (ii) for each pair, read the key and value
> >      {
> >          replace each key with the corresponding DocumentName
> >          from dictionary in (a)
> >          SequenceFile.Writer.write(Text, VectorWritable)
> >      }
> >
> > Question: I might have missed it, but what's the reason again u r
> > calling PCA before running TrainNaiveBayes?
> >
> > If others have better ideas, please feel free to comment.
> >
> >
> > Kévin Moulart
> >
> >
> > 2014-03-07 16:23 GMT+01:00 Suneel Marthi :
> >
> > It's not clear to me from ur description as to the exact sequence of steps
> > u r running thru, but an SSVD job requires a matrix as input (not a
> > sequencefile of .
> >
> > When u try running a seqdumper on ur SSVD output do u see anything?
> >
> > The next step after u create ur sequencefiles of Vectors would be to run
> > the rowId job to generate a matrix and docIndex.
> >
> > This matrix needs to be the input to SSVD (for dimensional reduction),
> > followed by train Naive Bayes and test Naive Bayes.
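
The key-restoring utility described in steps (a) and (b) above can be sketched roughly as follows. This is only an illustration of the remapping logic, not a Mahout or Hadoop program: plain in-memory lists stand in for the docIndex and SSVD SequenceFiles, and the names `build_dictionary`, `restore_keys`, `doc_index` and `reduced` are hypothetical. A real utility would read and write the SequenceFiles with the Hadoop API instead.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: each SSVD output row is modelled as (integer row id, vector),
# and each docIndex entry as (integer row id, original text key).
Row = Tuple[int, List[float]]
NamedRow = Tuple[str, List[float]]


def build_dictionary(doc_index: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Step (a): map every integer row id to its original document name."""
    return dict(doc_index)


def restore_keys(rows: Iterable[Row], names: Dict[int, str]) -> Iterator[NamedRow]:
    """Step (b): replace each integer key with the name recorded in docIndex."""
    for row_id, vector in rows:
        if row_id not in names:
            # Fail loudly rather than emit a row with an unknown label.
            raise KeyError(f"row id {row_id} is missing from docIndex")
        yield names[row_id], vector


# Toy data: keys follow the "class/class" form used in this thread.
doc_index = [(0, "1/1"), (1, "0/0"), (2, "1/1")]
reduced = [(0, [0.12, -0.40]), (1, [-0.33, 0.08]), (2, [0.05, 0.21])]

restored = list(restore_keys(reduced, build_dictionary(doc_index)))
for key, vector in restored:
    print(key, vector)
```

The vectors themselves are passed through untouched; only the key type changes from integer to text, which is the shape the Naive Bayes step is said to require.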
> >
> >
> > On Friday, March 7, 2014 10:01 AM, Kevin Moulart wrote:
> >
> > Hi again,
> >
> > I'm now using Mahout 0.9, and I'm trying to use PCA (via the SSVD) to
> > reduce the dimension of a dataset from 1600+ features to ~100 and then to
> > use the reduced dataset to train a naive Bayes model and test it.
> >
> > So here is my workflow:
> >
> > - Transform my CSV into a SequenceFile using a custom MapReduce job, with
> >
> >   key = class as Text (with a "/" in it to be accepted by NaiveBayes, so in
> >   the form "class/class")
> >
> >   value = features as VectorWritable
> >
> > - Use mahout command line to reduce the dimension of the dataset:
> >
> >   mahout ssvd -i /user/myCompny/Echant/echant100k.seq -o
> >   /user/myCompany/Echant/echant100k_red.seq --rank 100 -us -V false -U true
> >   -pca -ow -t 3
> >
> >   ==> Here I get, if I understand things correctly, U, being the reduced
> >   dataset.
> >
> > - Use mahout command line to train the NaiveBayes model:
> >
> >   mahout trainnb -i /user/myCompany/Echant/echant100k_red.seq/U -o
> >   /user/myCompany/Echant/echant100k_red.model -l 0,1
> >   -li /user/myCompany/Echant/labelIndex100k_red -ow
> >
> > - Use mahout command line to test the generated model:
> >
> >   mahout testnb
> >   -i /user/myCompany/Echant/echant100k_red.seq/U --model
> >   /user/myCompany/Echant/echant100k_red.model -ow
> >   -o /user/myCompany/Echant/predicted_echant100k --labelIndex
> >   /user/myCompany/Echant/labelIndex100k_red
> >
> > (Here I test with the same dataset, but I should try with other datasets
> > as well once it runs smoothly.)
> >
> > Here is my problem: everything seems to work quite well until I test my
> > model. The output is full of NaN:
> >
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 0: Value: {0:NaN,1:NaN}
> > Key: 1: Value: {0:NaN,1:NaN}
> >
> > I already have the same problem when training and testing with the full
> > dataset, but there about 15% of the data still has values in the output
> > and gets predicted, the rest being NaN and unpredicted.
> >
> > Could you help me see what could be causing that?
> >
> > Thanks in advance.
> > Best,
> >
> > Kévin Moulart
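
The thread does not establish why testnb prints NaN. One check that can be made independently of Mahout is to scan the vectors fed to the classifier for NaN or negative entries. PCA projections are mean-centred and so normally contain negative values, while Naive Bayes variants that take logarithms of summed feature weights expect non-negative input. Whether that is what happens with trainnb here is an assumption, not something confirmed in this thread. The sketch below uses plain in-memory rows, and the names `scan_rows` and `sample` are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple


def scan_rows(rows: Iterable[Tuple[str, List[float]]]) -> Dict[str, int]:
    """Count rows that contain NaN entries and rows that contain negative entries."""
    counts = {"rows": 0, "rows_with_nan": 0, "rows_with_negative": 0}
    for _key, vector in rows:
        counts["rows"] += 1
        if any(math.isnan(v) for v in vector):
            counts["rows_with_nan"] += 1
        # NaN compares False against 0, so it is not counted as negative.
        if any(v < 0 for v in vector):
            counts["rows_with_negative"] += 1
    return counts


# Toy rows in the (text key, vector) shape discussed in the thread.
sample = [
    ("1/1", [0.12, -0.40]),
    ("0/0", [float("nan"), 0.08]),
    ("1/1", [0.05, 0.21]),
]

report = scan_rows(sample)
print(report)
```

If such a scan reported negative entries in the reduced dataset but not in the original one, that would at least narrow down where to look next; it would not by itself explain the 15% figure seen on the full dataset.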