Return-Path:
X-Original-To: apmail-hadoop-common-user-archive@www.apache.org
Delivered-To: apmail-hadoop-common-user-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 35FD0DAC2 for ; Mon, 15 Oct 2012 12:54:11 +0000 (UTC)
Received: (qmail 44179 invoked by uid 500); 15 Oct 2012 12:54:05 -0000
Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org
Received: (qmail 41825 invoked by uid 500); 15 Oct 2012 12:54:01 -0000
Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@hadoop.apache.org
Delivered-To: mailing list user@hadoop.apache.org
Received: (qmail 41800 invoked by uid 99); 15 Oct 2012 12:54:00 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Oct 2012 12:54:00 +0000
X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (athena.apache.org: domain of dechouxb@gmail.com designates 209.85.216.179 as permitted sender)
Received: from [209.85.216.179] (HELO mail-qc0-f179.google.com) (209.85.216.179) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Oct 2012 12:53:55 +0000
Received: by mail-qc0-f179.google.com with SMTP id b14so3912398qcs.38 for ; Mon, 15 Oct 2012 05:53:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=9BDc9KkWEzhX0JysHesOLqkCLksEPxkyMeDvfcMYpos=; b=Ezutx3g41lOZ4CxD+JQ7FZt9KNs9glVXruFG1UDF0iqTC9PN7L/jCZcLRNV0dUrliG 4cNJTqIGFgLs2Vz/X0ffzjjsreq7eMibh2LTL+i4g1zryM1UXbwXr+3oPxYnoPw3Ly7a e9TZBvRHmxYhbW897bZtQ+gkK8865bAm5k53vqU2VeRgJRwpkEsBNzAqrZqeZaHChT4u vPdXqSs5MT30Su332F18MMJ3lp2jwr5SHSjmHN+G9PhsnLWjaIX1bLqAi0ITCjLRigDD hRveT6CqmVhGfeSRKJnufN3IpdZnMtkbRRgL2jPLbGI6/diOxn6dKBDNtjVJHn5MQNeO RrhA==
MIME-Version: 1.0
Received: by 10.49.105.229 with SMTP id gp5mr26893072qeb.35.1350305614842; Mon, 15 Oct 2012 05:53:34 -0700 (PDT)
Received: by 10.49.71.231 with HTTP; Mon, 15 Oct 2012 05:53:34 -0700 (PDT)
In-Reply-To:
References:
Date: Mon, 15 Oct 2012 14:53:34 +0200
Message-ID:
Subject: Re: Logistic regression package on Hadoop
From: Bertrand Dechoux
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7b6da13482ca4d04cc188697
X-Virus-Checked: Checked by ClamAV on apache.org

--047d7b6da13482ca4d04cc188697
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hi Rajesh,

You may want to use the Mahout mailing list for Mahout-related questions.
http://mahout.apache.org/mailinglists.html

Regards

Bertrand

On Mon, Oct 15, 2012 at 2:34 PM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
Hi Harsh,

Thanks for the link to SGD in Mahout.

I have asked a question about an issue with using SGD. Below is a description of it.
Ted Dunning has mentioned there may be some issue with data encoding.

However, I am not able to pinpoint the issue. Could you please let me know whether the problem is in the data format or in the usage?

The input files used are attached.

I am using the Iris Plants Database from Michael Marshall. PFA iris.arff.
Converted this to a CSV file just by updating the header: iris-3-classes.csv
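
For illustration, a header-only conversion like the one described could look roughly as follows. This is a sketch under assumptions: `arff_to_csv` is a hypothetical helper, the sample text is made up to resemble an ARFF file rather than copied from iris.arff, and attribute names are assumed to contain no spaces or quotes.

```python
def arff_to_csv(arff_text):
    """Turn ARFF text into CSV text: attribute names become the header row."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append(line)  # data rows are already comma-separated
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True
    return "\n".join([",".join(names)] + rows) + "\n"

# Illustrative input only (not the attached file).
sample = """% illustrative sample
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor}
@DATA
5.1,Iris-setosa
"""
print(arff_to_csv(sample))
```

Note that this leaves the class column as text labels; whether the training tool expects that encoding is exactly the kind of thing worth checking.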

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-3-classes.model --target class --categories 3 --predictors sepallength sepalwidth petallength petalwidth --types n

>> It gave the following error.
Exception in thread "main" java.lang.IllegalArgumentException: Can only call classifyScalar with two categories

I then created a CSV with only 2 classes. PFA iris-2-classes.csv

>> trained iris-2-classes.csv with sgd

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2 --predictors sepallength sepalwidth petallength petalwidth --types n


mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
AUC = 0.14
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.6, -0.3], [-0.8, -0.4]]

>> AUC seems too poor. Now changed --predictors

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2 --predictors sepalwidth petallength --types n n

mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion --scores

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]

AUC has improved; however, from the confusion matrix it seems everything is classified as one class.

Below is the output.

"target","model-output","log-likelihood"
0,0.492,-0.677017
0,0.493,-0.679192
0,0.493,-0.678355
0,0.493,-0.678724
0,0.492,-0.676583
0,0.491,-0.675182
0,0.492,-0.677452
0,0.492,-0.677419
0,0.493,-0.679628
0,0.493,-0.678724
0,0.491,-0.676116
0,0.492,-0.677386
0,0.493,-0.679192
0,0.493,-0.679291
0,0.491,-0.674912
0,0.490,-0.673081
0,0.491,-0.675313
0,0.492,-0.677017
0,0.491,-0.675616
0,0.491,-0.675682
0,0.492,-0.677353
0,0.491,-0.676116
0,0.492,-0.676714
0,0.492,-0.677788
0,0.492,-0.677287
0,0.493,-0.679126
0,0.492,-0.677386
0,0.492,-0.676984
0,0.492,-0.677452
0,0.492,-0.678256
0,0.493,-0.678691
0,0.492,-0.677419
0,0.491,-0.674381
0,0.490,-0.673980
0,0.493,-0.678724
0,0.493,-0.678387
0,0.492,-0.677050
0,0.493,-0.678724
0,0.493,-0.679225
0,0.492,-0.677419
0,0.492,-0.677050
0,0.495,-0.682279
0,0.493,-0.678355
0,0.492,-0.676951
0,0.491,-0.675550
0,0.493,-0.679192
0,0.491,-0.675649
0,0.493,-0.678322
0,0.491,-0.676116
0,0.492,-0.677887
1,0.492,-0.709316
1,0.492,-0.709248
1,0.492,-0.708935
1,0.494,-0.705048
1,0.493,-0.707488
1,0.493,-0.707454
1,0.492,-0.709765
1,0.494,-0.705258
1,0.493,-0.707936
1,0.493,-0.706803
1,0.495,-0.703539
1,0.493,-0.708249
1,0.494,-0.704601
1,0.493,-0.707970
1,0.493,-0.707597
1,0.492,-0.708765
1,0.492,-0.708351
1,0.493,-0.706871
1,0.494,-0.704770
1,0.494,-0.705908
1,0.492,-0.709350
1,0.493,-0.707285
1,0.493,-0.706247
1,0.493,-0.707522
1,0.493,-0.707835
1,0.492,-0.708317
1,0.493,-0.707556
1,0.492,-0.708520
1,0.493,-0.707902
1,0.494,-0.706220
1,0.494,-0.705427
1,0.494,-0.705393
1,0.493,-0.706803
1,0.493,-0.707210
1,0.492,-0.708351
1,0.492,-0.710146
1,0.492,-0.708867
1,0.494,-0.705183
1,0.493,-0.708215
1,0.494,-0.705942
1,0.493,-0.706525
1,0.492,-0.708385
1,0.493,-0.706389
1,0.494,-0.704811
1,0.493,-0.706905
1,0.493,-0.708249
1,0.493,-0.707801
1,0.493,-0.707835
1,0.494,-0.705604
1,0.493,-0.707319

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]
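
One way to read these numbers: every model-output value lies below 0.5, so a 0.5 cutoff assigns all rows to class 0 even though class-1 rows tend to score slightly higher, which is what AUC measures. The sketch below is a minimal illustration of that reading, not Mahout's implementation; the function names, the 0.5 threshold, and the matrix[predicted][actual] layout are assumptions, and the eight rows are copied from the output above.

```python
# (target, model-output) pairs copied from the --scores output above.
rows = [
    (0, 0.492), (0, 0.493), (0, 0.491), (0, 0.490),
    (1, 0.492), (1, 0.494), (1, 0.493), (1, 0.495),
]

def confusion(rows, threshold=0.5):
    """Count rows as matrix[predicted][actual]; layout and threshold are assumed."""
    matrix = [[0, 0], [0, 0]]
    for target, score in rows:
        predicted = 1 if score >= threshold else 0
        matrix[predicted][target] += 1
    return matrix

def auc(rows):
    """Fraction of (class-1, class-0) pairs ranked correctly; ties count half."""
    pos = [s for t, s in rows if t == 1]
    neg = [s for t, s in rows if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(confusion(rows))  # only the "predicted 0" row is filled
print(auc(rows))
```

A high AUC alongside a one-sided confusion matrix is possible when the scores rank the classes reasonably but barely separate; whether that is what happens inside Mahout here is not confirmed by this thread.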

On Fri, Oct 12, 2012 at 10:51 PM, Ted Dunning <tdunning@maprtech.com> wrote:
Harsh,

Thanks for the plug. Rajesh has been talking to us.


On Fri, Oct 12, 2012 at 8:36 AM, Harsh J <harsh@cloudera.com> wrote:
Hi Rajesh,

Please head over to the Apache Mahout project. See
https://cwiki.apache.org/MAHOUT/logistic-regression.html
Apache Mahout is homed at http://mahout.apache.org and works well with
Hadoop MR, etc.

On Fri, Oct 12, 2012 at 6:36 PM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> Hi,
>
> Could you please suggest a logistic regression package that could be used on
> Hadoop?
> I have a large dataset and am looking for an LR package with kernel support.
>
> Thanks
> Rajesh
>
>



--
Harsh J





--
Bertrand Dechoux
--047d7b6da13482ca4d04cc188697--