hadoop-user mailing list archives

From Rajesh Nikam <rajeshni...@gmail.com>
Subject Re: Logistic regression package on Hadoop
Date Mon, 15 Oct 2012 12:34:04 GMT
Hi Harsh,

Thanks for the link to SGD in Mahout.

I have asked a question about a problem I am having with SGD; the
description is below. Ted Dunning mentioned there may be some issue with
the data encoding.

However, I am not able to pinpoint the issue. Could you please let me know
whether the problem is with the file format or with my usage?

The input files I used are attached.

I am using the Iris Plants Database from Michael Marshall. PFA iris.arff.
I converted it to a CSV file just by updating the header: iris-3-classes.csv
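
For readers without the attachments, here is a minimal Python sketch of that
kind of header conversion. It assumes a simple dense ARFF file whose attribute
names contain no spaces or quotes; the function name arff_to_csv and the
two-row sample are illustrative, not the attached files.

```python
import csv
import io

def arff_to_csv(arff_text: str) -> str:
    """Convert a simple dense ARFF document to CSV text with a header row."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])      # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True                     # everything after this is data
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()

# Illustrative two-row sample in the shape of the Iris ARFF file.
sample = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""
csv_text = arff_to_csv(sample)
print(csv_text)
```

The header row produced this way carries the same column names that are
passed to --predictors and --target in the commands that follow.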

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-3-classes.model --target class --categories 3
--predictors sepallength sepalwidth petallength petalwidth --types n

>> It gave the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Can only
call classifyScalar with two categories

I then created a CSV with only 2 classes. PFA iris-2-classes.csv
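
A minimal Python sketch of one way to produce such a two-class file is below.
Which two classes were kept is not stated here, so the pair in `keep` is an
assumption, and the function name keep_classes and the three-row sample are
illustrative only.

```python
import csv
import io

def keep_classes(csv_text, target="class",
                 keep=("Iris-setosa", "Iris-versicolor")):
    """Return CSV text containing only rows whose target column is in `keep`.

    The default `keep` pair is an assumption made for illustration.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[target] in keep:
            writer.writerow(row)
    return out.getvalue()

# Illustrative sample with one row per class.
sample = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
two_class_csv = keep_classes(sample)
print(two_class_csv)
```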

>> Trained on iris-2-classes.csv with sgd:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
--predictors sepallength sepalwidth petallength petalwidth --types n


mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion

AUC = 0.14
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.6, -0.3], [-0.8, -0.4]]

>> The AUC seems too poor. I then changed --predictors:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
--predictors sepalwidth petallength --types n n

mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
--scores

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]

The AUC has improved; however, from the confusion matrix it seems that
everything is classified as class a.
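
To illustrate how a reasonable AUC and a one-sided confusion matrix can occur
together, here is a small Python sketch. It assumes a 0.5 decision threshold
and a matrix indexed as [predicted][actual]; both are assumptions chosen to
match the shape of the output above, not confirmed Mahout behaviour. The six
scores are hand-picked illustrative values, not taken from the run.

```python
def confusion_at_threshold(targets, scores, threshold=0.5):
    """Return a 2x2 count matrix indexed as [predicted][actual] (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for t, s in zip(targets, scores):
        predicted = 1 if s >= threshold else 0
        matrix[predicted][t] += 1
    return matrix

def auc(targets, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores: all below 0.5, but positives tend to score higher.
targets = [0, 0, 0, 1, 1, 1]
scores = [0.490, 0.491, 0.493, 0.492, 0.494, 0.495]

matrix = confusion_at_threshold(targets, scores)
area = auc(targets, scores)
print(matrix, round(area, 3))
```

AUC depends only on how the scores are ordered, while the confusion matrix
depends on where they fall relative to the threshold, so scores that all sit
just under 0.5 can rank the classes fairly well and still put every row in
one predicted class.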

Below is the output.

"target","model-output","log-likelihood"
0,0.492,-0.677017
0,0.493,-0.679192
0,0.493,-0.678355
0,0.493,-0.678724
0,0.492,-0.676583
0,0.491,-0.675182
0,0.492,-0.677452
0,0.492,-0.677419
0,0.493,-0.679628
0,0.493,-0.678724
0,0.491,-0.676116
0,0.492,-0.677386
0,0.493,-0.679192
0,0.493,-0.679291
0,0.491,-0.674912
0,0.490,-0.673081
0,0.491,-0.675313
0,0.492,-0.677017
0,0.491,-0.675616
0,0.491,-0.675682
0,0.492,-0.677353
0,0.491,-0.676116
0,0.492,-0.676714
0,0.492,-0.677788
0,0.492,-0.677287
0,0.493,-0.679126
0,0.492,-0.677386
0,0.492,-0.676984
0,0.492,-0.677452
0,0.492,-0.678256
0,0.493,-0.678691
0,0.492,-0.677419
0,0.491,-0.674381
0,0.490,-0.673980
0,0.493,-0.678724
0,0.493,-0.678387
0,0.492,-0.677050
0,0.493,-0.678724
0,0.493,-0.679225
0,0.492,-0.677419
0,0.492,-0.677050
0,0.495,-0.682279
0,0.493,-0.678355
0,0.492,-0.676951
0,0.491,-0.675550
0,0.493,-0.679192
0,0.491,-0.675649
0,0.493,-0.678322
0,0.491,-0.676116
0,0.492,-0.677887
1,0.492,-0.709316
1,0.492,-0.709248
1,0.492,-0.708935
1,0.494,-0.705048
1,0.493,-0.707488
1,0.493,-0.707454
1,0.492,-0.709765
1,0.494,-0.705258
1,0.493,-0.707936
1,0.493,-0.706803
1,0.495,-0.703539
1,0.493,-0.708249
1,0.494,-0.704601
1,0.493,-0.707970
1,0.493,-0.707597
1,0.492,-0.708765
1,0.492,-0.708351
1,0.493,-0.706871
1,0.494,-0.704770
1,0.494,-0.705908
1,0.492,-0.709350
1,0.493,-0.707285
1,0.493,-0.706247
1,0.493,-0.707522
1,0.493,-0.707835
1,0.492,-0.708317
1,0.493,-0.707556
1,0.492,-0.708520
1,0.493,-0.707902
1,0.494,-0.706220
1,0.494,-0.705427
1,0.494,-0.705393
1,0.493,-0.706803
1,0.493,-0.707210
1,0.492,-0.708351
1,0.492,-0.710146
1,0.492,-0.708867
1,0.494,-0.705183
1,0.493,-0.708215
1,0.494,-0.705942
1,0.493,-0.706525
1,0.492,-0.708385
1,0.493,-0.706389
1,0.494,-0.704811
1,0.493,-0.706905
1,0.493,-0.708249
1,0.493,-0.707801
1,0.493,-0.707835
1,0.494,-0.705604
1,0.493,-0.707319

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]
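
As a sanity check on the columns above, the "log-likelihood" values appear
consistent with log(p) for target 1 and log(1 - p) for target 0, where p is
the "model-output" column. The Python sketch below checks this against two
printed rows; the interpretation is inferred from the numbers rather than from
the Mahout source, and the printed scores are rounded to three decimals, so
only approximate agreement is expected.

```python
import math

def log_likelihood(target, p):
    """Log-likelihood of one binary label under predicted probability p."""
    return math.log(p) if target == 1 else math.log(1.0 - p)

# Rows taken from the output above: (target, model-output, printed value).
ll_target0 = log_likelihood(0, 0.492)   # printed as -0.677017
ll_target1 = log_likelihood(1, 0.492)   # printed as -0.709316
print(round(ll_target0, 4), round(ll_target1, 4))
```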


On Fri, Oct 12, 2012 at 10:51 PM, Ted Dunning <tdunning@maprtech.com> wrote:

> Harsh,
>
> Thanks for the plug.  Rajesh has been talking to us.
>
>
> On Fri, Oct 12, 2012 at 8:36 AM, Harsh J <harsh@cloudera.com> wrote:
>
>> Hi Rajesh,
>>
>> Please head over to the Apache Mahout project. See
>> https://cwiki.apache.org/MAHOUT/logistic-regression.html
>>
>> Apache Mahout is homed at http://mahout.apache.org and works well with
>> Hadoop MR, etc.
>>
>> On Fri, Oct 12, 2012 at 6:36 PM, Rajesh Nikam <rajeshnikam@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Could you please suggest Logistic regression package that could be used
>> on
>> > Hadoop ?
>> > I have large data and looking for LR package with kernel supports.
>> >
>> > Thanks
>> > Rajesh
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>
