Thanks for looking into it.
Actually, I first tried it with a large data set. Below is the model info for that run.
AUC = 0.50
confusion: [[1252978.0, 23003.0], [0.0, 0.0]]
entropy: [[0.0, 0.0], [46.1, 0.8]]
Looking forward to your comments.
Thanks
Rajesh
On Wed, Oct 10, 2012 at 8:08 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> SGD is better suited to large data. I will take a look later today.
>
> On Oct 9, 2012, at 11:29 PM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
>
> > Hi Ted,
> >
> > Here is a specific question, with data, about a problem I am having with SGD.
> >
> > I am using the Iris Plants Database from Michael Marshall. Please find iris.arff attached.
> >
> > I converted it to a CSV file just by updating the header: iris3classes.csv
> >
> > mahout org.apache.mahout.classifier.sgd.TrainLogistic input /usr/local/mahout/trunk/iris3classes.csv features 4 output /usr/local/mahout/trunk/iris3classes.model target class categories 3 predictors sepallength sepalwidth petallength petalwidth types n n
> >
> > >> It gave the following error:
> > Exception in thread "main" java.lang.IllegalArgumentException: Can only call classifyScalar with two categories
> >
> > I then created a CSV with only 2 classes. Please find iris2classes.csv attached.
> >
> > >> Trained on iris2classes.csv with SGD:
> >
> > mahout org.apache.mahout.classifier.sgd.TrainLogistic input /usr/local/mahout/trunk/iris2classes.csv features 4 output /usr/local/mahout/trunk/iris2classes.model target class categories 2 predictors sepallength sepalwidth petallength petalwidth types n n
> >
> >
> > mahout runlogistic input /usr/local/mahout/trunk/iris2classes.csv model /usr/local/mahout/trunk/iris2classes.model auc confusion
> >
> > AUC = 0.14
> > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > entropy: [[0.6, 0.3], [0.8, 0.4]]
> >
> > >> AUC seems too poor. I then changed the predictors:
> >
> > mahout org.apache.mahout.classifier.sgd.TrainLogistic input /usr/local/mahout/trunk/iris2classes.csv features 4 output /usr/local/mahout/trunk/iris2classes.model target class categories 2 predictors sepalwidth petallength types n n
> >
> > mahout runlogistic input /usr/local/mahout/trunk/iris2classes.csv model /usr/local/mahout/trunk/iris2classes.model auc confusion scores
> >
> > AUC = 0.80
> > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > entropy: [[0.7, 0.3], [0.7, 0.4]]
> >
> > AUC has improved; however, from the confusion matrix it seems everything is classified as class a.
> >
> > Below is the output.
> >
> > "target","modeloutput","loglikelihood"
> > 0,0.492,0.677017
> > 0,0.493,0.679192
> > 0,0.493,0.678355
> > 0,0.493,0.678724
> > 0,0.492,0.676583
> > 0,0.491,0.675182
> > 0,0.492,0.677452
> > 0,0.492,0.677419
> > 0,0.493,0.679628
> > 0,0.493,0.678724
> > 0,0.491,0.676116
> > 0,0.492,0.677386
> > 0,0.493,0.679192
> > 0,0.493,0.679291
> > 0,0.491,0.674912
> > 0,0.490,0.673081
> > 0,0.491,0.675313
> > 0,0.492,0.677017
> > 0,0.491,0.675616
> > 0,0.491,0.675682
> > 0,0.492,0.677353
> > 0,0.491,0.676116
> > 0,0.492,0.676714
> > 0,0.492,0.677788
> > 0,0.492,0.677287
> > 0,0.493,0.679126
> > 0,0.492,0.677386
> > 0,0.492,0.676984
> > 0,0.492,0.677452
> > 0,0.492,0.678256
> > 0,0.493,0.678691
> > 0,0.492,0.677419
> > 0,0.491,0.674381
> > 0,0.490,0.673980
> > 0,0.493,0.678724
> > 0,0.493,0.678387
> > 0,0.492,0.677050
> > 0,0.493,0.678724
> > 0,0.493,0.679225
> > 0,0.492,0.677419
> > 0,0.492,0.677050
> > 0,0.495,0.682279
> > 0,0.493,0.678355
> > 0,0.492,0.676951
> > 0,0.491,0.675550
> > 0,0.493,0.679192
> > 0,0.491,0.675649
> > 0,0.493,0.678322
> > 0,0.491,0.676116
> > 0,0.492,0.677887
> > 1,0.492,0.709316
> > 1,0.492,0.709248
> > 1,0.492,0.708935
> > 1,0.494,0.705048
> > 1,0.493,0.707488
> > 1,0.493,0.707454
> > 1,0.492,0.709765
> > 1,0.494,0.705258
> > 1,0.493,0.707936
> > 1,0.493,0.706803
> > 1,0.495,0.703539
> > 1,0.493,0.708249
> > 1,0.494,0.704601
> > 1,0.493,0.707970
> > 1,0.493,0.707597
> > 1,0.492,0.708765
> > 1,0.492,0.708351
> > 1,0.493,0.706871
> > 1,0.494,0.704770
> > 1,0.494,0.705908
> > 1,0.492,0.709350
> > 1,0.493,0.707285
> > 1,0.493,0.706247
> > 1,0.493,0.707522
> > 1,0.493,0.707835
> > 1,0.492,0.708317
> > 1,0.493,0.707556
> > 1,0.492,0.708520
> > 1,0.493,0.707902
> > 1,0.494,0.706220
> > 1,0.494,0.705427
> > 1,0.494,0.705393
> > 1,0.493,0.706803
> > 1,0.493,0.707210
> > 1,0.492,0.708351
> > 1,0.492,0.710146
> > 1,0.492,0.708867
> > 1,0.494,0.705183
> > 1,0.493,0.708215
> > 1,0.494,0.705942
> > 1,0.493,0.706525
> > 1,0.492,0.708385
> > 1,0.493,0.706389
> > 1,0.494,0.704811
> > 1,0.493,0.706905
> > 1,0.493,0.708249
> > 1,0.493,0.707801
> > 1,0.493,0.707835
> > 1,0.494,0.705604
> > 1,0.493,0.707319
> >
> > AUC = 0.80
> > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > entropy: [[0.7, 0.3], [0.7, 0.4]]
> >
> > What kind of data is SGD suitable for?
> >
> > Thanks,
> > Rajesh
> >
> >
> > <iris2classes.csv>
> > <iris3classes.csv>
>
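A note on reading the quoted numbers: an AUC of 0.80 alongside a confusion matrix of [[50.0, 50.0], [0.0, 0.0]] is not a contradiction. AUC depends only on how the scores are ordered, while the confusion matrix depends on where the scores fall relative to a fixed cutoff. The sketch below is a minimal illustration of that, not a reproduction of Mahout's code. The function names are hypothetical, the 0.5 cutoff and the matrix layout (rows = predicted class, columns = actual class) are assumptions inferred from the quoted output, and the eight rows are a small sample of (target, modeloutput) pairs taken from the scores listing above.

```python
def confusion_at_threshold(rows, threshold=0.5):
    """Count rows as matrix[predicted][actual].

    Assumption: this layout matches what the quoted output appears to use.
    """
    matrix = [[0, 0], [0, 0]]
    for target, score in rows:
        predicted = 1 if score > threshold else 0
        matrix[predicted][target] += 1
    return matrix


def pairwise_auc(rows):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [score for target, score in rows if target == 1]
    negatives = [score for target, score in rows if target == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# A few (target, modeloutput) pairs sampled from the quoted scores listing.
sample = [
    (0, 0.492), (0, 0.493), (0, 0.491), (0, 0.490),
    (1, 0.492), (1, 0.494), (1, 0.493), (1, 0.495),
]

print(confusion_at_threshold(sample))          # [[4, 4], [0, 0]]
print(pairwise_auc(sample))                    # 0.875
print(confusion_at_threshold(sample, 0.4925))  # [[3, 1], [1, 3]]
```

Every score in the sample is below 0.5, so at that cutoff all rows land in predicted class 0, yet the positives still tend to score slightly higher than the negatives, which is what AUC measures. Scores packed this tightly around 0.49 suggest the model has barely moved from its starting point, so training settings (for example the number of passes or the learning rate) may be worth checking before concluding anything about the data.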
