mahout-user mailing list archives

From Rajesh Nikam <rajeshni...@gmail.com>
Subject Re: SGD: Logistic regression package in Mahout
Date Wed, 31 Oct 2012 12:14:37 GMT
Hi Ted,

Please post an update once the JIRA and test case are uploaded.

Looking forward to your reply.

Thanks
Rajesh

On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:

> Hi Ted,
>
> Thanks for the reply. I will wait for the JIRA and hope to get rid of any
> encoding issue.
>
> Thanks,
> Rajesh
> On Oct 31, 2012 5:24 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>
>> OK.  I am back up for air.
>>
>> Rajesh,
>>
>> As I am sure you know, most folks here contribute on their own time.  I
>> have been busy with my day job and unable to help with this until just
>> now.
>>
>> I just wrote a test case that looks at the Iris data set.  The results are
>> categorically different from yours.
>>
>> That substantiates my original feeling that your encoding of the data is
>> problematic.  I will file a JIRA and attach a test case that you can look
>> at.  Then we can see what the differences are.
>>
>>
>> On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <rajeshnikam@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Is there any development happening on fixing the issue with SGD generating
>> > models that are no better than random prediction?
>> >
>> > I am not sure why such an issue has not been noticed and raised by others.
>> > Maybe this specific algorithm is not used in practical applications.
>> >
>> > Thanks,
>> > Rajesh
>> >
>> >
>> > >>
>> > >> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>> > >>
>> > >>> Rajesh,
>> > >>>
>> > >>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
>> > >>> data.  All produced identical results.  Thus it isn't an issue of SGD
>> > >>> converging.
>> > >>>
>> > >>> I also did a parameter scan of lambda and saw no effect.
>> > >>>
>> > >>> I also did the standard thing in R with glm and got the expected
>> > >>> (correct) results.
>> > >>>
>> > >>> I haven't looked yet in detail, but I really suspect that the reading of
>> > >>> the data is horked.  This is exactly how that behaves.
>> > >>>
>> > >>> On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <rajeshnikam@gmail.com>
>> > >>> wrote:
>> > >>>
>> > >>> > Hi Ted,
>> > >>> >
>> > >>> > I was thinking this might be due to having only 100 instances for
>> > >>> > training.
>> > >>> >
>> > >>> > So I have created a test set with two classes and ~49K instances, and
>> > >>> > included all features as predictors.
>> > >>> > Please find attached sgd.grps.zip with the test file.
>> > >>> >
>> > >>> > mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
>> > >>> > --output /usr/local/mahout/trainme/sgd-grps.model --target class
>> > >>> > --categories 2 --features 128 --types n --predictors a1 a2 a3 a4 a5 a6 a7 a8
>> > >>> > a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26
>> > >>> > a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44
>> > >>> > a45 a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62
>> > >>> > a63 a64 a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80
>> > >>> > a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98
>> > >>> > a99 a100 a101 a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112
>> > >>> > a113 a114 a115 a116 a117 a118 a119 a120 a121 a122 a123 a124 a125 a126
>> > >>> > a127
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
>> > >>> > --model /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
>> > >>> >
>> > >>> > The results are still similar: it classifies everything as class_1.
>> > >>> >
>> > >>> > AUC = 0.50
>> > >>> > confusion: [[*26563.0, 23006.0*], [0.0, 0.0]]
>> > >>> > entropy: [[-0.0, -0.0], [-46.1, -21.4]]
>> > >>> >
>> > >>> > I am not sure why this is failing all the time.
>> > >>> >
>> > >>> > Looking forward to your reply.
>> > >>> >
>> > >>> > Thanks
>> > >>> > Rajesh
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <ted.dunning@gmail.com>
>> > >>> > wrote:
>> > >>> >
>> > >>> > > I would love to help and will before long.  Just can't do it in the
>> > >>> > > first part of this week.
>> > >>> > >
>> > >>> > > On Mon, Oct 15, 2012 at 6:28 AM, Rajesh Nikam <rajeshnikam@gmail.com>
>> > >>> > > wrote:
>> > >>> > >
>> > >>> > > > Hello,
>> > >>> > > >
>> > >>> > > > I have asked the question below, about an issue with using SGD, on the
>> > >>> > > > Mahout forum.
>> > >>> > > >
>> > >>> > > > A similar issue with SGD is reported at
>> > >>> > > > http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout
>> > >>> > > >
>> > >>> > > > Even the link below shows similar output:
>> > >>> > > >
>> > >>> > > > AUC = 0.57
>> > >>> > > > *confusion: [[27.0, 13.0], [0.0, 0.0]]*
>> > >>> > > > entropy: [[-0.4, -0.3], [-1.2, -0.7]]
>> > >>> > > >
>> > >>> > > > http://sujitpal.blogspot.in/2012/09/learning-mahout-classification.html
>> > >>> > > >
>> > >>> > > > I am still confused about how this model works and is used by many.
>> > >>> > > > I have not been able to find any pointers on how to use SGD so that it
>> > >>> > > > generates an effective model.
>> > >>> > > >
>> > >>> > > > Could someone point out what is missing in the input file or the
>> > >>> > > > provided parameters?
>> > >>> > > >
>> > >>> > > > I appreciate your help.
>> > >>> > > >
>> > >>> > > > Below is a description of the steps that I followed.
>> > >>> > > >
>> > >>> > > > Please find attached the input files used for the experiment.
>> > >>> > > >
>> > >>> > > > I am using the Iris Plants Database from Michael Marshall. Please find
>> > >>> > > > attached iris.arff.
>> > >>> > > > I converted this to a CSV file just by updating the header:
>> > >>> > > > iris-3-classes.csv
>> > >>> > > >
>> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
>> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
>> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.model --target class
>> > >>> > > > --categories 3 --predictors sepallength sepalwidth petallength
>> > >>> > > > petalwidth --types n
>> > >>> > > > >> it gave the following error:
>> > >>> > > > Exception in thread "main" java.lang.IllegalArgumentException: Can only
>> > >>> > > > call classifyScalar with two categories
>> > >>> > > >
>> > >>> > > > I then created a CSV with only 2 classes. Please find attached iris-2-classes.csv
>> > >>> > > >
>> > >>> > > > >> trained iris-2-classes.csv with sgd
>> > >>> > > >
>> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
>> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
>> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class
>> > >>> > > > --categories 2 --predictors sepallength sepalwidth petallength
>> > >>> > > > petalwidth --types n
>> > >>> > > >
>> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
>> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
>> > >>> > > >
>> > >>> > > > AUC = 0.14
>> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
>> > >>> > > > entropy: [[-0.6, -0.3], [-0.8, -0.4]]
>> > >>> > > >
>> > >>> > > > >> AUC seems too poor. I then changed --predictors:
>> > >>> > > >
>> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
>> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
>> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class
>> > >>> > > > --categories 2 --predictors sepalwidth petallength --types n
>> > >>> > > >
>> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
>> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
>> > >>> > > > --scores
>> > >>> > > >
>> > >>> > > > AUC = 0.80
>> > >>> > > > *confusion: [[50.0, 50.0], [0.0, 0.0]]*
>> > >>> > > > entropy: [[-0.7, -0.3], [-0.7, -0.4]]
>> > >>> > > >
>> > >>> > > > This model classifies everything as category 1, which is of no use.
>> > >>> > > >
>> > >>> > > > Thanks
>> > >>> > > > Rajesh
>> > >>> > > >
>> > >>> > > >
>> > >>> > > >
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>>
>
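
One thing worth noting about the runs quoted above: AUC depends only on how the scores rank the two classes, while the confusion matrix depends on where the decision threshold falls. The two can therefore disagree, as in the run that reports AUC = 0.80 alongside a confusion matrix with a single non-empty row. The sketch below illustrates this with made-up scores in plain Python. It is not Mahout code; the helper names `auc` and `confusion`, the 0.5 threshold, and the [predicted][actual] matrix layout are all assumptions chosen for illustration.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, threshold=0.5):
    """2x2 count matrix indexed [predicted][actual] (layout assumed for this sketch)."""
    m = [[0, 0], [0, 0]]
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        m[predicted][y] += 1
    return m


# Made-up scores: positives tend to score higher, but every score is below 0.5.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.15, 0.20, 0.30, 0.25, 0.35, 0.40, 0.45]

print("AUC =", auc(scores, labels))
print("confusion:", confusion(scores, labels))
```

With these scores the ranking is nearly perfect, yet every instance falls on the same side of the threshold, so a high AUC and a one-row confusion matrix are not contradictory. This does not say what causes the behaviour reported in the thread.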

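Since the suspicion in the thread is that the data is being read or encoded incorrectly, a quick independent check of the CSV may help narrow things down before comparing against the test case. The following is a minimal sketch using only the Python standard library; `check_csv` is a hypothetical helper, and the three-row sample is made up rather than taken from the attached files. It reports which of the requested column names are missing from the header, how many rows each target value has, and how many cells per predictor fail to parse as numbers.

```python
import csv
import io


def check_csv(handle, target, predictors):
    """Summarize header mismatches, target values, and non-numeric predictor cells."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    # Names passed on the command line that do not appear in the header row.
    missing = [name for name in [target] + predictors if name not in header]
    present = [name for name in predictors if name in header]
    target_counts = {}
    bad_cells = {name: 0 for name in present}
    for row in reader:
        if target in header:
            value = row[target]
            target_counts[value] = target_counts.get(value, 0) + 1
        for name in present:
            try:
                float(row[name])
            except (TypeError, ValueError):
                # TypeError covers short rows (None), ValueError covers text like "?".
                bad_cells[name] += 1
    return {"missing": missing, "target_counts": target_counts, "bad_cells": bad_cells}


# Tiny made-up sample in the shape described in the thread (not the real data).
sample = io.StringIO(
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,?,1.5,Iris-versicolor\n"
)
report = check_csv(
    sample, "class", ["sepallength", "sepalwidth", "petallength", "petalwidth"]
)
print(report)
```

To run it against a real file, pass `open(path, newline="")` instead of the in-memory sample. A target column with more or fewer distinct values than the `--categories` setting, or a predictor with many unparseable cells, would be worth looking at first.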