mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SGD different confusion matrix for each run
Date Fri, 31 Aug 2012 15:52:59 GMT
OK.

Try passing through the data 100 times for a start.  I think that this is
likely to fix your problems.

Be warned that AdaptiveLogisticRegression has been misbehaving lately and
may converge faster than it should.
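
For a data set this small, the training loop is easy to write by hand.
Here is a rough, untested sketch using OnlineLogisticRegression (the
List inputs, the class name, and the tuning constants are placeholders
for your own encoding and settings):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class MultiPassTrainer {
  // features.get(i) is the encoded vector for example i and
  // targets.get(i) its category index; both are placeholder names.
  public static OnlineLogisticRegression train(List<Vector> features,
                                               List<Integer> targets,
                                               int numCategories,
                                               int numFeatures) {
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1())
            .learningRate(0.1)   // starting point only; worth tuning
            .lambda(1e-4);       // regularization strength; worth tuning

    List<Integer> order = new ArrayList<Integer>();
    for (int i = 0; i < features.size(); i++) {
      order.add(i);
    }
    Random rand = new Random();
    for (int pass = 0; pass < 100; pass++) {  // many passes over a tiny set
      Collections.shuffle(order, rand);       // fresh random order each pass
      for (int i : order) {
        learner.train(targets.get(i), features.get(i));
      }
    }
    return learner;
  }
}

The learning rate and lambda here are exactly the knobs worth
experimenting with once this converges.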

On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com> wrote:

> Thanks a lot, Ted. Here are the answers:
> d) Data (news articles from different feeds)
>
>         News Article 1:
>         Title: BP Profits Plunge On Massive Asset Write-down
>         Description: BP PLC (BP) Tuesday posted a dramatic fall of 96%
> in adjusted profit for the second quarter as it wrote down the value of
> its assets by $5 billion, including some U.S. refineries, a suspended
> Alaskan oil project, and U.S. shale gas resources.
>
>         News Article 2:
>         Title: Morgan Stanley Missed Big - Why It's Still A Fantastic
> Short
>         Description: By Mike Williams: Though the market responded very
> positively to Citigroup (C) and Bank of America's (BAC) reserve
> release-driven earnings "beats" last week, Morgan Stanley's (MS)
> earnings report illustrated what happens when a bank doesn't have
> billions of reserves to release back into earnings. Estimates called
> for the following: $.43 per share in earnings, $.29 per share in
> earnings ex-DVA (debt value adjustment), and $7.7 billion in revenue.
> GAAP results (including the DVA) came in at $.28 per share, while
> ex-DVA earnings were $.16. Revenue was a particular disappointment,
> coming in at $6.95 billion.
>
> c) As you can see, the data is textual. I am using the title and
> description as predictor variables, and the target variable is the
> company the news article belongs to.
>
> b) I am passing through the data once (at least that is what I think).
> I followed the 20newsgroups example code (in Java) and didn't find that
> the data was passed through more than once.
> Yes, I randomize the order every time.
>
> a) I am using AdaptiveLogisticRegression (just like the 20newsgroups example).
>
> Thanks!
>
>
>
> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>
> > First, this is a tiny training set.  You are well outside the intended
> > application range, so you are likely to find less community experience
> > in that range.  That said, the algorithm should still produce
> > reasonably stable results.
> >
> > Here are a few questions:
> >
> > a) which class are you using to train your model?  I would start with
> > OnlineLogisticRegression and experiment with training rate schedules and
> > amount of regularization to find out how to build a good model.
> >
> > b) how many times are you passing through your data?  Do you randomize
> > the order each time?  These are critical to proper training.  Instead
> > of randomizing the order, you could just sample a data point at random
> > and not worry about using a complete permutation of the data.  With
> > such a tiny data set, you will need to pass through the data many
> > times ... possibly hundreds of times or more.
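> >
> > For instance, the sampling variant might look roughly like this
> > (untested sketch; "learner", "features", and "targets" are placeholder
> > names for your model and encoded data):
> >
> > Random rand = new Random();
> > int steps = 100 * features.size();        // roughly 100 "passes"
> > for (int n = 0; n < steps; n++) {
> >   int i = rand.nextInt(features.size());  // sample with replacement
> >   learner.train(targets.get(i), features.get(i));
> > }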
> >
> > c) what kind of data do you have?  Sparse?  Dense?  How many variables?
> > What kind?
> >
> > d) can you post your data?
> >
> >
> > On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com>
> > wrote:
> >
> >> Thanks a lot, Lance. Let me elaborate on the problem in case it was
> >> a bit confusing.
> >>
> >> Assume I am building a binary classifier using SGD. I have 50
> >> positive and 50 negative examples to train the classifier. After
> >> training and testing the model, the confusion matrix tells you the
> >> number of correctly and incorrectly classified instances. Let's
> >> assume I got 85% correct and 15% incorrect instances.
> >>
> >> Now if I run my program again using the same 50 negative and 50
> >> positive examples, then to my knowledge the classifier should yield
> >> the same results as before (since not a single training or testing
> >> example was changed), but this is not the case. I get different
> >> results for different runs. The confusion matrix figures change each
> >> time I generate a model, even though I keep the data constant. What
> >> I do is generate a model several times and watch the accuracy; if it
> >> is above 90%, I stop running the code, and hence an accurate model
> >> is created.
> >>
> >> So you are saying I should shuffle my data before using it for
> >> training and testing?
> >> Thanks!
> >> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>
> >>> Now I remember: SGD wants its data input in random order. You need to
> >>> permute the order of your data.
> >>>
> >>> If that does not help, another trick: for each data point, randomly
> >>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>> permute the entire input set.
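> >>>
> >>> In Mahout terms the jitter step might look roughly like this
> >>> (untested fragment; "features" and "targets" are placeholder names
> >>> for your encoded vectors and labels):
> >>>
> >>> List<Vector> jittered = new ArrayList<Vector>();
> >>> List<Integer> jitteredTargets = new ArrayList<Integer>();
> >>> Random rand = new Random();
> >>> for (int i = 0; i < features.size(); i++) {
> >>>   for (int k = 0; k < 10; k++) {                 // 10 copies apiece
> >>>     Vector copy = features.get(i).clone();
> >>>     Iterator<Vector.Element> it = copy.iterateNonZero();
> >>>     while (it.hasNext()) {
> >>>       Vector.Element e = it.next();
> >>>       e.set(e.get() * (1 + 0.05 * rand.nextGaussian()));  // small noise
> >>>     }
> >>>     jittered.add(copy);
> >>>     jitteredTargets.add(targets.get(i));
> >>>   }
> >>> }
> >>> // then shuffle jittered/jitteredTargets together before training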
> >>>
> >>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <goksron@gmail.com>
> >>> wrote:
> >>>> The more data you have, the closer each run will be. How much data
> >>>> do you have?
> >>>>
> >>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <salman@influestor.com>
> >>>> wrote:
> >>>>> I have noticed that every time I train and test a model using the
> >>>>> same data (in the SGD algo), I get a different confusion matrix.
> >>>>> Meaning, if I generate a model and look at the confusion matrix,
> >>>>> it might say 90% correctly classified instances, but if I generate
> >>>>> the model again (with the SAME data for training and testing as
> >>>>> before) and test it, the confusion matrix changes and might say
> >>>>> 75% correctly classified instances.
> >>>>>
> >>>>> Is this the desired behavior?
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goksron@gmail.com
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>
> >>
>
>
