mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SGD different confusion matrix for each run
Date Sat, 01 Sep 2012 03:24:15 GMT
That would be best, but practically speaking, randomizing once is usually
OK.  With a tiny data set like this that is in memory anyway, I wouldn't
take any chances.
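
For illustration only, here is a minimal sketch of what "many passes, reshuffled each time" might look like. It is plain Python, not Mahout's API; train_sgd, predict, DATA and the learning rate are hypothetical names and values chosen for the example. The fixed seed is an added assumption that makes repeated runs on the same data produce the same model.

```python
import math
import random


def train_sgd(examples, passes=100, rate=0.1, seed=42):
    """Logistic regression trained by plain SGD, reshuffling on every pass."""
    rng = random.Random(seed)            # fixed seed: repeated runs agree
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    order = list(range(len(examples)))
    for _ in range(passes):
        rng.shuffle(order)               # fresh random order for each pass
        for i in order:
            x, y = examples[i]
            z = bias + sum(w * v for w, v in zip(weights, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = y - p                  # gradient of the log-likelihood
            bias += rate * err
            weights = [w + rate * err * v for w, v in zip(weights, x)]
    return weights, bias


def predict(model, x):
    """Return 1 if the linear score is positive, else 0."""
    weights, bias = model
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if z > 0 else 0


# Toy data (hypothetical): two features, two well-separated classes.
DATA = [
    ([-1.0, -0.8], 0),
    ([-0.9, -1.1], 0),
    ([1.0, 0.9], 1),
    ([0.8, 1.1], 1),
]

if __name__ == "__main__":
    model = train_sgd(DATA)
    print([predict(model, x) for x, _ in DATA])
```

Randomizing only once would amount to moving the rng.shuffle(order) call above the outer loop.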

On Fri, Aug 31, 2012 at 9:08 PM, Lance Norskog <goksron@gmail.com> wrote:

> "Try passing through the data 100 times for a start. "
>
> And randomize the order each time?
>
> On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <salman@influestor.com>
> wrote:
> > Cheers Ted. Appreciate the input!
> >
> > Sent from my iPhone
> >
> > On 31 Aug 2012, at 17:53, Ted Dunning <ted.dunning@gmail.com> wrote:
> >
> >> OK.
> >>
> >> Try passing through the data 100 times for a start.  I think that this is
> >> likely to fix your problems.
> >>
> >> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
> >> may converge faster than it should.
> >>
> >> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com>
> >> wrote:
> >>
> >>> Thanks a lot, Ted. Here are the answers:
> >>> d) Data (news articles from different feeds)
> >>>        News Article 1: Title : BP Profits Plunge On Massive Asset
> >>> Write-down
> >>>                                    Description :BP PLC (BP) Tuesday
> >>> posted a dramatic fall of 96% in adjusted profit for the
> >>> second quarter as it wrote down the value of its assets by $5 billion
> >>> including some U.S. refineries a suspended Alaskan oil project and U.S.
> >>> shale gas resources
> >>>
> >>>        News Article 2: Title : Morgan Stanley Missed Big
> >>>                                     Description: Why It's Still A
> >>> Fantastic Short,"By Mike Williams: Though the market responded very
> >>> positively to Citigroup (C) and Bank of America's (BAC) reserve
> >>> release-driven earnings ""beats"" last week's Morgan Stanley (MS) earnings
> >>> report illustrated what happens when a bank doesn't have billions of
> >>> reserves to release back into earnings. Estimates called for the following:
> >>> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt value
> >>> adjustment) $7.7 billion in revenue GAAP results (including the DVA) came
> >>> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
> >>> particular disappointment coming in at $6.95 billion.
> >>>
> >>> c) As you can see, the data is textual. I am using the title and
> >>> description as predictor variables, and the target variable is the
> >>> company name a news article belongs to.
> >>>
> >>> b) I am passing through the data once (at least this is what I think).
> >>> I followed the 20newsgroup example code (in Java) and didn't find that
> >>> the data was passed more than once.
> >>> Yes, I randomize the order every time.
> >>>
> >>> a) I am using AdaptiveLogisticRegression (just like 20newsgroup).
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>
> >>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
> >>>
> >>>> First, this is a tiny training set.  You are well outside the intended
> >>>> application range so you are likely to find less experience in the
> >>>> community in that range.  That said, the algorithm should still produce
> >>>> reasonably stable results.
> >>>>
> >>>> Here are a few questions:
> >>>>
> >>>> a) which class are you using to train your model?  I would start with
> >>>> OnlineLogisticRegression and experiment with training rate schedules and
> >>>> amount of regularization to find out how to build a good model.
> >>>>
> >>>> b) how many times are you passing through your data?  Do you randomize
> >>>> the order each time?  These are critical to proper training.  Instead of
> >>>> randomizing order, you could just sample a data point at random and not
> >>>> worry about using a complete permutation of the data.  With such a tiny
> >>>> data set, you will need to pass through the data many times ... possibly
> >>>> hundreds of times or more.
> >>>>
> >>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
> >>>> What kind?
> >>>>
> >>>> d) can you post your data?
> >>>>
> >>>>
> >>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com>
> >>>> wrote:
> >>>>
> >>>>> Thanks a lot, Lance. Let me elaborate on the problem in case it was a
> >>>>> bit confusing.
> >>>>>
> >>>>> Assume I am building a binary classifier using SGD. I have 50 positive
> >>>>> and 50 negative examples to train the classifier. After training and
> >>>>> testing the model, the confusion matrix tells you the number of
> >>>>> correctly and incorrectly classified instances. Let's assume I got 85%
> >>>>> correct and 15% incorrect instances.
> >>>>>
> >>>>> Now if I run my program again using the same 50 negative and 50 positive
> >>>>> examples, then as far as I know the classifier should yield the same
> >>>>> results as before (because not a single training or testing example
> >>>>> was changed), but this is not the case. I get different results for
> >>>>> different runs. The confusion matrix figures change each time I
> >>>>> generate a model while keeping the data constant. What I do is generate
> >>>>> a model several times and watch the accuracy; if it is above 90%, I
> >>>>> stop running the code, and so an accurate model is created.
> >>>>>
> >>>>> So what you are saying is that I should shuffle my data before I use it
> >>>>> for training and testing?
> >>>>> Thanks!
> >>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>>>>
> >>>>>> Now I remember: SGD wants its data input in random order. You need to
> >>>>>> permute the order of your data.
> >>>>>>
> >>>>>> If that does not help, another trick: for each data point, randomly
> >>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>>>>> permute the entire input set.
> >>>>>>
> >>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <goksron@gmail.com> wrote:
> >>>>>>> The more data you have, the closer each run will be. How much data do
> >>>>>>> you have?
> >>>>>>>
> >>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <salman@influestor.com> wrote:
> >>>>>>>> I have noticed that every time I train and test a model using the same
> >>>>>>>> data (with the SGD algorithm), I get a different confusion matrix. That
> >>>>>>>> is, if I generate a model and look at the confusion matrix, it might say
> >>>>>>>> 90% correctly classified instances, but if I generate the model again
> >>>>>>>> (with the SAME data for training and testing as before) and test it, the
> >>>>>>>> confusion matrix changes and it might say 75% correctly classified
> >>>>>>>> instances.
> >>>>>>>>
> >>>>>>>> Is this a desired behavior?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Lance Norskog
> >>>>>>> goksron@gmail.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Lance Norskog
> >>>>>> goksron@gmail.com
> >>>>>
> >>>>>
> >>>
> >>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>
