mahout-user mailing list archives

From Ted Dunning <>
Subject Re: SGD different confusion matrix for each run
Date Fri, 31 Aug 2012 12:27:09 GMT
First, this is a tiny training set.  You are well outside the intended
application range, so the community has less experience to draw on
there.  That said, the algorithm should still produce reasonably stable
results.

Here are a few questions:

a) which class are you using to train your model?  I would start with
OnlineLogisticRegression and experiment with training rate schedules and
amount of regularization to find out how to build a good model.

b) how many times are you passing through your data?  Do you randomize the
order each time?  These are critical to proper training.  Instead of
randomizing order, you could just sample a data point at random and not
worry about using a complete permutation of the data.  With such a tiny
data set, you will need to pass through the data many times ... possibly
hundreds of times or more.
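A sketch of the bookkeeping for the shuffle-every-pass option (the class and method names here are invented for the example; with Mahout you would feed each index's example to OnlineLogisticRegression in the inner loop):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Builds the visit order for many passes over a tiny training set,
// reshuffling the example indices before every pass.
public class ShuffledEpochs {
    // Returns one permutation of 0..n-1 per epoch. A fixed seed makes runs
    // reproducible; leaving the RNG unseeded is one reason back-to-back
    // runs can yield different confusion matrices.
    static List<List<Integer>> visitOrder(int n, int epochs, long seed) {
        Random rnd = new Random(seed);
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        List<List<Integer>> order = new ArrayList<>();
        for (int e = 0; e < epochs; e++) {
            Collections.shuffle(idx, rnd);     // fresh permutation each pass
            order.add(new ArrayList<>(idx));
        }
        return order;
    }

    public static void main(String[] args) {
        // 100 examples, 200 passes -- a tiny set needs many passes.
        List<List<Integer>> order = visitOrder(100, 200, 42L);
        System.out.println("first epoch starts with " + order.get(0).subList(0, 5));
    }
}
```

The sample-at-random alternative is even simpler: replace the whole schedule with `rnd.nextInt(n)` per training step, as in the sketch under (a).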

c) what kind of data do you have?  Sparse?  Dense?  How many variables?
What kind?

d) can you post your data?

On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <> wrote:

> Thanks a lot, Lance. Let me elaborate on the problem in case it was confusing.
> Assume I am building a binary classifier using SGD, and I have 50 positive
> and 50 negative examples to train it. After training and testing the
> model, the confusion matrix tells you the number of correctly and
> incorrectly classified instances. Let's assume I got 85% correct and 15%
> incorrect.
> Now if I run my program again using the same 50 negative and 50 positive
> examples, then to my knowledge the classifier should yield the same
> results as before (because not a single training or testing example was
> changed), but this is not the case. I get different results on different
> runs. The confusion matrix figures change each time I generate a model,
> even though the data is held constant. What I do is generate a model
> several times while watching the accuracy, and once it is above 90% I stop
> running the code, so that an accurate model is created.
> So what you are saying is to shuffle my data before I use it for training
> and testing?
> Thanks!
> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> > Now I remember: SGD wants its data input in random order. You need to
> > permute the order of your data.
> >
> > If that does not help, another trick: for each data point, randomly
> > generate 5 or 10 or 20 points which are close. And again, randomly
> > permute the entire input set.
> >
> > On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <>
> wrote:
> >> The more data you have, the closer each run will be. How much data do
> you have?
> >>
> >> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <>
> wrote:
> >>> I have noticed that every time I train and test a model using the same
> data (with the SGD algorithm), I get a different confusion matrix. That is, if
> I generate a model and look at the confusion matrix, it might say 90%
> correctly classified instances, but if I generate the model again (with the
> SAME data for training and testing as before) and test it, the confusion
> matrix changes and might say 75% correctly classified instances.
> >>>
> >>> Is this a desired behavior?
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >>
> >
> >
> >
> > --
> > Lance Norskog
> >
