mahout-user mailing list archives

From Salman Mahmood <>
Subject Re: SGD diferent confusion matrix for each run
Date Fri, 31 Aug 2012 16:04:06 GMT
Cheers, Ted. Appreciate the input!

On 31 Aug 2012, at 17:53, Ted Dunning <> wrote:

> OK.
> Try passing through the data 100 times for a start.  I think that this is
> likely to fix your problems.
> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
> may converge faster than it should.
> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <> wrote:
>> Thanks a lot, Ted. Here are the answers:
>> d) Data (news articles from different feeds)
>>    News Article 1:
>>      Title: BP Profits Plunge On Massive Asset Write-down
>>      Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in
>>      adjusted profit for the second quarter as it wrote down the value of
>>      its assets by $5 billion including some U.S. refineries a suspended
>>      Alaskan oil project and U.S. shale gas resources
>>    News Article 2:
>>      Title: Morgan Stanley Missed Big
>>      Description: Why It's Still A Fantastic Short,"By Mike Williams:
>>      Though the market responded very positively to Citigroup (C) and Bank
>>      of America's (BAC) reserve release-driven earnings ""beats"" last
>>      week's Morgan Stanley (MS) earnings report illustrated what happens
>>      when a bank doesn't have billions of reserves to release back into
>>      earnings. Estimates called for the following: $.43 per share in
>>      earnings $.29 per share in earnings ex-DVA (debt value adjustment)
>>      $7.7 billion in revenue GAAP results (including the DVA) came in at
>>      $.28 per share while ex-DVA earnings were $.16. Revenue was a
>>      particular disappointment coming in at $6.95 billion.
>> c) As you can see, the data is textual. I am using the title and
>> description as predictor variables, and the target variable is the
>> company that a news article belongs to.
>> b) I am passing through the data once (at least, that is what I think).
>> I followed the 20newsgroup example code (in Java) and did not find that
>> the data was passed through more than once.
>> Yes, I randomize the order every time.
>> a) I am using AdaptiveLogisticRegression (just like the 20newsgroup
>> example).
>> Thanks!
>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>> First, this is a tiny training set.  You are well outside the intended
>>> application range so you are likely to find less experience in the
>>> community in that range.  That said, the algorithm should still produce
>>> reasonably stable results.
>>> Here are a few questions:
>>> a) which class are you using to train your model?  I would start with
>>> OnlineLogisticRegression and experiment with training rate schedules and
>>> amount of regularization to find out how to build a good model.
>>> b) how many times are you passing through your data?  Do you randomize
>>> the order each time?  These are critical to proper training.  Instead of
>>> randomizing order, you could just sample a data point at random and not
>>> worry about using a complete permutation of the data.  With such a tiny
>>> data set, you will need to pass through the data many times ... possibly
>>> hundreds of times or more.
>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
>>> What kind?
>>> d) can you post your data?
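
Point (b) above can be sketched in plain Java. This is a stand-alone illustration under stated assumptions, not Mahout's OnlineLogisticRegression or AdaptiveLogisticRegression: the class and method names (MultiPassSgd, trainShuffled, trainSampled), the toy 2-D data, and the fixed learning rate of 0.1 are all made up for the example. It shows the two options mentioned: reshuffling before every pass, or drawing one example at random per update. The seed is an explicit parameter, so the same seed reproduces the same model and different seeds show how much the result depends on ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Stand-alone illustration; NOT Mahout's OnlineLogisticRegression.
// All names here are hypothetical.
class MultiPassSgd {

    // Predicted probability of class 1; the last weight is the bias.
    static double predict(double[] w, double[] row) {
        double z = w[w.length - 1];
        for (int j = 0; j < row.length; j++) {
            z += w[j] * row[j];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD update on a single example (log-likelihood gradient step).
    static void step(double[] w, double[] row, int label, double rate) {
        double err = label - predict(w, row);
        for (int j = 0; j < row.length; j++) {
            w[j] += rate * err * row[j];
        }
        w[w.length - 1] += rate * err;
    }

    // Option 1: several full passes, reshuffling the order before each pass.
    static double[] trainShuffled(double[][] x, int[] y, int passes, double rate, long seed) {
        double[] w = new double[x[0].length + 1];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            order.add(i);
        }
        Random rnd = new Random(seed);
        for (int p = 0; p < passes; p++) {
            Collections.shuffle(order, rnd);
            for (int i : order) {
                step(w, x[i], y[i], rate);
            }
        }
        return w;
    }

    // Option 2: draw one example at random per update, no full permutation.
    static double[] trainSampled(double[][] x, int[] y, int updates, double rate, long seed) {
        double[] w = new double[x[0].length + 1];
        Random rnd = new Random(seed);
        for (int t = 0; t < updates; t++) {
            int i = rnd.nextInt(x.length);
            step(w, x[i], y[i], rate);
        }
        return w;
    }

    // Fraction of examples classified correctly at a 0.5 threshold.
    static double accuracy(double[] w, double[][] x, int[] y) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            int guess = predict(w, x[i]) >= 0.5 ? 1 : 0;
            if (guess == y[i]) {
                correct++;
            }
        }
        return (double) correct / x.length;
    }

    // Toy data (an assumption): n points per class in 2-D,
    // centred on (2, 2) for class 1 and (-2, -2) for class 0.
    static double[][] toyFeatures(int nPerClass, long seed) {
        Random rnd = new Random(seed);
        double[][] x = new double[2 * nPerClass][2];
        for (int i = 0; i < x.length; i++) {
            double centre = i < nPerClass ? 2.0 : -2.0;
            x[i][0] = centre + rnd.nextGaussian();
            x[i][1] = centre + rnd.nextGaussian();
        }
        return x;
    }

    static int[] toyLabels(int nPerClass) {
        int[] y = new int[2 * nPerClass];
        for (int i = 0; i < nPerClass; i++) {
            y[i] = 1; // first half is the positive class
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] x = toyFeatures(50, 7L);
        int[] y = toyLabels(50);
        // Compare one pass with 100 passes for a few different shuffling seeds.
        for (long seed = 1; seed <= 3; seed++) {
            double onePass = accuracy(trainShuffled(x, y, 1, 0.1, seed), x, y);
            double manyPasses = accuracy(trainShuffled(x, y, 100, 0.1, seed), x, y);
            System.out.println("seed " + seed + ": 1 pass " + onePass
                    + ", 100 passes " + manyPasses);
        }
    }
}
```

The toy problem is much easier than the news data discussed in this thread, so the numbers it prints say nothing about what to expect there; the sketch only shows where the randomness enters and how the number of passes is controlled.
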
>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <> wrote:
>>>> Thanks a lot, Lance. Let me elaborate on the problem in case it was a
>>>> bit confusing.
>>>> Assume I am building a binary classifier using SGD. I have 50 positive
>>>> and 50 negative examples to train the classifier. After training and
>>>> testing the model, the confusion matrix tells you the number of
>>>> correctly and incorrectly classified instances. Let's assume I got 85%
>>>> correct and 15% incorrect instances.
>>>> Now if I run my program again using the same 50 negative and 50 positive
>>>> examples, then as far as I know the classifier should yield the same
>>>> results as before (because not a single training or testing example was
>>>> changed), but this is not the case. I get different results for
>>>> different runs. The confusion matrix figures change each time I generate
>>>> a model, even though the data stays constant. What I do is generate a
>>>> model several times and keep an eye on the accuracy; if it is above 90%,
>>>> I stop running the code, and hence an accurate model is created.
>>>> So what you are saying is that I should shuffle my data before I use it
>>>> for training and testing?
>>>> Thanks!
>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>> permute the order of your data.
>>>>> If that does not help, another trick: for each data point, randomly
>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>> permute the entire input set.
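
The second trick above (a few nearby points per data point, then a random permutation of everything) could look roughly like the following. It is a sketch under assumptions, not a Mahout utility: JitterAugment, Example, and augment are hypothetical names, the features are assumed to be dense numeric arrays, and "close" is interpreted as uniform noise within a fixed radius on each feature.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
class JitterAugment {

    // A labelled example: dense features plus a class label.
    static final class Example {
        final double[] features;
        final int label;

        Example(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    // Keeps every original point, adds `copies` nearby points per original
    // (uniform noise in [-radius, radius) on each feature; an assumption),
    // then permutes the whole set.
    static List<Example> augment(double[][] x, int[] y, int copies, double radius, long seed) {
        Random rnd = new Random(seed);
        List<Example> out = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            out.add(new Example(x[i].clone(), y[i]));
            for (int c = 0; c < copies; c++) {
                double[] near = x[i].clone();
                for (int j = 0; j < near.length; j++) {
                    near[j] += (2.0 * rnd.nextDouble() - 1.0) * radius;
                }
                out.add(new Example(near, y[i])); // jittered copy keeps the label
            }
        }
        Collections.shuffle(out, rnd); // permute originals and copies together
        return out;
    }
}
```

For example, `JitterAugment.augment(x, y, 10, 0.1, 42L)` turns 100 examples into 1100. For sparse text features such as hashed word counts, what counts as a "close" point is less obvious, so this version should be read as applying to dense numeric inputs only.
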
>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <> wrote:
>>>>>> The more data you have, the closer each run will be. How much data
>>>>>> do you have?
>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <> wrote:
>>>>>>> I have noticed that every time I train and test a model using the
>>>>>>> same data (with the SGD algorithm), I get a different confusion
>>>>>>> matrix. That is, if I generate a model and look at the confusion
>>>>>>> matrix, it might say 90% correctly classified instances, but if I
>>>>>>> generate the model again (with the SAME data for training and
>>>>>>> testing as before) and test it, the confusion matrix changes and it
>>>>>>> might say 75% correctly classified instances.
>>>>>>> Is this a desired behavior?
>>>>>> --
>>>>>> Lance Norskog
>>>>> --
>>>>> Lance Norskog
