mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: SGD classifier demo app
Date Tue, 04 Feb 2014 09:40:23 GMT
Yes.


On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter <ssc@apache.org> wrote:

> Would be great to add this as an example to Mahout's codebase.
>
>
> On 02/04/2014 10:27 AM, Ted Dunning wrote:
>
>> Frank,
>>
>> I just munched on your code and sent a pull request.
>>
>> In doing this, I made a bunch of changes.  Hope you like them.
>>
>> These include massive simplification of the reading and vectorization.
>> This wasn't strictly necessary, but it seemed like a good idea.
>>
>> More important was the way that I changed the vectorization.  For the
>> continuous values, I added log transforms.  For the categorical values, I
>> encoded them as they are.  I also increased the feature vector size to 100
>> to avoid excessive collisions.
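
A minimal sketch of the log transform idea described above, in plain Python rather than Mahout's Java API. The thread does not say which exact transform was used; `log1p`, the function name, and the sample durations are assumptions made for illustration.

```python
import math

def log_transform(value):
    """Compress a non-negative, heavy-tailed value; log1p maps 0 to 0."""
    if value < 0:
        raise ValueError("this sketch assumes non-negative input")
    return math.log1p(value)

# Hypothetical call durations in seconds, spanning three orders of magnitude.
durations = [0, 9, 99, 999]
transformed = [log_transform(d) for d in durations]
```

After the transform the largest value is a few units away from the smallest instead of a thousand, so a single long call no longer dominates the gradient step.
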
>>
>> In the learning code itself, I got rid of the use of index arrays in favor
>> of shuffling the training data itself.  I also tuned the learning
>> parameters a lot.
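
A rough sketch of shuffling the training records themselves on each pass, instead of keeping a separate array of indices. This is plain Python, not the code from the pull request; `shuffled_epochs`, the seed, and the toy records are hypothetical.

```python
import random

def shuffled_epochs(examples, epochs, seed=0):
    """Yield every example once per epoch, in a fresh random order each time."""
    rng = random.Random(seed)
    data = list(examples)  # copy so the caller's list is left untouched
    for _ in range(epochs):
        rng.shuffle(data)  # shuffle the records directly, no index array
        for example in data:
            yield example

# Toy (features, label) records, two passes over the data.
seen = list(shuffled_epochs([("a", 0), ("b", 1), ("c", 0)], epochs=2))
```
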
>>
>> The resulting AUC is just a tiny bit less than 0.9, which is pretty
>> close to what I got in R.
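
For readers unfamiliar with AUC, here is a small pairwise sketch of what the number measures. It is plain Python written for clarity, not the evaluation code used in the thread, and the function name is an assumption.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the data size, fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ordered correctly, so the function returns 0.75. A model that gives every row the same score gets 0.5, which is why AUC is more informative than accuracy on unbalanced data.
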
>>
>> For everybody else, see
>> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
>> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
>> for my pull request.
>>
>>
>>
>> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>>
>>
>>> Johannes,
>>>
>>> Very good comments.
>>>
>>> Frank,
>>>
>>> As a benchmark, I just spent a few minutes building a logistic regression
>>> model using R.  For this model AUC on 10% held-out data is about 0.9.
>>>
>>> Here is a gist summarizing the results:
>>>
>>> https://gist.github.com/tdunning/8794734
>>>
>>>
>>>
>>>
>>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
>>> johannes.schulte@gmail.com> wrote:
>>>
>>>> Hi Frank,
>>>>
>>>> you are using the feature vector encoders, which hash a combination of
>>>> feature name and feature value to 2 (default) locations in the vector.
>>>> The vector size you configured is 11, and this is imo very small compared
>>>> to the possible combinations of values you have for your data (education,
>>>> marital, campaign). You can do no harm by using a much bigger cardinality
>>>> (try 1000).
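
To make the collision argument concrete, here is a sketch that hashes (name, value) pairs to two slots each and counts how many slots end up shared. It is plain Python and does not reproduce Mahout's encoders or their hash function; the SHA-256 based hashing, the helper names, and the category levels are all assumptions.

```python
import hashlib

def probe_locations(name, value, size, probes=2):
    """Map a (feature name, value) pair to `probes` slots in a vector of `size`."""
    slots = []
    for probe in range(probes):
        key = f"{name}:{value}:{probe}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        slots.append(int.from_bytes(digest[:8], "big") % size)
    return slots

def count_shared_slots(features, size):
    """Count the slots that receive more than one distinct feature."""
    owners = {}
    for name, value in features:
        for slot in probe_locations(name, value, size):
            owners.setdefault(slot, set()).add((name, value))
    return sum(1 for feats in owners.values() if len(feats) > 1)

# Hypothetical categorical levels, loosely modeled on the columns named above.
features = [("education", v) for v in ("primary", "secondary", "tertiary", "unknown")]
features += [("marital", v) for v in ("single", "married", "divorced")]
features += [("campaign", str(n)) for n in range(1, 21)]

small = count_shared_slots(features, 11)
large = count_shared_slots(features, 1000)
```

With 27 distinct features and only 11 slots, at least one slot must carry several features no matter how good the hash is. With 1000 slots most features can get slots of their own.
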
>>>>
>>>> Second, you are using a continuous value encoder and passing in the
>>>> weight you are using as a string (e.g. variable "pDays"). I am not quite
>>>> sure about the reasons in the Mahout code right now, but the way it is
>>>> implemented now, every unique value should end up in a different location
>>>> because the continuous value is part of the hashing. Try adding the weight
>>>> directly using a static word value encoder: addToVector("pDays",v,pDays)
>>>>
>>>> Last, you are also putting in the variable "campaign" as a continuous
>>>> variable, which should probably be a categorical variable, so just add it
>>>> with a StaticWordValueEncoder.
>>>>
>>>> And finally, and probably most important after looking at your target
>>>> variable: you are using a Dictionary for mapping either yes or no to 0 or
>>>> 1. This is bad. Depending on what comes first in the data set, either a
>>>> positive or negative example might be 0 or 1, totally random. Make a hard
>>>> mapping from the possible values (y/n?) to zero and one, with yes as 1
>>>> and no as 0.
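
The difference between a first-seen dictionary and a hard mapping can be sketched as follows. This is plain Python, not the app's code; it assumes the raw target values are the strings "yes" and "no", and both function names are hypothetical.

```python
TARGET_CODES = {"no": 0, "yes": 1}  # fixed by hand, independent of row order

def encode_target(raw):
    """Return 1 for "yes" and 0 for "no"; fail loudly on anything else."""
    label = raw.strip().strip('"').lower()
    if label not in TARGET_CODES:
        raise ValueError(f"unexpected target value: {raw!r}")
    return TARGET_CODES[label]

def first_seen_codes(labels):
    """Order-dependent coding, shown only to illustrate the problem."""
    codes = {}
    for label in labels:
        codes.setdefault(label, len(codes))
    return codes
```

`first_seen_codes(["no", "yes"])` assigns 1 to "yes", while `first_seen_codes(["yes", "no"])` assigns 0 to it, so the meaning of the positive class depends on which row happens to come first. `encode_target` always gives the same answer.
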
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <frank@frankscholten.nl>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am exploring Mahout's SGD classifier and would like some feedback
>>>>> because I think I didn't properly configure things.
>>>>>
>>>>> I created an example app that trains an SGD classifier on the 'bank
>>>>> marketing' dataset from UCI:
>>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>>>>>
>>>>> My app is at:
>>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>>>>>
>>>>> The app reads a CSV file of telephone calls, encodes the features into a
>>>>> vector and tries to predict whether a customer answers yes to a business
>>>>> proposal.
>>>>>
>>>>> I do a few runs and measure accuracy, but I don't trust the results.
>>>>> When I only use an intercept term as a feature I get around 88% accuracy,
>>>>> and when I add all features it drops to around 85%. Is this perhaps
>>>>> because the dataset is highly unbalanced? Most customers answer no. Or is
>>>>> the classifier biased to predict 0 as the target code when it doesn't
>>>>> have any data to go with?
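
A small sketch of why accuracy alone is hard to interpret on unbalanced data. It is plain Python with made-up labels; the 88/12 class split is an assumption chosen to match the roughly 88% figure quoted above, not a number taken from the dataset.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def recall(labels, predictions, positive=1):
    """Fraction of actual positives that were predicted positive."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
    return hits / sum(1 for y in labels if y == positive)

# Hypothetical labels: 88 "no" (0) and 12 "yes" (1).
labels = [0] * 88 + [1] * 12
always_no = [0] * len(labels)  # what an intercept-only model amounts to here

baseline_accuracy = accuracy(labels, always_no)
baseline_recall = recall(labels, always_no)
```

Under this assumption, always answering "no" scores 0.88 accuracy while finding none of the "yes" customers, so an accuracy near the majority-class share says little about whether the features are helping.
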
>>>>>
>>>>> Any other comments about my code or improvements I can make in the app
>>>>> are welcome! :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
