mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: SGD classifier demo app
Date Tue, 04 Feb 2014 09:31:58 GMT
Would be great to add this as an example to Mahout's codebase.

On 02/04/2014 10:27 AM, Ted Dunning wrote:
> Frank,
>
> I just munched on your code and sent a pull request.
>
> In doing this, I made a bunch of changes.  I hope you like them.
>
> These include massive simplification of the reading and vectorization.
>   This wasn't strictly necessary, but it seemed like a good idea.
>
> More important was the way that I changed the vectorization.  For the
> continuous values, I added log transforms.  The categorical values I
> encoded as they are.  I also increased the feature vector size to 100 to
> avoid excessive collisions.
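>
> Concretely, the encoding now works roughly like this (a sketch, not the
> exact code from the repo; feature and variable names are illustrative):
>
>   import org.apache.mahout.math.RandomAccessSparseVector;
>   import org.apache.mahout.math.Vector;
>   import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>
>   // one encoder per feature, reused across all records
>   ConstantValueEncoder bias = new ConstantValueEncoder("intercept");
>   StaticWordValueEncoder duration = new StaticWordValueEncoder("duration");
>   StaticWordValueEncoder job = new StaticWordValueEncoder("job");
>
>   double durationSecs = 180;            // example values
>   String jobValue = "technician";
>
>   Vector v = new RandomAccessSparseVector(100); // 100 slots to limit collisions
>   bias.addToVector("", 1, v);                   // intercept term
>   duration.addToVector("duration", Math.log1p(durationSecs), v); // log transform
>   job.addToVector(jobValue, v);                 // categorical value, as-is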
>
> In the learning code itself, I got rid of the use of index arrays in favor
> of shuffling the training data itself.  I also tuned the learning
> parameters a lot.
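>
> In code, that part is essentially this (again a sketch; Example is a
> hypothetical holder class, and the learning rates here are illustrative,
> not the actual tuned values from the repo):
>
>   import java.util.Collections;
>   import java.util.List;
>   import java.util.Random;
>   import org.apache.mahout.classifier.sgd.L1;
>   import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
>   import org.apache.mahout.math.Vector;
>
>   // hypothetical holder for one encoded record
>   class Example { int target; Vector features; }
>
>   List<Example> train = readAndEncode();          // however the data is loaded
>   Collections.shuffle(train, new Random(42));     // shuffle the data itself
>
>   OnlineLogisticRegression learner =
>       new OnlineLogisticRegression(2, 100, new L1()) // 2 classes, 100 features
>           .learningRate(0.5)
>           .lambda(1e-4)
>           .alpha(1);
>   for (Example ex : train) {
>     learner.train(ex.target, ex.features);
>   }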
>
> The resulting AUC is just a tiny bit less than 0.9, which is pretty
> close to what I got in R.
>
> For everybody else, see
> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
> for my pull request.
>
>
>
> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>>
>> Johannes,
>>
>> Very good comments.
>>
>> Frank,
>>
>> As a benchmark, I just spent a few minutes building a logistic regression
>> model using R.  For this model, AUC on 10% held-out data is about 0.9.
>>
>> Here is a gist summarizing the results:
>>
>> https://gist.github.com/tdunning/8794734
>>
>>
>>
>>
>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte <
>> johannes.schulte@gmail.com> wrote:
>>
>>> Hi Frank,
>>>
>>> you are using the feature vector encoders, which hash a combination of
>>> feature name and feature value to 2 (default) locations in the vector. The
>>> vector size you configured is 11, which is imo very small compared to the
>>> possible combinations of values you have in your data (education, marital,
>>> campaign). You can do no harm by using a much bigger cardinality (try
>>> 1000).
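>>>
>>> For example (a sketch; only the cardinality matters here):
>>>
>>>   // org.apache.mahout.math.RandomAccessSparseVector
>>>   Vector v = new RandomAccessSparseVector(1000); // instead of 11
>>>   // each encoder still writes to its 2 default probe locations,
>>>   // but collisions are far less likely across 1000 slots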
>>>
>>> Second, you are using a continuous value encoder and passing in the
>>> weight as a string (e.g. the variable "pDays"). I am not quite sure about
>>> the reasons in the Mahout code right now, but the way it is implemented,
>>> every unique value ends up in a different location because the
>>> continuous value is part of the hashing. Try adding the weight directly
>>> using a static word value encoder, addToVector("pDays", pDays, v)
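>>>
>>> i.e. something like this (a sketch; this uses the
>>> addToVector(String, double, Vector) overload, where the middle argument
>>> is the weight):
>>>
>>>   StaticWordValueEncoder pDaysEncoder = new StaticWordValueEncoder("pDays");
>>>   // always hash the same word "pDays"; the numeric value goes in as weight
>>>   pDaysEncoder.addToVector("pDays", pDays, v);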
>>>
>>> Last, you are also putting in the variable "campaign" as a continuous
>>> variable when it should probably be a categorical variable, so it is best
>>> added with a StaticWordValueEncoder.
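>>>
>>> e.g. (sketch):
>>>
>>>   StaticWordValueEncoder campaignEncoder =
>>>       new StaticWordValueEncoder("campaign");
>>>   // each distinct value of "campaign" becomes its own hashed feature
>>>   campaignEncoder.addToVector(String.valueOf(campaign), v);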
>>>
>>> And finally, and probably most important after looking at your target
>>> variable: you are using a Dictionary for mapping either yes or no to 0 or
>>> 1. This is bad. Depending on what comes first in the data set, either a
>>> positive or a negative example might become 0 or 1, totally at random.
>>> Make a hard mapping from the possible values ("yes"/"no") to zero and
>>> one, with yes as the 1 and no as the 0.
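>>>
>>> i.e. something like:
>>>
>>>   // fixed mapping, independent of the order of the records
>>>   int target = "yes".equals(label) ? 1 : 0;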
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten <frank@frankscholten.nl>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am exploring Mahout's SGD classifier and would like some feedback
>>>> because I think I didn't configure things properly.
>>>>
>>>> I created an example app that trains an SGD classifier on the 'bank
>>>> marketing' dataset from UCI:
>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>>>>
>>>> My app is at:
>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>>>>
>>>> The app reads a CSV file of telephone calls, encodes the features into a
>>>> vector and tries to predict whether a customer answers yes to a business
>>>> proposal.
>>>>
>>>> I do a few runs and measure accuracy, but I don't trust the results.
>>>> When I only use an intercept term as a feature I get around 88% accuracy,
>>>> and when I add all features it drops to around 85%. Is this perhaps
>>>> because the dataset is highly unbalanced? Most customers answer no. Or is
>>>> the classifier biased to predict 0 as the target code when it doesn't
>>>> have any data to go on?
>>>>
>>>> Any other comments about my code or improvements I can make in the app
>>>> are welcome! :)
>>>>
>>>> Cheers,
>>>>
>>>> Frank
>>>>
>>>
>>
>>
>

