Subject: Re: SGD classifier demo app
From: Frank Scholten <frank@frankscholten.nl>
To: user@mahout.apache.org
Cc: Sebastian Schelter
Date: Tue, 4 Feb 2014 19:39:27 +0100

Thanks Ted! It would indeed be a nice example to add.

On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning wrote:

> Yes.
>
> On Tue, Feb 4, 2014 at 1:31 AM, Sebastian Schelter wrote:
>
>> It would be great to add this as an example to Mahout's codebase.
>>
>> On 02/04/2014 10:27 AM, Ted Dunning wrote:
>>
>>> Frank,
>>>
>>> I just munched on your code and sent a pull request.
>>>
>>> In doing this, I made a bunch of changes. I hope you like them.
>>>
>>> These include a massive simplification of the reading and
>>> vectorization. This wasn't strictly necessary, but it seemed like a
>>> good idea.
>>>
>>> More important was the way that I changed the vectorization. For the
>>> continuous values, I added log transforms. For the categorical values,
>>> I encoded them as they are. I also increased the feature vector size
>>> to 100 to avoid excessive collisions.
>>>
>>> In the learning code itself, I got rid of the index arrays in favor of
>>> shuffling the training data itself. I also tuned the learning
>>> parameters a lot.
>>>
>>> The resulting AUC is just a tiny bit less than 0.9, which is pretty
>>> close to what I got in R.
>>>
>>> For everybody else, see
>>> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
>>> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
>>> for my pull request.
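Ted's actual changes are in the pull request linked above; the sketch below only illustrates that general style of encoding and training with Mahout's SGD classes. The class name, the CSV fields (age, balance, job, marital), and the learning-parameter values are assumptions for illustration, not the code in the pull request.

import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class BankCallVectorizer {

  // Bigger hashed feature vector: far fewer collisions than a size of 11.
  private static final int FEATURES = 100;

  private final ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");
  private final ConstantValueEncoder ageEnc = new ConstantValueEncoder("age");
  private final ConstantValueEncoder balanceEnc = new ConstantValueEncoder("balance");
  private final StaticWordValueEncoder jobEnc = new StaticWordValueEncoder("job");
  private final StaticWordValueEncoder maritalEnc = new StaticWordValueEncoder("marital");

  // One telephone call -> one hashed feature vector.
  public Vector encode(int age, int balance, String job, String marital) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    intercept.addToVector("", 1, v);                          // bias term
    ageEnc.addToVector("", Math.log1p(age), v);               // log transform for continuous values
    balanceEnc.addToVector("", Math.log1p(Math.max(0, balance)), v);
    jobEnc.addToVector(job, v);                               // categorical values hashed as they are
    maritalEnc.addToVector(marital, v);
    return v;
  }

  // Shuffle the training examples themselves instead of juggling index arrays.
  public OnlineLogisticRegression train(List<Example> examples) {
    Collections.shuffle(examples, new Random(42));
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, FEATURES, new L1())   // binary target, L1 prior
            .learningRate(1)
            .alpha(1)
            .lambda(1.0e-4)
            .stepOffset(1000)
            .decayExponent(0.9);
    for (Example example : examples) {
      learner.train(example.target, example.features);        // target is 0 (no) or 1 (yes)
    }
    return learner;
  }

  public static final class Example {
    final int target;
    final Vector features;
    public Example(int target, Vector features) {
      this.target = target;
      this.features = features;
    }
  }
}

The idea is that each continuous value enters as a log-transformed weight at a fixed hashed location, each categorical value is hashed as-is with weight 1, and the training loop sees the examples in shuffled order.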
>>> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning wrote:
>>>
>>>> Johannes,
>>>>
>>>> Very good comments.
>>>>
>>>> Frank,
>>>>
>>>> As a benchmark, I just spent a few minutes building a logistic
>>>> regression model using R. For this model, the AUC on 10% held-out
>>>> data is about 0.9.
>>>>
>>>> Here is a gist summarizing the results:
>>>>
>>>> https://gist.github.com/tdunning/8794734
>>>>
>>>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte
>>>> <johannes.schulte@gmail.com> wrote:
>>>>
>>>>> Hi Frank,
>>>>>
>>>>> You are using the feature vector encoders, which hash a combination
>>>>> of feature name and feature value to 2 (default) locations in the
>>>>> vector. The vector size you configured is 11, and in my opinion that
>>>>> is very small relative to the possible combinations of values in
>>>>> your data (education, marital, campaign). You can do no harm by
>>>>> using a much bigger cardinality (try 1000).
>>>>>
>>>>> Second, you are using a continuous value encoder and passing in the
>>>>> weight as a string (e.g. the variable "pDays"). I am not quite sure
>>>>> about the reasons in the Mahout code right now, but the way it is
>>>>> implemented, every unique value should end up in a different
>>>>> location because the continuous value is part of the hashing. Try
>>>>> adding the weight directly using a StaticWordValueEncoder:
>>>>> addToVector("pDays", pDays, v).
>>>>>
>>>>> Third, you are also putting in the variable "campaign" as a
>>>>> continuous variable; it should probably be a categorical variable,
>>>>> so just add it with a StaticWordValueEncoder as well.
>>>>>
>>>>> And finally, probably most important after looking at your target
>>>>> variable: you are using a Dictionary for mapping "yes" or "no" to 0
>>>>> or 1. This is bad. Depending on what comes first in the data set,
>>>>> either a positive or a negative example might become 0 or 1, totally
>>>>> at random. Make a hard mapping from the possible values to zero and
>>>>> one, with "yes" as 1 and "no" as 0.
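A minimal sketch of the encoding and target mapping Johannes suggests. The class name, the 1000-dimensional vector, and the literal "yes" label are assumptions for illustration, not code from Frank's app.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class SuggestedEncoding {

  // Much bigger cardinality than the 11 used originally.
  private static final int CARDINALITY = 1000;

  private final StaticWordValueEncoder pDaysEnc = new StaticWordValueEncoder("pDays");
  private final StaticWordValueEncoder campaignEnc = new StaticWordValueEncoder("campaign");

  public Vector encode(double pDays, String campaign) {
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    // Add the continuous value directly as the weight. The hashed locations
    // depend only on the encoder name and the constant word "pDays", not on
    // the value itself, so every record updates the same positions.
    pDaysEnc.addToVector("pDays", pDays, v);
    // Treat campaign as categorical: hash the raw field value with weight 1.
    campaignEnc.addToVector(campaign, v);
    return v;
  }

  // Hard mapping instead of a Dictionary, so 1 always means "yes" no matter
  // which label happens to appear first in the data set.
  public static int target(String label) {
    return "yes".equalsIgnoreCase(label) ? 1 : 0;
  }
}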
>>>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten
>>>>> <frank@frankscholten.nl> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am exploring Mahout's SGD classifier and would like some
>>>>>> feedback, because I think I didn't configure things properly.
>>>>>>
>>>>>> I created an example app that trains an SGD classifier on the
>>>>>> 'bank marketing' dataset from UCI:
>>>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>>>>>>
>>>>>> My app is at:
>>>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>>>>>>
>>>>>> The app reads a CSV file of telephone calls, encodes the features
>>>>>> into a vector, and tries to predict whether a customer answers yes
>>>>>> to a business proposal.
>>>>>>
>>>>>> I do a few runs and measure accuracy, but I don't trust the
>>>>>> results. When I only use an intercept term as a feature I get
>>>>>> around 88% accuracy, and when I add all features it drops to
>>>>>> around 85%. Is this perhaps because the dataset is highly
>>>>>> unbalanced? Most customers answer no. Or is the classifier biased
>>>>>> to predict 0 as the target code when it doesn't have any data to
>>>>>> go on?
>>>>>>
>>>>>> Any other comments about my code or improvements I can make in the
>>>>>> app are welcome! :)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Frank
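Since most customers answer no, an intercept-only model effectively predicts the majority class, so its ~88% accuracy mostly reflects the share of "no" answers rather than any real signal; AUC on held-out data, which Ted reports above, is more informative for an unbalanced target. A minimal sketch using Mahout's Auc collector, assuming a trained binary OnlineLogisticRegression and held-out vectors with 0/1 targets (names are illustrative):

import java.util.List;

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class HeldOutEvaluation {

  // Score held-out records and report AUC instead of raw accuracy.
  public static double auc(OnlineLogisticRegression model,
                           List<Vector> features,
                           List<Integer> targets) {
    Auc collector = new Auc();
    for (int i = 0; i < features.size(); i++) {
      double score = model.classifyScalar(features.get(i));  // probability of class 1 ("yes")
      collector.add(targets.get(i), score);                  // true target is 0 or 1
    }
    return collector.auc();                                  // 0.5 = random ranking, 1.0 = perfect
  }
}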