From: Sebastian Schelter
Reply-To: ssc@apache.org
Date: Tue, 04 Feb 2014 10:31:58 +0100
To: user@mahout.apache.org
Subject: Re: SGD classifier demo app

Would be great to add this as an example to Mahout's codebase.

On 02/04/2014 10:27 AM, Ted Dunning wrote:
> Frank,
>
> I just munched on your code and sent a pull request.
>
> In doing this, I made a bunch of changes. I hope you like them.
>
> These include a massive simplification of the reading and
> vectorization. This wasn't strictly necessary, but it seemed like a
> good idea.
>
> More important was the way that I changed the vectorization. For the
> continuous values, I added log transforms. The categorical values I
> encoded as they are. I also increased the feature vector size to 100
> to avoid excessive collisions.
>
> In the learning code itself, I got rid of the index arrays in favor
> of shuffling the training data itself. I also tuned the learning
> parameters a lot.
>
> The resulting AUC is just a tiny bit less than 0.9, which is pretty
> close to what I got in R.
>
> For everybody else, see
> https://github.com/tdunning/mahout-sgd-bank-marketing for my version and
> https://github.com/tdunning/mahout-sgd-bank-marketing/compare/frankscholten:master...master
> for my pull request.
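To make that concrete, a rough sketch of the vectorization Ted describes
might look like the following. It is not the code from the pull request;
the column names ("job", "age") and class layout are illustrative, and
the encoders are Mahout's org.apache.mahout.vectorizer.encoders classes.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class CallEncoder {
        // 100 slots instead of 11, to avoid excessive collisions
        private static final int CARDINALITY = 100;

        private final ConstantValueEncoder bias =
            new ConstantValueEncoder("intercept");
        private final StaticWordValueEncoder job =
            new StaticWordValueEncoder("job");
        private final ContinuousValueEncoder age =
            new ContinuousValueEncoder("age");

        public Vector encode(String jobValue, double ageValue) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            bias.addToVector("", 1, v);        // intercept term
            job.addToVector(jobValue, v);      // categorical, hashed as-is
            // log transform tames the skew of the continuous value;
            // the encoder parses the string form as the weight
            age.addToVector(Double.toString(Math.log1p(ageValue)), v);
            return v;
        }
    }

The shuffling Ted mentions on the learning side is simply
Collections.shuffle on the list of training examples before each pass,
instead of permuting a separate array of indexes.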
> On Mon, Feb 3, 2014 at 3:57 PM, Ted Dunning wrote:
>
>> Johannes,
>>
>> Very good comments.
>>
>> Frank,
>>
>> As a benchmark, I just spent a few minutes building a logistic
>> regression model using R. For this model, AUC on 10% held-out data
>> is about 0.9.
>>
>> Here is a gist summarizing the results:
>>
>> https://gist.github.com/tdunning/8794734
>>
>> On Mon, Feb 3, 2014 at 2:41 PM, Johannes Schulte
>> <johannes.schulte@gmail.com> wrote:
>>
>>> Hi Frank,
>>>
>>> You are using the feature vector encoders, which hash a combination
>>> of feature name and feature value to 2 (default) locations in the
>>> vector. The vector size you configured is 11, and that is in my
>>> opinion very small relative to the possible combinations of values
>>> in your data (education, marital, campaign). You can do no harm by
>>> using a much bigger cardinality (try 1000).
>>>
>>> Second, you are using a continuous value encoder and passing in the
>>> weight as a string (e.g. the variable "pDays"). I am not quite sure
>>> about the reasons in the Mahout code right now, but the way it is
>>> implemented, every unique value should end up in a different
>>> location, because the continuous value is part of the hashing. Try
>>> adding the weight directly using a StaticWordValueEncoder:
>>> addToVector("pDays", pDays, v)
>>>
>>> Last, you are also putting in the variable "campaign" as a
>>> continuous variable, although it should probably be a categorical
>>> variable, so it can just be added with a StaticWordValueEncoder as
>>> well.
>>>
>>> And finally, probably most important after looking at your target
>>> variable: you are using a Dictionary for mapping yes or no to 0 or
>>> 1. This is bad. Depending on what comes first in the data set,
>>> either a positive or a negative example might become 0 or 1,
>>> totally at random. Make a hard mapping from the possible values
>>> (y/n?) to zero and one, with yes as 1 and no as 0.
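In code, those fixes might look roughly like this. This is a sketch under
the same assumptions as above: "pDays" and "campaign" are columns parsed
from the CSV, and "answer" is the raw target string.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class EncodingFixes {
        // much bigger vector, as suggested above
        private static final int CARDINALITY = 1000;

        private static final StaticWordValueEncoder PDAYS =
            new StaticWordValueEncoder("pDays");
        private static final StaticWordValueEncoder CAMPAIGN =
            new StaticWordValueEncoder("campaign");

        static Vector encode(double pDays, String campaign) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            // stable locations per feature name; the numeric value
            // goes in as the weight instead of being hashed
            PDAYS.addToVector("pDays", pDays, v);
            // campaign treated as categorical: hash the raw value
            CAMPAIGN.addToVector(campaign, v);
            return v;
        }

        // hard target mapping, independent of record order in the data
        static int target(String answer) {
            return "yes".equals(answer) ? 1 : 0;
        }
    }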
>>> On Mon, Feb 3, 2014 at 9:33 PM, Frank Scholten wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am exploring Mahout's SGD classifier and would like some
>>>> feedback, because I think I didn't configure things properly.
>>>>
>>>> I created an example app that trains an SGD classifier on the
>>>> 'bank marketing' dataset from UCI:
>>>> http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
>>>>
>>>> My app is at:
>>>> https://github.com/frankscholten/mahout-sgd-bank-marketing
>>>>
>>>> The app reads a CSV file of telephone calls, encodes the features
>>>> into a vector and tries to predict whether a customer answers yes
>>>> to a business proposal.
>>>>
>>>> I do a few runs and measure accuracy, but I don't trust the
>>>> results. When I only use an intercept term as a feature I get
>>>> around 88% accuracy, and when I add all features it drops to
>>>> around 85%. Is this perhaps because the dataset is highly
>>>> unbalanced? Most customers answer no. Or is the classifier biased
>>>> to predict 0 as the target when it doesn't have any data to go on?
>>>>
>>>> Any other comments about my code or improvements I can make in the
>>>> app are welcome! :)
>>>>
>>>> Cheers,
>>>>
>>>> Frank
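A footnote on Frank's numbers: if roughly 88% of the customers answer
no, a classifier that always predicts 0 already scores about 88%
accuracy, so plain accuracy mostly reflects the class ratio. That is why
the replies above report AUC instead. Below is a sketch of held-out
evaluation with Mahout's Auc collector, assuming a trained
OnlineLogisticRegression and parallel lists of targets and feature
vectors (both names of the lists are illustrative).

    import java.util.List;
    import org.apache.mahout.classifier.evaluation.Auc;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class HeldOutEval {
        static double auc(OnlineLogisticRegression model,
                          List<Integer> targets, List<Vector> examples) {
            Auc collector = new Auc();
            for (int i = 0; i < examples.size(); i++) {
                // classifyScalar returns the score for category 1 ("yes")
                collector.add(targets.get(i),
                              model.classifyScalar(examples.get(i)));
            }
            // ~0.5 for a constant always-predict-no model, regardless
            // of how unbalanced the classes are
            return collector.auc();
        }
    }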