Date: Mon, 3 Feb 2014 21:33:42 +0100
Subject: SGD classifier demo app
From: Frank Scholten
To: user@mahout.apache.org

Hi all,

I am exploring Mahout's SGD classifier and would like some feedback, because I think I didn't configure things properly.

I created an example app that trains an SGD classifier on the 'bank marketing' dataset from UCI:

http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

My app is at:

https://github.com/frankscholten/mahout-sgd-bank-marketing

The app reads a CSV file of telephone calls, encodes the features into a vector and tries to predict whether a customer answers yes to a business proposal.

I do a few runs and measure accuracy, but I don't trust the results.
When I only use an intercept term as a feature I get around 88% accuracy, and when I add all features it drops to around 85%. Is this perhaps because the dataset is highly unbalanced? Most customers answer no. Or is the classifier biased to predict 0 as the target code when it doesn't have any data to go on?

Any other comments about my code or improvements I can make in the app are welcome! :)

Cheers,

Frank
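
The imbalance question can be checked with a small sketch. It is plain Java and does not use Mahout or the app's code. The 880/120 class split is a made-up illustration chosen to match the roughly 88% figure above, not the real dataset counts, and the class and method names are hypothetical. It shows two things: a predictor that always answers "no" reaches 88% accuracy while finding none of the "yes" customers, and an intercept-only logistic model (fitted here with plain batch gradient descent, not Mahout's SGD) settles on the base rate of "yes" answers, which is below 0.5 and so always yields "no".

```java
import java.util.Arrays;

final class ImbalanceBaseline {

    /** Builds labels with the given number of 0s ("no") followed by 1s ("yes"). */
    static int[] syntheticLabels(int negatives, int positives) {
        int[] labels = new int[negatives + positives];
        Arrays.fill(labels, negatives, labels.length, 1);
        return labels;
    }

    /** Fraction of predictions that match the labels. */
    static double accuracy(int[] labels, int[] predictions) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == predictions[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    /** Fraction of actual "yes" examples that were predicted as "yes". */
    static double recall(int[] labels, int[] predictions) {
        int positives = 0;
        int truePositives = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == 1) {
                positives++;
                if (predictions[i] == 1) {
                    truePositives++;
                }
            }
        }
        return positives == 0 ? 0.0 : (double) truePositives / positives;
    }

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * Fits a logistic model that has only an intercept, using batch gradient
     * ascent on the mean log-likelihood. The mean gradient with respect to
     * the intercept is mean(y) - sigmoid(w).
     */
    static double fitInterceptOnly(int[] labels, int iterations, double learningRate) {
        double positiveRate = 0.0;
        for (int y : labels) {
            positiveRate += y;
        }
        positiveRate /= labels.length;

        double w = 0.0;
        for (int i = 0; i < iterations; i++) {
            w += learningRate * (positiveRate - sigmoid(w));
        }
        return w;
    }

    public static void main(String[] args) {
        // Assumed split for illustration only: 88% "no", 12% "yes".
        int[] labels = syntheticLabels(880, 120);
        int[] alwaysNo = new int[labels.length]; // every prediction is 0

        System.out.println("always-no accuracy: " + accuracy(labels, alwaysNo));
        System.out.println("always-no recall:   " + recall(labels, alwaysNo));

        double w = fitInterceptOnly(labels, 1000, 1.0);
        System.out.println("intercept-only P(yes): " + sigmoid(w));
    }
}
```

If this reasoning carries over to the real data, accuracy alone cannot separate a useful model from the always-"no" baseline, so a per-class measure such as recall or a confusion matrix is probably a better yardstick for comparing runs.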