From: "Philippe Lamarche"
To: mahout-user@lucene.apache.org
Date: Sat, 19 Jul 2008 21:13:23 -0400
Subject: Problems with the Bayesian classifiers.

Hi,

I have been working for a little while with Mahout and the Bayesian classifier for a school project. I am using the Enron email corpus and the UC Berkeley classified emails (http://www.cs.cmu.edu/~enron/).

I ran a few tests and I can't seem to make it work. I wonder if I am doing something wrong. For example, I am getting under 10% correct predictions with Bayes and around 1% with CBayes.

The problem seems to be that all instances of a class are predicted as one other class, or that they are all predicted as the class containing the most features.

I also tested with the 20News corpus and I get similar results, where all instances of a class are predicted as another class (e.g. all 421 "rec.motorcycles" instances get predicted as "talk.politics.mideast").

Attached are two confusion matrices showing the results for Bayes and CBayes. Both used the same split into training and testing sets.

Am I doing something wrong?

Thanks,

Philippe Lamarche.
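
For anyone who wants to check for the same symptom: below is a minimal Python sketch, not Mahout code, of how "every instance of one class is predicted as a single other class" could be detected from two parallel lists of actual and predicted labels. The function names `confusion_counts` and `summarize` are hypothetical, and the only figure taken from this message is the 421 "rec.motorcycles" instances; no other results are reproduced here.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual_label, predicted_label) pair occurs."""
    return Counter(zip(actual, predicted))


def summarize(actual, predicted):
    """Return overall accuracy and the classes that collapse onto another class.

    A class is reported as collapsed when every one of its instances
    was predicted as the same, different label.
    """
    counts = confusion_counts(actual, predicted)
    totals = Counter(actual)

    # Diagonal of the confusion matrix: actual label equals predicted label.
    correct = sum(n for (a, p), n in counts.items() if a == p)
    accuracy = correct / len(actual)

    collapsed = {}
    for label, total in totals.items():
        # One row of the confusion matrix: predictions for this actual label.
        row = {p: n for (a, p), n in counts.items() if a == label}
        top, top_n = max(row.items(), key=lambda kv: kv[1])
        if top != label and top_n == total:
            collapsed[label] = top
    return accuracy, collapsed


# Illustration only: the single case quoted in this message.
actual = ["rec.motorcycles"] * 421
predicted = ["talk.politics.mideast"] * 421
accuracy, collapsed = summarize(actual, predicted)
print(accuracy, collapsed)
```

If a class shows up in `collapsed`, its row in the confusion matrix has a single non-zero cell off the diagonal, which is the pattern described above.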