From: Patrick Collins
To: user@mahout.apache.org
Date: Mon, 27 Jun 2011 17:51:13 -0700
Message-ID: <4E092581.9040903@ready2sign.com>
Subject: Fuzzy logic and Heuristics vs Classification

Has anyone got any advice on how to combine heuristics and classification?

While preparing my data and building the features to feed into my classification model, I keep noticing patterns of text which I know imply a certain outcome with 99.99% probability. How would you construct the data/features to pre-classify this data, so that the classifier is much more likely to come to the "correct" conclusion?

For example, I remember seeing an anti-spam system which used a combination of fuzzy logic and then classification to get a better outcome (but the author did not detail how it was actually implemented). He used a whole range of heuristics to determine that a certain sender is a known spammer, rather than just blindly passing this data in to the classifier.

In my dataset I have a LOT of patterns like this that I can identify and then use to determine the outcome with very high probability. I say high probability, but I cannot say absolutely.

Ideally, if I could precompute a lot of this data using heuristics, I could feed that information in to the classifier and greatly reduce the number of features. But the classifiers do not let me assign a "weight" to a certain feature.

Other than "well, just try it and see what works", how do people deal with this problem? Do they just leave it to the classifier and hope that it picks up the same patterns?

I'm a bit new to Mahout and classification algorithms, so I am just trying to get some input on how others see this problem and whether I'm barking up the wrong tree.

Patrick.
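To make the question concrete, here is a minimal sketch of two common ways the heuristics could be wired in. It is an illustration under assumptions, not something taken from Mahout or from the anti-spam system mentioned above: the rule names, the patterns, and the helpers `extract_features` and `classify` are all hypothetical. The first idea is to emit one synthetic token per heuristic that fires, so the classifier can learn its own weight for it. The second is a cascade in which selected rules decide the label outright and the model handles everything else.

```python
import re

# Hypothetical heuristics, purely for illustration: each name maps to a
# predicate over the raw message text. Names are upper-case so they cannot
# collide with the lower-cased word tokens produced below.
HEURISTICS = {
    "H_KNOWN_BAD_SENDER": lambda text: "from: spammer@example.com" in text.lower(),
    "H_ALL_CAPS_SUBJECT": lambda text: re.search(
        r"^Subject: [^a-z\n]*[A-Z][^a-z\n]*$", text, re.M
    ) is not None,
}


def extract_features(text):
    """Return word tokens plus one synthetic token per heuristic that fires.

    The classifier then sees each heuristic as an ordinary feature and can
    learn how much to trust it from the training data.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    fired = [name for name, rule in HEURISTICS.items() if rule(text)]
    return tokens + fired


def classify(text, model, overrides):
    """Cascade: rules listed in `overrides` decide first, else ask the model.

    `model` is any callable taking a feature list and returning a label
    (an assumption of this sketch); `overrides` maps rule name -> label.
    """
    for name, label in overrides.items():
        if HEURISTICS[name](text):
            return label
    return model(extract_features(text))
```

With the first approach no explicit feature weight is needed, since a heuristic that is almost always right should end up with a large learned weight. The cascade suits rules that are trusted enough to bypass the model, at the cost of hard-coding their occasional mistakes.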