From: Ted Dunning
Date: Tue, 6 Sep 2011 14:25:36 -0700
Subject: Re: Email and Collab. Filtering
To: user@mahout.apache.org

My rationale for being such a binary bigot is that, in my experience, one signal always dominates pretty much completely. Other signals are pretty much just noise (too little engagement), are subject to spammy misdirection (bad titles on videos, for instance), or are too rare to give any significant lift (user ratings versus views/engagements).

In cases where the alternative signal is more voluminous than the engagement that I am interested in, it is invariably very noisy. This is guaranteed, since otherwise I would have used the higher-volume signal as my primary one. In every case I have tried, using the high-volume, high-noise signal degraded performance significantly because it made it hard to find the clean signal.

The low-volume signals have never led to any gain and often were strange enough that they hurt things badly. Besides, they typically are much less than 10% of the data.

Aside from the general data quality and availability issues, there are the computational issues. Having binary data allows me to use much faster and cooler algorithms like LLR (the log-likelihood ratio test).

The upshot is that I don't see anything but downside for including rating or synthetic rating data.

I should add, of course, before lightning strikes, that your mileage may vary.

On Tue, Sep 6, 2011 at 12:56 PM, Grant Ingersoll wrote:

> Ted,
>
> Been meaning to follow up on this...
>
> On Aug 22, 2011, at 11:29 AM, Ted Dunning wrote:
>
> > On Mon, Aug 22, 2011 at 8:21 AM, Daniel Xiaodan Zhou <danithaca@gmail.com> wrote:
> >
> >> I think this is reasonable. Some suggestions:
> >>
> >> 1. Instead of using the total number of interactions as cell value, map the
> >> number to a 1-5 score based on histogram
> >>
> >
> > I would map to {0,1} rather than a fake rating scale.
>
> What's your reasoning for this, versus, something like number of replies?
> My somewhat naive intuition thought that I would want to somehow capture
> the fact that a particular user has interacted more frequently with an item
> vs. simply a boolean preference. Or, is it just that in the big scheme of
> things, it won't matter much, so why complicate it?
>
> Thanks,
> Grant
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
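
For readers who want to see the {0,1} mapping and LLR scoring discussed in this thread in concrete form, here is a minimal sketch in Python. It collapses interaction counts to a boolean "touched / not touched" signal and scores an item pair with the log-likelihood ratio (G^2) statistic for a 2x2 contingency table. The function names (`binarize`, `item_pair_llr`) and the sample counts are hypothetical and chosen only for illustration; this is not code from Mahout or from anyone on the thread.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: N*ln(N) - sum(k*ln(k)), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def binarize(counts_by_user):
    """Map per-user interaction counts to {0,1}: keep the set of items touched."""
    return {user: {item for item, n in items.items() if n > 0}
            for user, items in counts_by_user.items()}

def item_pair_llr(items_by_user, a, b):
    """LLR score for items a and b appearing in the same users' histories."""
    k11 = k12 = k21 = k22 = 0
    for items in items_by_user.values():
        has_a, has_b = a in items, b in items
        if has_a and has_b:
            k11 += 1      # users who touched both a and b
        elif has_a:
            k12 += 1      # a only
        elif has_b:
            k21 += 1      # b only
        else:
            k22 += 1      # neither
    return llr(k11, k12, k21, k22)

# Hypothetical interaction counts (user -> thread -> number of replies).
counts = {
    "u1": {"threadA": 7, "threadB": 1},
    "u2": {"threadA": 1, "threadB": 3},
    "u3": {"threadC": 2},
    "u4": {"threadC": 5, "threadA": 0},
}
seen = binarize(counts)
print(item_pair_llr(seen, "threadA", "threadB"))
```

A larger score means stronger evidence that the two items co-occur more (or less) often than independence would predict. Note that the reply counts themselves are discarded after `binarize`, which is exactly the trade-off being debated above.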