From: Evgeny Karataev
Date: Mon, 26 Nov 2012 16:51:17 -0500
Subject: Re: Recommender's formula
To: user

Thank you Sean and Paulo.

Paulo, I guess what I meant in my original email is what you said in
your last email (about rating normalization). So that part is not done.

I've looked at the code
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230
and the formula looks almost exactly like formula 4.12 in "A
Comprehensive Survey of Neighborhood-based Recommendation Methods"
(http://www.springerlink.com/content/n3jq77686228781n/). The difference,
however, is that you divide the weighted preference by totalSimilarity:

    ...
    // Weights can be negative!
    preference += theSimilarity * preferencesFromUser.getValue(i);
    totalSimilarity += theSimilarity;
    ...
    float estimate = (float) (preference / totalSimilarity);
    ...

In contrast, in other papers the denominator is the sum of the absolute
values of the similarities.

If I am not mistaken, and as the comment in the code states, the weights
(similarities) can be negative, so they might actually sum to 0. Then
you would divide preference by 0. What would the estimate be in that
case?

On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas wrote:

> > What do you mean here? You never need to actually subtract the mean
> > from the data.
> > The similarity metric's math is just adjusted to work
> > as if it were. So no, there is no idea of adding back a mean. I don't
> > think there's something not implemented.
>
> No, not about the similarity metric. As I said, the computation of the
> similarity metric *is* centred (or can be, the code has that option).
>
> But once you have similarities computed, then you go on and use them to
> predict the rating for unknown items. It is in this rating prediction
> that mean centering (or, to be more general, rating normalization) is
> not done and could be done.
>
> The papers mentioned in the original post explain it. I just searched
> around and found another one that also mentions it:
>
> "An Empirical Analysis of Design Choices in Neighborhood-Based
> Collaborative Filtering Algorithms"
>
> (googling it will give you a PDF right away). The rating prediction is
> Equation 1, and there you can see what I mean by mean centering in the
> prediction.
>
> Basically, you use the similarities you have already computed as weights
> for the averaging sum that creates the prediction, but those weights do
> not multiply the bare ratings for the other items, but their deviation
> from each user's average rating (equation 1 is for user-based).
>
> The rationale is that each user's scale is different, and tends to
> cluster ratings around a different mean. By subtracting that mean, we
> get into the equation only the user's perceived difference between that
> item and her average opinion, and factor out the user's mean opinion
> (which would introduce some bias). Then we add back to the result the
> average rating of the target user, which restores the normal range for
> the prediction, but this time using the target user's own bias. This
> helps to achieve predictions more in line with the target user's own
> scale.
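(Inline note on the paragraph above.) Here is a minimal sketch of the
mean-centered, user-based prediction as I understand it from that
description. It is not Mahout code: the class MeanCenteredSketch, the
method predict and all of its parameters are hypothetical names made up
for illustration. Dividing by the sum of absolute similarities and
falling back to the target user's mean when that sum is 0 are my own
assumptions, not something taken from the paper's Equation 1.

```java
class MeanCenteredSketch {

    /**
     * Hypothetical helper (not a Mahout API).
     *
     * @param targetMean      average rating of the target user
     * @param neighborRatings rating each neighbor gave to the item
     * @param neighborMeans   average rating of each neighbor
     * @param similarities    similarity of each neighbor to the target user
     *                        (may be negative)
     */
    static double predict(double targetMean,
                          double[] neighborRatings,
                          double[] neighborMeans,
                          double[] similarities) {
        double weightedDeviation = 0.0;
        double totalAbsSimilarity = 0.0;
        for (int k = 0; k < similarities.length; k++) {
            double w = similarities[k];
            if (Double.isNaN(w)) {
                continue; // skip neighbors with no usable similarity
            }
            // Weight the neighbor's deviation from her own mean,
            // not the bare rating.
            weightedDeviation += w * (neighborRatings[k] - neighborMeans[k]);
            totalAbsSimilarity += Math.abs(w);
        }
        if (totalAbsSimilarity == 0.0) {
            // Assumption: with no usable weights, return the user's own mean.
            return targetMean;
        }
        // Add the target user's mean back to restore the rating range.
        return targetMean + weightedDeviation / totalAbsSimilarity;
    }

    public static void main(String[] args) {
        double estimate = predict(
                3.0,                        // target user's mean
                new double[] {5.0, 2.0},    // neighbors' ratings for the item
                new double[] {4.0, 3.0},    // neighbors' means
                new double[] {0.75, 0.25}); // similarities
        System.out.println(estimate);       // prints 3.5
    }
}
```

In the example, the first neighbor rated the item one point above her
mean and the second one point below his, so the weighted deviation is
0.75 - 0.25 = 0.5 and the estimate is the target user's mean plus 0.5.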
>
> The same paper explains it later on (more eloquently than me :-) in
> section 7.1, in the more general context of rating normalization
> (proposing also z-score as a more elaborate choice, and evaluating
> results).
>
> Paulo
>
>
> On 26/11/12 21:51, Sean Owen wrote:
>
>>
>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas wrote:
>>
>>> The thing is, in an Item- or User- based neighborhood recommender,
>>> there's more than one thing that can be centered :-)
>>>
>>> What those papers talk about (from memory, it's been a while since I
>>> last read them, and I don't have them at hand now) is about centering of
>>> the preference around the user's (or item's) average before entering it
>>> in the neighborhood formula. And then moving them back to its usual
>>> range by adding back the average preference (this time for the target
>>> item or user).
>>>
>>> This is something that the code in Mahout does not currently do. You can
>>> check for yourself, the formula is pretty straightforward:
>>>
>>

-- 
Best Regards,
Evgeny Karataev
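
P.S. On my own question about a zero denominator: in plain Java,
dividing a double by 0.0 does not throw. It yields Infinity, -Infinity
or NaN. The sketch below is not Mahout code; it only replays the
arithmetic of the quoted lines next to the absolute-value variant from
the other papers. The class WeightedAverageSketch, both method names and
the sample numbers are hypothetical, and returning NaN when all weights
are 0 is my own assumption.

```java
class WeightedAverageSketch {

    // Same arithmetic as the quoted snippet: signed sum in the denominator.
    static float signedEstimate(double[] similarities, double[] ratings) {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            preference += similarities[i] * ratings[i];
            totalSimilarity += similarities[i];
        }
        // No exception here even if totalSimilarity is 0.0.
        return (float) (preference / totalSimilarity);
    }

    // Variant: sum of absolute similarities in the denominator.
    static float absoluteEstimate(double[] similarities, double[] ratings) {
        double preference = 0.0;
        double totalAbsSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            preference += similarities[i] * ratings[i];
            totalAbsSimilarity += Math.abs(similarities[i]);
        }
        if (totalAbsSimilarity == 0.0) {
            return Float.NaN; // assumption: no usable weights, no estimate
        }
        return (float) (preference / totalAbsSimilarity);
    }

    public static void main(String[] args) {
        double[] similarities = {0.5, -0.5}; // weights cancel out
        double[] ratings = {4.0, 2.0};
        System.out.println(signedEstimate(similarities, ratings));   // Infinity
        System.out.println(absoluteEstimate(similarities, ratings)); // 1.0
    }
}
```

With these numbers the weighted preference is 2.0 - 1.0 = 1.0. The
signed denominator is 0.0, so the first estimate is Infinity (it would
be NaN if the numerator were also 0.0), while the absolute denominator
is 1.0 and the second estimate stays finite.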