Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (athena.apache.org: domain of karataev.evgeny@gmail.com
 designates 209.85.219.42 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <50B3DFE7.2030107@tid.es>
References: 
 <CAL=3XKpE4p2ep4ijL7jUiTWyWg5Azmo28V1mtp-TJv29Z+yiEQ@mail.gmail.com>
 <50B3CEF8.4050302@tid.es>
 <CAEccTyxsfstMGi2Z0iKWG_0oKAaHVTXaMdf-kwtW5HMHk9zPsg@mail.gmail.com>
 <50B3DFE7.2030107@tid.es>
From: Evgeny Karataev <karataev.evgeny@gmail.com>
Date: Mon, 26 Nov 2012 16:51:17 -0500
Message-ID: 
 <CAL=3XKqp9PMhkpqXHNHQ6xLS+K47fSa7W2u4YxO3f3uK1N15qg@mail.gmail.com>
Subject: Re: Recommender's formula
To: user <user@mahout.apache.org>
Content-Type: multipart/alternative; boundary=e89a8fb1ef4a1448f104cf6cf052

--e89a8fb1ef4a1448f104cf6cf052
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Thank you Sean and Paulo.

Paulo, I guess in my original email I meant what you said in your last
email (about rating normalization). So that part is not done.
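
To make sure I understand the mean-centered prediction from that email, here is a rough sketch of the user-based case. It is not Mahout code: the class name, method name and parameters are all hypothetical, and dividing by the sum of absolute similarities is my own assumption.

```java
// Hypothetical sketch of a mean-centered, user-based rating prediction.
// Not taken from Mahout; all names here are made up for illustration.
final class MeanCenteredSketch {

    // similarities[i]   : similarity between the target user and neighbor i
    // neighborRatings[i]: neighbor i's rating for the item being predicted
    // neighborMeans[i]  : neighbor i's average rating over all their items
    static float predict(float targetUserMean,
                         double[] similarities,
                         float[] neighborRatings,
                         float[] neighborMeans) {
        double weightedDeviation = 0.0;
        double totalAbsSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            // Weight the deviation from the neighbor's own mean, not the bare rating.
            weightedDeviation += similarities[i] * (neighborRatings[i] - neighborMeans[i]);
            // Assumption: normalize by the sum of absolute similarities.
            totalAbsSimilarity += Math.abs(similarities[i]);
        }
        if (totalAbsSimilarity == 0.0) {
            return Float.NaN; // no usable neighbors, so no estimate
        }
        // Add the target user's mean back to restore the usual rating range.
        return (float) (targetUserMean + weightedDeviation / totalAbsSimilarity);
    }
}
```

For example, with a target mean of 3, two neighbors of similarity 1.0 who rated the item 5 and 3, and whose own means are 4 and 3, the deviations are 1 and 0, so the prediction is 3 + 1/2 = 3.5.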

I've looked at the code
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230

and the formula looks almost exactly like formula 4.12 in "A Comprehensive
Survey of Neighborhood-based Recommendation Methods" (
http://www.springerlink.com/content/n3jq77686228781n/). The
difference is that you divide the weighted preference by totalSimilarity:
    ...
    // Weights can be negative!
    preference += theSimilarity * preferencesFromUser.getValue(i);
    totalSimilarity += theSimilarity;
    ...
    float estimate = (float) (preference / totalSimilarity);
    ...

In contrast, in other papers the denominator is the sum of the absolute
values of the similarities.

If I am not mistaken and as the comment in the code states, weights
(similarities) could be negative. And actually they might sum up to 0.
Then you would divide preference by 0. What would be the estimate in
that case?
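
To illustrate the question, here is a minimal standalone sketch of the two denominators. It is not the Mahout implementation: the class and method names are hypothetical, and returning NaN when the absolute sum is zero is my own assumption.

```java
// Hypothetical sketch comparing two denominators for a weighted estimate.
// Not the Mahout implementation; names are made up for illustration.
final class WeightedEstimateSketch {

    // Denominator is the plain (signed) sum of similarities,
    // mirroring the snippet quoted above.
    static float estimateSignedSum(double[] similarities, float[] ratings) {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            preference += similarities[i] * ratings[i];
            totalSimilarity += similarities[i];
        }
        // Floating-point division by zero does not throw in Java;
        // it yields Infinity, -Infinity or NaN.
        return (float) (preference / totalSimilarity);
    }

    // Denominator is the sum of absolute similarities, as in the papers.
    static float estimateAbsoluteSum(double[] similarities, float[] ratings) {
        double preference = 0.0;
        double totalAbsSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            preference += similarities[i] * ratings[i];
            totalAbsSimilarity += Math.abs(similarities[i]);
        }
        if (totalAbsSimilarity == 0.0) {
            return Float.NaN; // assumption: no estimate when all weights are zero
        }
        return (float) (preference / totalAbsSimilarity);
    }

    public static void main(String[] args) {
        double[] similarities = {0.5, -0.5}; // weights that sum to exactly 0
        float[] ratings = {4.0f, 2.0f};
        System.out.println(estimateSignedSum(similarities, ratings));   // Infinity
        System.out.println(estimateAbsoluteSum(similarities, ratings)); // 1.0
    }
}
```

With similarities 0.5 and -0.5 the signed sum is exactly 0, so the first estimate is 1.0 / 0.0 = Infinity. The absolute-sum version returns 1.0, which is finite but still falls outside the 2 to 4 range of the input ratings because of the negative weight.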


On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <paulo@tid.es> wrote:

> > What do you mean here? You never need to actually subtract the mean
> > from the data. The similarity metric's math is just adjusted to work
> > as if it were. So no there is no idea of adding back a mean. I don't
> > think there's something not implemented.
>
> No, not about the similarity metric, as I said, the computation of the
> similarity metric *is* centred (or can be, the code has that option).
>
> But once you have similarities computed, then you go on and use them to
> predict the rating for unknown items. It's in this rating prediction
> that mean centering (or, to be more general, rating normalization) is
> not done and could be done.
>
> The papers mentioned in the original post explain it, I just searched
> around and found another one that also mentions it:
>
> "An Empirical Analysis of Design Choices in Neighborhood-Based
> Collaborative Filtering Algorithms"
>
> (googling it will give you a PDF right away). The rating prediction is
> Equation 1, and there you can see what I mean by mean centering in the
> prediction.
>
> Basically, you use the similarities you have already computed as weights
> for the averaging sum that creates the prediction, but those weights do
> not multiply the bare ratings for the other items, but their deviation
> from each user's average rating (equation 1 is for user-based).
>
> The rationale is that each user's scale is different, and tends to
> cluster ratings around a different mean. By subtracting that mean, we
> get into the equation only the user's perceived difference between that
> item and her average opinion, and factor out the user's mean opinion
> (which would introduce some bias). Then we add back to the result the
> average rating of the target user, which restores the normal range for
> the prediction, but this time using the target user's own bias. This
> helps to achieve predictions more in line with the target user's own scale.
>
> The same paper explains it later on (more eloquently than me :-) in
> section 7.1, in the more general context of rating normalization
> (proposing also z-score as a more elaborate choice, and evaluating
> results).
>
> Paulo
>
>
> On 26/11/12 21:51, Sean Owen wrote:
>
>>
>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <paulo@tid.es> wrote:
>>
>>> The thing is, in an Item- or User- based neighborhood recommender,
>>> there's more than one thing that can be centered :-)
>>>
>>> What those papers talk about (from memory, it's been a while since I
> >>> last read them, and I don't have them at hand now) is about centering of
>>> the preference around the user's (or item's) average before entering it
>>> in the neighborhood formula. And then moving them back to its usual
>>> range by adding back the average preference (this time for the target
>>> item or user).
>>>
> >>> This is something that the code in Mahout does not currently do. You can
>>> check for yourself, the formula is pretty straightforward:
>>>
>>
>


-- 
Best Regards,
Evgeny Karataev

--e89a8fb1ef4a1448f104cf6cf052--