# mahout-user mailing list archives

From Amit Nithian <anith...@gmail.com>
Subject Re: Question about Pearson Correlation in non-Taste mode
Date Sat, 30 Nov 2013 06:16:52 GMT
```
Hi Ted,

Thanks for your response. I thought that the mean of a sparse vector is
simply the mean of the "defined" elements? Why would the vectors become
dense unless you're meaning that all the undefined elements (0?) now will
be (0-m_x)?

Looking at the following example:
X = [5 - 4] and Y= [4 5 2].

Is m_x 4.5 or 3? Is m_y 11/3 or 3 (6/2), because we ignore the "5" since its
counterpart in X is undefined?

Thanks again
Amit

On Fri, Nov 29, 2013 at 9:58 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Well, the best way to compute correlation using sparse vectors is to make
> sure you keep them sparse.  To do that, you must avoid subtracting the mean
> by expanding whatever formulae you are using.  For instance, if you are
> computing
>
>     (x - m_x) . (y - m_y)
>
> (here . means dot product)
>
> If you do this directly, then you lose all benefit of sparse vectors since
> subtracting the means makes each vector dense.
>
> What you should compute instead is this alternative form
>
>    x . y - m_x e . y - m_y e . x + n m_x m_y
>
> (here e represents a vector full of 1's and n is its length, so e . e = n)
>
> The dot product here is sparse and the expression m_x e . y can be computed
> (at least in Mahout) in map-reduce idiom as
>
>     y.aggregate(Functions.PLUS, Functions.mult(m_x))
>
>
>
>
> On Fri, Nov 29, 2013 at 9:31 PM, Amit Nithian <anithian@gmail.com> wrote:
>
> > Okay so I rethought my question and realized that the paper never really
> > talked about collaborative filtering but just how to calculate item-item
> > similarity in a scalable fashion. Perhaps this is the reason for why the
> > common ratings aren't used? Because that's not a pre-req for this
> > calculation?
> >
> > Although for my own clarity, I'd still like to get a better understanding
> > of what it means to calculate the correlation between sparse vectors where
> > you're normalizing each vector using a separate denominator.
> >
> > P.S. If my question(s) don't make sense please let me know for it's very
> > possible I am completely misunderstanding something :-).
> >
> > Thanks again!
> > Amit
> >
> >
> > On Wed, Nov 27, 2013 at 8:23 AM, Amit Nithian <anithian@gmail.com> wrote:
> >
> > > Hey Sebastian,
> > >
> > > Thanks again. Actually I'm glad that I am talking to you as it's your
> > > paper and presentation I have questions with! :-)
> > >
> > > So to clarify my question further, looking at this presentation (
> > > http://isabel-drost.de/hadoop/slides/collabMahout.pdf) you have the
> > > following user x item matrix:
> > >     M   A   I
> > > A   5   1   4
> > > B   -   2   5
> > > P   4   3   2
> > >
> > > If I want to calculate the pearson correlation between Matrix and
> > > Inception, I'd have the rating vectors:
> > > [5 - 4] vs [4 5 2].
> > >
> > > One of the steps in your paper is the normalization step, which subtracts
> > > the mean item rating from each value and essentially takes the L2 norm of
> > > the resulting vector (or, in other words, the L2 norm of the mean-centered
> > > vector?)
> > >
> > > The question I have had is what is the average rating for Matrix and
> > > Inception? I can see the following:
> > > 1. Matrix - 4.5 (9/2), Inception - 3 (6/2) because you only consider
> > >    shared ratings
> > > 2. Matrix - 3 (9/3), Inception - 3.667 (11/3) assuming that the missing
> > >    rating is 0
> > > 3. Matrix - 4.5 (9/2), Inception - 3.667 (11/3) subtract from the average
> > >    of all non-zero ratings ==> This is what I believe the current
> > >    implementation does.
> > >
> > > Unfortunately, none of these yields the 0.47 listed in the presentation,
> > > but that's a separate issue. In my testing, I see that Mahout Taste
> > > (non-distributed) uses the 1st approach while the distributed approach
> > > uses the 3rd approach.
> > >
> > > I am okay with #3; however I just want to understand that this is the case
> > > and that it's okay. This is why I was asking about Pearson correlation
> > > between vectors of "different" lengths because the average rating is being
> > > computed using a denominator (number of users) that is different between
> > > the two (2 vs 3).
> > >
> > > I know you said in practice that people don't use Pearson to compute
> > > inferred ratings but this is just for my complete understanding (and since
> > > it's the example used in your presentation). This same question applies to
> > > cosine as you are doing an L2-Norm of the vector as a pre-processing step
> > > and including/excluding non-shared ratings may make a difference.
> > >
> > > Thanks again!
> > > Amit
> > >
> > >
> > > On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter <
> > >
> > >> Hi Amit,
> > >>
> > >> Yes, it gives different results. However in practice, most people don't
> > >> do rating prediction with Pearson coefficient, but use count-based
> > >> measures like the loglikelihood ratio test.
> > >>
> > >> The distributed code doesn't look at vectors of different lengths, but
> > >> simply assumes non-existent ratings as zero.
> > >>
> > >> --sebastian
> > >>
> > >> On 27.11.2013 16:09, Amit Nithian wrote:
> > >> > Comparing this against the non distributed (taste) gives different
> > >> > results for item item similarity as of course the non distributed looks
> > >> > only at corated items. I was more wondering if this difference in
> > >> > practice mattered or not.
> > >> >
> > >> > Also I'm confused on how you can compute the Pearson similarity between
> > >> > two vectors of different length which essentially is going on here I
> > >> > think?
> > >> >
> > >> > Thanks again
> > >> > Amit
> > >> > On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <
> > >> > wrote:
> > >> >
> > >> >> Yes, it is due to the parallel algorithm which only looks at co-ratings
> > >> >> from a given user.
> > >> >>
> > >> >>
> > >> >> On 27.11.2013 15:02, Amit Nithian wrote:
> > >> >>> Thanks Sebastian! Is there a particular reason for that?
> > >> >>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <
> > >> >>> wrote:
> > >> >>>
> > >> >>>> Hi Amit,
> > >> >>>>
> > >> >>>> You are right, the non-corated items are not filtered out in the
> > >> >>>> distributed implementation.
> > >> >>>>
> > >> >>>> --sebastian
> > >> >>>>
> > >> >>>>
> > >> >>>> On 26.11.2013 20:51, Amit Nithian wrote:
> > >> >>>>> Hi all,
> > >> >>>>>
> > >> >>>>> Apologies if this is a repeat question as I just joined the list
> > >> >>>>> but I have a question about the way that metrics like Cosine and
> > >> >>>>> Pearson are calculated in Hadoop "mode" (i.e. non Taste).
> > >> >>>>>
> > >> >>>>> As far as I understand, the vectors used for computing pairwise
> > >> >>>>> item similarity in Taste are based on the co-rated items; however,
> > >> >>>>> in the implementation, I don't see this done.
> > >> >>>>>
> > >> >>>>> The implementation of the distributed item-item similarity comes
> > >> >>>>> from this paper. I didn't see anything in this paper about
> > >> >>>>> filtering out those elements from the vectors not co-rated and this
> > >> >>>>> can make a difference especially when you normalize the ratings by
> > >> >>>>> dividing by the average item rating. In some cases, the # users to
> > >> >>>>> divide by can be fewer depending on the sparseness of the vector.
> > >> >>>>>
> > >> >>>>> Any clarity on this would be helpful.
> > >> >>>>>
> > >> >>>>> Thanks!
> > >> >>>>> Amit
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >>
> > >
> >
>

```
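The three candidate conventions for the Matrix/Inception example, and the sparse expansion described in the quoted reply, can be checked numerically. The sketch below is plain Python, not Mahout code. All function names are made up for illustration, and the third variant follows the description of option 3 in the thread (assuming missing entries stay at 0 after centering) rather than a verified reading of the Mahout source. Note that expanding (x - m_x e) . (y - m_y e) leaves a final term m_x m_y (e . e) = n m_x m_y, where n is the vector length.

```python
from math import sqrt

# Item rating vectors over users A, B, P from the thread.
# None marks a missing rating.
matrix = [5, None, 4]
inception = [4, 5, 2]


def cosine(cx, cy):
    """Cosine of two vectors that have already been mean-centered."""
    dot = sum(a * b for a, b in zip(cx, cy))
    return dot / (sqrt(sum(a * a for a in cx)) * sqrt(sum(b * b for b in cy)))


def pearson_corated(x, y):
    """Option 1: keep only positions where both ratings are defined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    return cosine([a - mx for a, _ in pairs], [b - my for _, b in pairs])


def pearson_missing_as_zero(x, y):
    """Option 2: treat missing ratings as 0 and average over all positions."""
    fx = [0.0 if a is None else a for a in x]
    fy = [0.0 if b is None else b for b in y]
    mx, my = sum(fx) / len(fx), sum(fy) / len(fy)
    return cosine([a - mx for a in fx], [b - my for b in fy])


def pearson_mean_of_defined(x, y):
    """Option 3 (assumed reading): center each vector by the mean of its own
    defined ratings; missing entries stay at 0 after centering."""
    def center(v):
        defined = [a for a in v if a is not None]
        m = sum(defined) / len(defined)
        return [0.0 if a is None else a - m for a in v]
    return cosine(center(x), center(y))


def sparse_centered_dot(x, y, n):
    """(x - m_x e) . (y - m_y e) computed without densifying.

    x and y are dicts of index -> value holding only the non-zero entries;
    the means are taken over all n positions (the option 2 convention).
    """
    sum_x, sum_y = sum(x.values()), sum(y.values())
    mx, my = sum_x / n, sum_y / n
    xy = sum(v * y[i] for i, v in x.items() if i in y)  # sparse dot product
    return xy - mx * sum_y - my * sum_x + n * mx * my


r1 = pearson_corated(matrix, inception)
r2 = pearson_missing_as_zero(matrix, inception)
r3 = pearson_mean_of_defined(matrix, inception)
d = sparse_centered_dot({0: 5, 2: 4}, {0: 4, 1: 5, 2: 2}, 3)
print(r1, r2, r3, d)
```

On this example the three conventions give correlations of roughly 1.0, -0.62 and 0.65, so the choice of mean does change the result, and none of them reproduces the 0.47 from the slides. The sparse expansion returns -5, the same centered dot product that option 2 computes densely.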