mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Singular vectors of a recommendation Item-Item space
Date Wed, 31 Aug 2011 15:04:08 GMT
Uhh...

A' is the transpose of A.  Not the inverse.

A' A *is* the summation version.

On Wed, Aug 31, 2011 at 1:24 AM, Lance Norskog <goksron@gmail.com> wrote:

> "Also, if your original matrix is A, then it is usually a shorter path to
> results to analyze the word (item) cooccurrence matrix A'A.  The methods
> below work either way."
>
> The cooccurrence definitions I'm finding only use the summation-based one
> in
> wikipedia. Are there any involving inverting the matrix instead? Or, as I
> suspect, are the two exactly the same but I don't know enough linear
> algebra?
>
> On Mon, Aug 29, 2011 at 3:28 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > Jeff,
> >
> > I think that this is a much simpler exposition:
> > http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
> >
> > It makes the connection with entropy clear and allows a very simple
> > implementation for more than 2x2 situations.
> >
> > More comments in-line:
> >
> > On Mon, Aug 29, 2011 at 1:34 PM, Jeff Hansen <dscheffy@gmail.com> wrote:
> >
> > > ...
> > > If we have a item/feature matrix (document/words, user/movies) then we
> > can
> > > consider each row or column to be a sample from larger reference
> > population
> > > of the full matrix.
> >
> >
> > Well, sort of.  If you have a document (user) and a word (item), then you
> > have a joint probability that any given interaction will be between this
> > document and word.  We pretend in this case that each interaction is
> > independent of every other which is patently not true, but very helpful.
> >
> > This joint probability can be approximated more or less accurately as the
> > product of document (user) and word (item) popularities.  In matrix
> terms,
> > this is the same as approximating the joint probability as a rank-1
> matrix
> > that is the outer produced of user and item probability distributions,
> each
> > of which is a vector.  The point of what we are trying to do is to find a
> > good and economical approximation of the joint probability that captures
> > important deviations from this rank-1 approximation.  There are two
> > problems
> > that come up in trying to find this approximation.  The first is that we
> > see
> > counts rather than probabilities and the second is that the matrix is
> big.
> >  Sometimes, it is very big.
> >
> >
> > > In that case the log likelihood significance for any
> > > given observation within the sample will be based on the observation
> > itself
> > > (1), the count of observations for the column (documents with this
> > > word/users that watched this movie), the count of observations for the
> > row
> > > (words in this document, movies this user has watched) and the total
> > number
> > > of any observation within the reference population.  Given those
> numbers,
> > > G2
> > > or log likelihood is just a calculation of a, b, c and d (values
> > described
> > > above respectively)
> > >
> >
> > As the blog I mentioned above makes more clear than original paper, you
> > need
> > to reduce the counts before taking them as entries in the contingency
> > table.
> >  If you are looking at the typical problem of vocabulary in documents,
> then
> > the discrepancy won't be huge, but if you have any very popular items,
> this
> > could be a problem.
> >
> > Given the log likelihood for each individual observation, is the best way
> > to
> > > apply it simply to remove observations that don't have a high enough
> > level
> > > of significance?
> >
> >
> > Roughly.  It is a bit misleading to talk about "significance" here since
> we
> > aren't trying to do a classic frequentist hypothesis test.  Instead, what
> > we
> > are trying to do is prioritize which joint terms we need to use to get a
> > usefully better approximation of the underlying joint distribution.
> >
> >
> > > Or is there a better way to normalize the values rather
> > > than removing less significant ones.
> >
> >
> > There are a lot of ways of approaching this.  Most don't do any better
> than
> > the simple expedient of just keeping the highest scoring k words for each
> > document (or items for each user).  Having k be the same for all users is
> > not a bad thing at all from a number of angles.  I often use k = 50 to
> 300.
> >
> > Also, in this case I feel like it makes more sense to treat Users like
> > > Documents and movies like observations of words within a document, so
> > > compare each User and their observations to
> > > the reference population rather than each Movie and the users that
> > reviewed
> > > it.
> >
> >
> > I think that you are on the right track here, but your second sentence
> made
> > a distinction I didn't understand.  User is to item as Document is to
> Word
> > is a fine metaphor for recording your observations. I strongly prefer
> > binary
> > observations, so I usually either count all ratings as 1 or all ratings
> > above a threshold as 1 with everything else being a zero.
> >
> > Whether you subsequently try to find users similar to each other or items
> > similar to each other doesn't much matter.  All are reasonable things to
> > try.  Whether they are useful is up to you and your customer to decide.
> >
> > Also, if your original matrix is A, then it is usually a shorter path to
> > results to analyze the word (item) cooccurrence matrix A'A.  The methods
> > below work either way.
> >
> >
> > > I realize this is a Mahout user list, so I feel somewhat guilty
> > constantly
> > > pasting in R code, but here's how I ended up implementing it.
> >
> >
> > I do it.  Use the right tools for the right task.  R is great for
> > prototyping.
> >
> >
> > > #rough equation that calculates log likelihood given a contingency
> table
> > of
> > > counts
> > > llr <- function(a,b,c,d){
> > >    r=(a+b)/(c+d)
> > >    e1=c*r
> > >    e2=d*r
> > >    g2=2*(a*log(a/e1)+b*log(b/e2))
> > >    return(g2)
> > > }
> > >
> >
> > So this code implies that you are arranging the elements of the
> contingency
> > table this way:
> >
> >  a    b  | a+b
> >  c    d  | c+d
> > ---------+--------
> > a+c  b+d | a+b+c+d
> >
> > But your later code you pass in data that uses this convention
> >
> >  a   b-a   | b
> >  c   d-b-c | d-b
> > -----------+-----
> > a+c  d-a-c | d
> >
> > I don't like this form because it makes code more complex overall and
> > breaks
> > important symmetries.
> >
> > #get counts to calculate log likelihood for function above
> > > a <- 1
> > > b <- apply(A,2,sum)
> >
> > c <- apply(A,1,sum)
> > >
> >
> > you can also use rowSums and columnSums
> >
> > d <- sum(A)
> > >
> > > #export sparse matrix A to a dataframe for calculating llr
> > > triples <- summary(A)
> > > #create b and c value vectors that sync up with the dataframe columns
> > > bx <- b[triples$j]
> > > cx <- c[triples$i]
> > > #caculate the log likelihood for each entry in the dataframe
> > > ll <- llr(a,bx,cx,d)
> > >
> >
> > This probably needs to be something more like
> >
> > llr(a, bx-a, d-cx, d-bx-cx+a)
> >
> > (I think)
> >
> > Also, summary produces different values for different kinds of matrices.
> >  As
> > you point out, it is very handy for sparse matrices, but it is nice to
> have
> > code that works on sparse or dense matrices.
> >
> >
> > > #create a logical operator vector that indicates whether each entry in
> > > dataframe is significant
> > > significant <- ll>3
> > >
> >
> > I think that a much higher cutoff is better.  Commonly a cutoff of 10-30
> is
> > useful.  Also, you may be better off if you can take the k-highest.
> >
> >
> > > #build a new sparse vector that only entries with significant enough
> log
> > > likelihood
> > > B <-
> sparseMatrix(i=triples$i[significant],j=triples$j[significant],x=1)
> > >
> >
> >
> >
> > >
> > > By the way, I always used to struggle with matrix multiplication (I
> could
> > > never quite remember which side was rows and which side was columns
> with
> > > out
> > > little mnemonics, but then again I've always struggled with remembering
> > my
> > > left from my right).  I recently realized it makes much more sense if
> you
> > > picture it as a cube in 3space -- if you match up the lower left hand
> > > corner
> > > of the right matrix with the upper right hand corner of the left matrix
> > and
> > > put the product in between them so you get a backwards capital L shape,
> > > then
> > > fold back the right and left matrices you get three faces of a cube (or
> > > cubic rectangle).  Then you just project inwards from the original
> > matrices
> > > and multiply where ever the values cross and sum them up those products
> > as
> > > you project them toward the face which is the resulting matrix.  Once
> you
> > > visualize it this way it makes slicing and dicing the operation into
> > > blockwise formulas trivial and obvious.  If only somebody had explained
> > it
> > > this way I would have found matrices much less intimidating -- has
> > anybody
> > > ever seen it explained this way?
> > >
> > >
> > >
> > >
> > > On Fri, Aug 26, 2011 at 10:21 PM, Lance Norskog <goksron@gmail.com>
> > wrote:
> > >
> > > >
> > > >
> > >
> >
> http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html
> > > >
> > > > Do not fear demographics. Yes, some people rent movies with all-black
> > > > casts,
> > > > and other people rent movies with all-white casts. And the Walmarts
> in
> > > the
> > > > SF East Bay have palettes full of Tyler Perry videos, while most of
> the
> > > SF
> > > > Peninsula don't know his name.
> > > >
> > > > And sometimes a bunch of Army Rangers have a great time watching
> > > > "Confessions of a Shopaholic" in a tent, 120o farenheit, in Qatar.
> > > >
> > > > On Fri, Aug 26, 2011 at 8:29 AM, Jeff Hansen <dscheffy@gmail.com>
> > wrote:
> > > >
> > > > > Thanks for the math Ted -- that was very helpful.
> > > > >
> > > > > I've been using sparseMatrix() from libray(Matrix) -- largely based
> > on
> > > > your
> > > > > response to somebody elses email.  I've been playing with smaller
> > > > matrices
> > > > > mainly for my own learning purposes -- it's much easier to read
> > through
> > > > 200
> > > > > movies (most of which I've heard of) and get a gut feel, than
> 10,000
> > > > > movies.
> > > > >  It also means I don't have to shut down my session quite as often
> > > (I've
> > > > > been using rstudio) when I run something over the wrong
> > > dimension(column
> > > > vs
> > > > > row).  I was running into some limitations as well, but I think
> some
> > of
> > > > > those may have had more to do with my own typos and
> misunderstanding
> > of
> > > > > that
> > > > > language.
> > > > >
> > > > > There's a second reason I've been avoiding the tail -- while
> working
> > on
> > > > > this
> > > > > as a possible project to propose, I had a bit of a "duh" moment.
> > > >  Sometimes
> > > > > it's important to realize the real world constraints.  Picture a
> > > company
> > > > > with very physical locations and very limited shelf space (say a
> > > > > convenience
> > > > > store or even a vending machine...)  Netflix and Amazon can make
a
> > lot
> > > of
> > > > > money in the long tail because they have centralized and or digital
> > > > > inventories that service a large customer base.  Imagine a
> situation
> > > > where
> > > > > you have a small physical distributed inventory with an equally
> > > > distributed
> > > > > customer base.  In that situation it doesn't pay to chase after the
> > > tail
> > > > --
> > > > > you just need to reduce your costs and target the head where you
> get
> > > more
> > > > > volume with less items.  So you really end up with a three tier
> > market
> > > > > segmentation -- one strategy works best for the head, another for
> the
> > > > body,
> > > > > and a third for the tail.
> > > > >
> > > > > As far as clusters go -- I really wasn't finding any clusters at
> the
> > > > edges
> > > > > of the data, but that could have more to do with not including the
> > tail
> > > > > (and
> > > > > not normalizing appropriately for popularity).
> > > > >
> > > > > Incidentally, there was one amusingly cautionary anecdote I'd share
> > --
> > > I
> > > > > don't remember the exact spread, but at one point I had two
> extremes
> > > that
> > > > > included something like "the johnson family vacation", "my baby's
> > > daddy",
> > > > > and "barbershop 2" on the one side, and then titles like "mean
> > girls",
> > > > "50
> > > > > first dates" and "along came polly" on the other side.  You may
> have
> > to
> > > > > look
> > > > > up those titles to see what I'm talking about, but I'd say the
> > > > distinction
> > > > > will be pretty clear when you do -- and if you were simply
> automating
> > > all
> > > > > of
> > > > > this and not reviewing it with common sense, you could end up
> > offending
> > > > > some
> > > > > of your users...  All I'm saying is there may be trends in the real
> > > world
> > > > > that some people aren't comfortable having pointed out.
> > > > >
> > > > > On Fri, Aug 26, 2011 at 4:08 AM, Lance Norskog <goksron@gmail.com>
> > > > wrote:
> > > > >
> > > > > > This is a little meditation on user v.s. item matrix density.
The
> > > > > > heavy users and heavy items can be subsampled, once they are
> > > > > > identified. Hadoop's built-in sort does give a very simple
> > > > > > "map-increase" way to do this sort.
> > > > > >
> > > > > >
> > > http://ultrawhizbang.blogspot.com/2011/08/sorted-recommender-data.html
> > > > > >
> > > > > > On Thu, Aug 25, 2011 at 5:57 PM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > > > > wrote:
> > > > > > > In matrix terms the binary user x item matrix maps a set
of
> items
> > > to
> > > > > > users
> > > > > > > (A h = users who interacted with items in h).  Similarly
A'
> maps
> > > > users
> > > > > to
> > > > > > > items.  Thus A' (A h) is the classic "users who x'ed this
also
> > x'ed
> > > > > that"
> > > > > > > sort of operation.  This can be rearranged to be (A'A)
h.
> > > > > > >
> > > > > > > This is where the singular values come in.  If
> > > > > > >
> > > > > > >     A \approx U S V'
> > > > > > >
> > > > > > > and we know that U'U = V'V = I from the definition of the
SVD.
> > >  This
> > > > > > means
> > > > > > > that
> > > > > > >
> > > > > > >    A' A \approx (U S V')' (U S V') = V S U' U S V' = V
S^2 V'
> > > > > > >
> > > > > > > But V' transforms an item vector into the reduced dimensional
> > space
> > > > so
> > > > > we
> > > > > > > can view this as transforming the item vector into the
reduced
> > > > > > dimensional
> > > > > > > space, scaling element-wise using S^2 and then transforming
> back
> > to
> > > > > item
> > > > > > > space.
> > > > > > >
> > > > > > > As you noted, the first right singular vector corresponds
to
> > > > > popularity.
> > > > > >  It
> > > > > > > is often not appropriate to just recommend popular things
> (since
> > > > users
> > > > > > have
> > > > > > > other means of discovering them) so you might want to drop
that
> > > > > dimension
> > > > > > > from the analysis.  This can be done in the SVD results
or it
> can
> > > be
> > > > > done
> > > > > > > using something like a log-likelihood ratio test so that
we
> only
> > > > model
> > > > > > > anomalously large values in A'A.
> > > > > > >
> > > > > > > Btw, if you continue to use R for experimentation, you
should
> be
> > > able
> > > > > to
> > > > > > > dramatically increase the size of the analysis you do by
using
> > > sparse
> > > > > > > matrices and a random projection algorithm.  I was able
to
> > compute
> > > > > SVD's
> > > > > > of
> > > > > > > matrices with millions of rows and columns and a few million
> > > non-zero
> > > > > > > elements in a few seconds.  Holler if you would like details
> > about
> > > > how
> > > > > to
> > > > > > do
> > > > > > > this.
> > > > > > >
> > > > > > > On Thu, Aug 25, 2011 at 4:07 PM, Jeff Hansen <
> dscheffy@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> One thing I found interesting (but not particularly
> surprising)
> > is
> > > > > that
> > > > > > the
> > > > > > >> biggest singular value/vector was pretty much tied
directly to
> > > > volume.
> > > > > > >>  That
> > > > > > >> makes sense because the best predictor of whether a
given
> fields
> > > > value
> > > > > > was
> > > > > > >> 1
> > > > > > >> was whether it belonged to a row with lots of 1s or
a column
> > with
> > > > lots
> > > > > > of
> > > > > > >> 1s
> > > > > > >> (I haven't quite figured out the best method to normalize
the
> > > values
> > > > > > with
> > > > > > >> yet).  When I plot the values of the largest singular
vectors
> > > > against
> > > > > > the
> > > > > > >> sum of values in the corresponding row or column, the
> > correlation
> > > is
> > > > > > very
> > > > > > >> linear.  That's not the case for any of the other singular
> > vectors
> > > > > > (which
> > > > > > >> actually makes me wonder if just removing that first
singular
> > > vector
> > > > > > from
> > > > > > >> the prediction might not be one of the best ways to
normalize
> > the
> > > > > data).
> > > > > > >>
> > > > > > >> I understand what you're saying about each singular
vector
> > > > > corresponding
> > > > > > to
> > > > > > >> a feature though.  Each left singular vector represents
some
> > > > abstract
> > > > > > >> aspect
> > > > > > >> of a movie and each right singular vector represents
users
> > > leanings
> > > > or
> > > > > > >> inclinations with regards to that aspect of the movie.
 The
> > > singular
> > > > > > value
> > > > > > >> itself just seems to indicate how good a predictor
the
> > combination
> > > > of
> > > > > > one
> > > > > > >> users inclination toward that aspect of a movie is
for coming
> up
> > > > with
> > > > > > the
> > > > > > >> actual value.  The issue I mentioned above is that
popularity
> of
> > a
> > > > > movie
> > > > > > as
> > > > > > >> well as how often a user watches movies tend to be
the best
> > > > predictors
> > > > > > of
> > > > > > >> whether a user has seen or will see a movie.
> > > > > > >>
> > > > > > >> I had been picturing this with the idea of one k dimensional
> > space
> > > > --
> > > > > > one
> > > > > > >> where a users location corresponds to their ideal prototypical
> > > > movies
> > > > > > >> location. Not that there would be a movie right there,
but
> there
> > > > would
> > > > > > be
> > > > > > >> ones nearby, and the nearer they were, the more enjoyable
they
> > > would
> > > > > be.
> > > > > > >>  That's a naive model, but that doesn't mean it wouldn't
work
> > well
> > > > > > enough.
> > > > > > >>  My problem is I don't quite get how to map the user
space
> over
> > to
> > > > the
> > > > > > item
> > > > > > >> space.
> > > > > > >>
> > > > > > >> I think that may be what Lance is trying to describe
in his
> last
> > > > > > response,
> > > > > > >> but I'm falling short on reconstructing the math from
his
> > > > description.
> > > > > > >>
> > > > > > >> I get the following
> > > > > > >>
> > > > > > >> U S V* = A
> > > > > > >> U S = A V
> > > > > > >> U = A V 1/S
> > > > > > >>
> > > > > > >> I think that last line is what Lance was describing.
 Of
> course
> > my
> > > > > > problem
> > > > > > >> was bootstrapping in a user for whom I don't know A
or V.
> > > > > > >>
> > > > > > >> I also think I may have missed a big step of the puzzle.
 For
> > some
> > > > > > reason I
> > > > > > >> thought that just by loosening the rank, you could
recompose
> the
> > > > > Matrix
> > > > > > A
> > > > > > >> from the truncated SVD values/vectors and use the recomposed
> > > values
> > > > > > >> themselves as the recommendation.  I thought one of
the ideas
> > was
> > > > that
> > > > > > the
> > > > > > >> recomposed matrix had less "noise" and could be a better
> > > > > representation
> > > > > > of
> > > > > > >> the underlying nature of the matrix than the original
matrix
> > > itself.
> > > > > >  But
> > > > > > >> that may have just been wishful thinking...
> > > > > > >>
> > > > > > >> On Thu, Aug 25, 2011 at 4:50 PM, Sean Owen <srowen@gmail.com>
> > > > wrote:
> > > > > > >>
> > > > > > >> > The 200x10 matrix is indeed a matrix of 10 singular
vectors,
> > > which
> > > > > are
> > > > > > >> > eigenvectors of AA'. It's the columns, not rows,
that are
> > > > > > >> > eigenvectors.
> > > > > > >> >
> > > > > > >> > The rows do mean something. I think it's fair
to interpret
> the
> > > 10
> > > > > > >> > singular values / vectors as corresponding to
some
> underlying
> > > > > features
> > > > > > >> > of tastes. The rows say how much each user expresses
those
> 10
> > > > > tastes.
> > > > > > >> > The matrix of right singular vectors on the other
side tells
> > you
> > > > the
> > > > > > >> > same thing about items. The diagonal matrix of
singular
> values
> > > in
> > > > > the
> > > > > > >> > middle also comes into play -- it's like a set
of
> multipliers
> > > that
> > > > > say
> > > > > > >> > how important each feature is. (This is why we
cut out the
> > > > singular
> > > > > > >> > vectors / values that have the smallest singular
values;
> it's
> > > like
> > > > > > >> > removing the least-important features.) So really
you'd have
> > to
> > > > > stick
> > > > > > >> > those values somewhere; Ted says it's conventional
to put
> > "half"
> > > > of
> > > > > > >> > each (their square roots) with each side if anything.
> > > > > > >> >
> > > > > > >> > I don't have as good a grasp on an intuition for
the columns
> > as
> > > > > > >> > eigenvectors. They're also a set of basis vectors,
and I had
> > > > > > >> > understood them as like the "real" bases of the
reduced
> > feature
> > > > > space
> > > > > > >> > expressed in user-item space. But I'd have to
go back and
> > think
> > > > that
> > > > > > >> > intuition through again since I can't really justify
it from
> > > > scratch
> > > > > > >> > again in my head just now.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Thu, Aug 25, 2011 at 10:21 PM, Jeff Hansen
<
> > > dscheffy@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >> > > Well, I think my problem may have had more
to do with what
> I
> > > was
> > > > > > >> calling
> > > > > > >> > the
> > > > > > >> > > eigenvector...  I was referring to the rows
rather than
> the
> > > > > columns
> > > > > > of
> > > > > > >> U
> > > > > > >> > and
> > > > > > >> > > V.  While the columns may be characteristic
of the overall
> > > > matrix,
> > > > > > the
> > > > > > >> > rows
> > > > > > >> > > are characteristic of the user or item (in
that they are a
> > > rank
> > > > > > reduced
> > > > > > >> > > representation of that person or thing).
I guess you could
> > say
> > > I
> > > > > > just
> > > > > > >> had
> > > > > > >> > to
> > > > > > >> > > tilt my head to the side and change my perspective
90
> > degrees
> > > =)
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Lance Norskog
> > > > > > goksron@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Lance Norskog
> > > > goksron@gmail.com
> > > >
> > >
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message