mahout-user mailing list archives

From: Johannes Schulte <johannes.schu...@gmail.com>
Subject: Re: K-Means as a surrogate for Matrix Factorization
Date: Fri, 05 Oct 2012 15:57:52 GMT
Hi Ted,

thanks for the hints. I am wondering, however, what the reverse projection
would be needed for. Do you mean only for explaining recommendations, or for
validating a model manually?

Also, your idea is to first reduce the dimensionality via random projection
(as opposed to matrix factorization??) and then do a clustering in the new
space to derive features, right?

Can you point out how that would be different from using an SVD reduction as
a feature generation technique? If I got it right, it's for scalability /
performance reasons, right?

I have the feeling that, with only a single machine at hand, it's easier for
me to start with simple k-means code and tweak that. All the non-distributed
MF algorithms are either slow or not really suited to binary data, if I
understand everything correctly. With k-means I can avoid my biggest factor
(users).
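
To make that concrete, here is roughly the single-machine starting point I
have in mind. It's just a minimal sketch, all names are made up, and the
real version would run over sparse co-occurrence vectors rather than dense
arrays:

import java.util.Random;

public class ItemClusterFeatures {

  // Plain Lloyd-style k-means over dense item vectors.
  static double[][] kmeans(double[][] items, int k, int iterations, Random rnd) {
    int dim = items[0].length;
    double[][] centroids = new double[k][];
    for (int j = 0; j < k; j++) {
      // Naive initialization: random items as starting centroids.
      centroids[j] = items[rnd.nextInt(items.length)].clone();
    }
    for (int it = 0; it < iterations; it++) {
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (double[] item : items) {
        // Assign each item to its nearest centroid and accumulate.
        int best = nearest(centroids, item);
        counts[best]++;
        for (int d = 0; d < dim; d++) {
          sums[best][d] += item[d];
        }
      }
      for (int j = 0; j < k; j++) {
        // Recompute centroids; empty clusters keep their old position.
        if (counts[j] > 0) {
          for (int d = 0; d < dim; d++) {
            centroids[j][d] = sums[j][d] / counts[j];
          }
        }
      }
    }
    return centroids;
  }

  static int nearest(double[][] centroids, double[] x) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int j = 0; j < centroids.length; j++) {
      double dist = squaredDistance(centroids[j], x);
      if (dist < bestDist) {
        bestDist = dist;
        best = j;
      }
    }
    return best;
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }

  // The feature vector for the regression model: one distance per
  // cluster instead of one latent weight per user.
  static double[] clusterDistances(double[][] centroids, double[] x) {
    double[] features = new double[centroids.length];
    for (int j = 0; j < centroids.length; j++) {
      features[j] = Math.sqrt(squaredDistance(centroids[j], x));
    }
    return features;
  }
}

That way the regression sees k cluster distances per user instead of
millions of user factors.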

I'm really looking forward to the streaming k-means stuff!

Cheers,

Johannes

On Fri, Oct 5, 2012 at 3:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Johannes,
>
> Funny you should mention matrix factorization and k-means at the same
> moment.  I am talking this afternoon in Oxford about just this topic.
>
> Yes, you can use the proximity to near clusters as a useful modeling
> feature, but as Sean said, the cost of matrix factorization should not be
> the motivating factor for that.  There is also a lot of production runtime
> behind matrix factorization recommenders and little behind radial basis
> function ones.
>
> I have been noodling for a bit on the idea that reducing the dimension
> through random projection and then doing k-means in that space would make
> the associated cluster proximities a good basis space for almost any
> modeling.  But reversing the mapping through the stochastic projection
> requires that you solve a compressive sensing sort of problem, which
> involves the L_1 regularized solution of an underdetermined linear system.
> This will be almost exactly the same amount of work as doing a stochastic
> projection SVD and thus quite similar to the ALS sorts of solvers as well.
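>
> (To be concrete: if R is the random projection matrix and y = Rx is the
> reduced vector, reversing the mapping means solving something like
>
>   min_x  ||y - Rx||^2 + lambda * ||x||_1
>
> where the system y = Rx has more unknowns than equations, hence the
> compressive sensing flavor.)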
>
> Clustering can also be quite expensive.  If you use the new k-means stuff
> that isn't quite in Mahout yet, you can get the cost down quite low, but
> that is very new code.  It does give two or three orders of magnitude
> speedup, but it has zero production runtime.
>
> I would thus only expect a win if you use the new clustering and don't need
> explanations and can arrange recommendations to avoid solving the reverse
> problem.
>
> On Fri, Oct 5, 2012 at 12:42 PM, Johannes Schulte <
> johannes.schulte@gmail.com> wrote:
>
> > Sean,
> >
> > thanks for your input.
> >
> > It's more like 30 million users plus an id mapping for both items and
> > users, but I could probably sample that down to something that fits into
> > memory.
> >
> > My main point was that I'm more interested in the decomposition of users
> > into a linear combination of factors, as this is needed to construct
> > feature vectors for a multinomial regression. That's why MF came to mind.
> > (It would also have the nice effect that all user*feature vectors would
> > be present after the factorization, ready to be fed into
> > LogisticRegression.)
> >
> > The recommendation afterwards is driven more by expected profit (per
> > multinomial target class), where the matrix factorization helps in that
> > it can almost always recommend something given a restriction to
> > "CandidateItems", while neighbourhood models (at least precomputed
> > item-item similarities) are not suitable for such ad hoc restrictions.
> >
> > So since I'm using the user-item interactions more like clusters, I was
> > wondering if it wouldn't reduce complexity to leave the user out of the
> > equation. The distance I was planning to use was cosine distance anyway,
> > but thanks for the warning.
> >
> > I think I'll try both, and if something exciting happens I'm gonna tell
> > the whole world (probably not).
> >
> > Have a nice day!
> >
> > Johannes
> >
> > On Fri, Oct 5, 2012 at 12:40 PM, Sean Owen <srowen@gmail.com> wrote:
> >
> > > This doesn't have to be hard to do in real time. With, say, 50
> > > features, 10 million users and items take up about 2GB of heap
> > > (assuming floats, not doubles).
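> > >
> > > (Back of the envelope: 10,000,000 users x 50 features x 4 bytes per
> > > float is 2,000,000,000 bytes, about 2GB; with only ~200k items, the
> > > item factors are tiny by comparison.)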
> > >
> > > You can still compute a factored matrix model regularly in the
> > > background, while folding in new data in real time, projecting new
> > > users / data into the space approximately.
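> > >
> > > (One common way to do that approximate projection, for what it's
> > > worth, is a least-squares fold-in: given the item factor matrix Y and
> > > a new user's interaction vector r, take u = (Y'Y)^-1 Y'r, i.e. the u
> > > minimizing ||r - Yu||^2.)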
> > >
> > > What you're suggesting sounds like a neighborhood-based approach that
> > > just throws in clustering as a speed-up. Yes, that is broadly a valid
> > > approach.
> > >
> > > The L2 norm / squared error loss function does show up in NNMF, where
> > > you're minimizing the reconstruction error. You also frequently
> > > minimize Euclidean distances in k-means, which is also squared error
> > > loss. I don't think that's a reason they're both necessarily
> > > applicable here, though I think both are.
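> > >
> > > (Spelled out: NNMF minimizes the reconstruction error ||A - WH||_F^2
> > > with W, H >= 0, while k-means minimizes sum_i ||x_i - c(x_i)||^2 over
> > > centroid assignments c; it is the same squared-error loss in both
> > > cases.)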
> > >
> > > I am not sure the Euclidean distance is going to give reasonable
> > > results in such a high-dimensional and sparse space, and (I think?)
> > > people usually swap in the L1 norm, for example. Others can give more
> > > informed commentary on that. This is part of the reason you'd do
> > > dimensionality reduction to begin with: to make the clustering work.
> > >
> > > But if you've already done the NNMF, you have a basis for
> > > recommendations already, so I wouldn't rule it out. To me, adding
> > > k-means into the mix invites more questions and potential issues on
> > > this input than it may resolve.
> > >
> > >
> > > (This is exactly what Myrrix does, and this scale wouldn't be a big
> > > deal, most especially since you can just partition by user and scale
> > > up to as many servers as you want. I would encourage you to look at
> > > http://myrrix.com as it sounds like it matches up strongly with your
> > > situation.)
> > >
> > > Sean
> > >
> > > On Fri, Oct 5, 2012 at 10:44 AM, Johannes Schulte
> > > <johannes.schulte@gmail.com> wrote:
> > > > Hi!
> > > >
> > > > I have a question concerning a recommendation / classification
> > > > problem which I originally wanted to solve with matrix factorization
> > > > methods from Taste / Mahout.
> > > >
> > > > It has the following properties.
> > > >
> > > > - There are about 200k items
> > > > - There are a lot more users (say, millions) and they are very
> > > > volatile (like sessions)
> > > > - There is no need for the user factor matrix, since the
> > > > recommendation is very "near-time" dependent. At deployment time the
> > > > user factors need to be constructed from the items the user
> > > > interacted with in the last seconds, so relying on an hourly
> > > > deployment cycle is not suitable.
> > > > - The double arrays of user factors for the matrix factorization
> > > > technique become very large
> > > >
> > > > The question now is:
> > > >
> > > > Given that I'm only interested in the item latent features, how does
> > > > that differ from a soft k-means clustering over items (with
> > > > co-occurrence vectors)? I think the recommendation could then also
> > > > be expressed as a linear combination of distances to clusters. Some
> > > > papers suggest that NMF and k-means use basically the same loss
> > > > function, so I hope it's not a totally stupid idea.
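> > > >
> > > > (By "soft" I mean something like: membership of item x in cluster k
> > > > proportional to exp(-||x - c_k||^2 / beta), normalized over the
> > > > clusters, rather than a hard assignment.)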
> > > >
> > > > The cluster membership vectors (or latent features) should be used
> > > > later on as input to a regression model; that's why a neighbourhood
> > > > approach doesn't fit.
> > > >
> > > > The main benefits for me would be:
> > > >
> > > > 1. Simplicity
> > > > 2. Performance (I don't need a running cluster for k-means; it works
> > > > pretty well on one machine, as opposed to MF)
> > > > 3. Maybe more freedom to include side information in the clustering
> > > > without implementing a new MF technique in Mahout
> > > > 4. Incremental updates of clusters to model variation over time,
> > > > maybe someday with the streaming k-means thing
> > > >
> > > > Thanks for your time, I'd appreciate any opinions.
> > > >
> > > > Johannes
> > >
> >
>
