spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-12861) Changes to support KMeans with large feature space
Date Sat, 06 Feb 2016 18:33:39 GMT


Sean Owen commented on SPARK-12861:

I don't think that's the eventual topic of SPARK-4039 (see the discussion). It's the same
proposal as here. There's some discussion there about why you wouldn't want to do this, which
is part of why it was WontFix. I've always been under the impression that this becomes hard
to make meaningful with k-means because in high-dimensional space, everything is far from
everything else.
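The distance-concentration effect behind this concern can be illustrated with a quick, self-contained simulation (my sketch, not from the thread): as dimensionality grows, the nearest and farthest random points from a query become nearly equidistant, which blunts distance-based clustering.

```python
# Illustration of distance concentration in high dimensions (assumption:
# uniform random points; helper names are my own, not Spark API).
import math
import random

def relative_contrast(dim, n_points=100, seed=42):
    """(max - min) / min distance from the origin to random points."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)        # low dimension: distances vary a lot
high = relative_contrast(10_000)  # high dimension: distances concentrate
```

In two dimensions the contrast is large; in ten thousand it collapses toward zero, so "nearest center" carries much less signal.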

For now you should just link them in JIRA (I'll link one).

I am not sure why you say you can't map from the hashed feature space back to the original
one. You can maintain this mapping; the only problem is that you can't disambiguate a hash
collision. You also can't use optimized dense matrix routines on this sparse representation.
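The kind of reverse mapping described here can be sketched as follows (hypothetical names and a toy hash, not Spark's hashing implementation): keep a bucket-to-features index alongside the hashed vectors, and a bucket holding more than one feature is a collision that cannot be disambiguated.

```python
# Minimal sketch (assumed names, not Spark API) of a reverse map from
# hashed buckets back to original feature names.
from collections import defaultdict

NUM_BUCKETS = 16

def bucket(feature):
    # Stable toy hash for illustration; real code would use a proper hash.
    return sum(ord(c) for c in feature) % NUM_BUCKETS

reverse_map = defaultdict(set)
for feature in ["item_1", "item_2", "item_42"]:
    reverse_map[bucket(feature)].add(feature)

def features_for(b):
    """Original features that hashed into bucket b; ambiguous if len > 1."""
    return reverse_map[b]
```

Checking `len(features_for(b)) > 1` is how you would detect the ambiguous (collided) buckets.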

> Changes to support KMeans with large feature space
> --------------------------------------------------
>                 Key: SPARK-12861
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Roy Levin
>              Labels: patch
>     The problem:
>     -----------------
>     In Spark's KMeans code the center vectors are always represented as dense vectors.
> As a result, when each such center has a large feature space the algorithm quickly runs
> out of memory. In my example I have a feature space of around 50,000 and k ~= 500. This
> sums up to around 200 MB of RAM for the center vectors alone, while in fact the center
> vectors are very sparse and require far less RAM.
>     Since I am running on a system with relatively low resources I keep getting OutOfMemory
> errors. In my setting it is OK to trade off runtime for using less RAM. This is what I set
> out to do in my solution, while allowing users the flexibility to choose.
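The memory arithmetic behind the reported numbers can be checked back-of-the-envelope (the 1% non-zero density below is my assumption for illustration, not a figure from the issue):

```python
# Back-of-the-envelope memory estimate for the scenario above.
K, DIM = 500, 50_000
DOUBLE = 8  # bytes per float64 value

dense_bytes = K * DIM * DOUBLE  # every entry stored: 200,000,000 bytes
# A sparse layout stores roughly an int32 index plus a float64 value per
# non-zero; assume (hypothetically) 1% of entries are non-zero.
nnz_per_center = DIM // 100
sparse_bytes = K * nnz_per_center * (4 + DOUBLE)

print(dense_bytes // 2**20, "MB dense")    # ~190 MiB, i.e. the ~200 MB cited
print(sparse_bytes // 2**20, "MB sparse")  # ~2 MiB at 1% density
```

At that density the sparse layout is roughly two orders of magnitude smaller, which is the whole motivation of the issue.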
>     One solution could be to reduce the dimensionality of the feature space, but this is
> not always the best approach. For example, consider an object space comprised of users and
> a feature space of items, where we want to run k-means over features that are a function of
> how many times user i clicked item j. If we reduce the dimensions of the items we will not
> be able to map the center vectors back to the items. Moreover, in a streaming context,
> detecting changes with respect to previous runs becomes more difficult.
>     My solution:
>     ----------------
>     Allow the k-means algorithm to accept a VectorFactory which decides when vectors used
> inside the algorithm should be sparse and when they should be dense. For backward
> compatibility the default behavior is to always make them dense (as is the case now). But
> the user can now provide a SmartVectorFactory (or some proprietary VectorFactory) which can
> decide to make vectors sparse.
>     For this I made the following changes:
>     (1) Added a method called reassign to SparseVector, allowing the indices and values
> to be changed
>     (2) Allowed axpy to accept SparseVectors
>     (3) Created a trait called VectorFactory and two implementations of it that are used
> within the KMeans code
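As a rough illustration of the proposed idea (hypothetical names and plain Python data structures; the actual patch is Scala against Spark's KMeans internals), a factory might pick a representation based on the fraction of non-zero entries:

```python
# Hypothetical sketch of the VectorFactory idea, not the actual Spark patch:
# choose a sparse or dense layout based on the fraction of non-zeros.

def dense(values):
    return list(values)

def sparse(size, values):
    # Store (index, value) pairs for the non-zeros only.
    return {"size": size,
            "entries": [(i, v) for i, v in enumerate(values) if v != 0.0]}

class SmartVectorFactory:
    """Use a sparse layout when fewer than `threshold` of entries are non-zero."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def make(self, values):
        nnz = sum(1 for v in values if v != 0.0)
        if nnz / len(values) < self.threshold:
            return sparse(len(values), values)
        return dense(values)

factory = SmartVectorFactory()
mostly_zero = factory.make([0.0, 0.0, 0.0, 2.5])  # chosen sparse
mostly_full = factory.make([1.0, 2.0, 0.0, 3.0])  # chosen dense
```

The default factory would always return the dense form, matching current behavior, so existing callers are unaffected.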
>     To get the above described solution do the following:
>     git clone -b SupportLargeFeatureDomains
> Note
> ------
> There are some similar issues opened in JIRA in the past, e.g.:
>     But the difference is that in the problem I describe, reducing the dimensions of the
> problem (i.e., the feature space) to allow using dense vectors is not suitable. Also, the
> solution I implemented supports this while giving the user full flexibility, i.e., using
> the default dense vector implementation or selecting an alternative (only when the default
> is not desired).

This message was sent by Atlassian JIRA
