spark-user mailing list archives

From: Xiaoli Li <>
Subject: Re: Huge matrix
Date: Sat, 12 Apr 2014 18:20:35 GMT
Hi Tom,

Thank you very much for your detailed explanation. It is very helpful to me.

On Sat, Apr 12, 2014 at 1:06 PM, Tom V <> wrote:

> The last writer is suggesting using the triangle inequality to cut down
> the search space.  If c is the centroid of cluster C, then the closest any
> point in C can be to x is ||x-c|| - r(C), where r(C) is the (precomputed)
> radius of the cluster, i.e. the distance from the farthest point in C to c.
> Whether you want to use the bound for an exact algorithm or as a search
> heuristic is up to you.
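[Editor's note: a minimal sketch of the triangle-inequality pruning Tom describes, not from the thread itself. It assumes Euclidean distance, precomputed centroids, and per-cluster radii; visiting clusters in order of their lower bound makes the early-exit test exact.]

```python
import numpy as np

def knn_with_cluster_pruning(x, clusters, k):
    """Exact k-NN search using the bound ||x - c|| - r(C).

    `clusters` is a list of (centroid, radius, points) tuples, where
    `radius` is the distance from the centroid to its farthest member.
    """
    # Visit clusters in order of their lower bound, so good candidates
    # are found early and pruning kicks in as soon as possible.
    order = sorted(clusters, key=lambda c: np.linalg.norm(x - c[0]) - c[1])
    best = []  # (distance, point) pairs, kept sorted, at most k long
    for centroid, radius, points in order:
        lower = np.linalg.norm(x - centroid) - radius
        if len(best) == k and lower >= best[-1][0]:
            # No point in this cluster (or any later one, since clusters
            # are visited by increasing bound) can improve the top k.
            break
        for p in points:
            d = np.linalg.norm(x - p)
            if len(best) < k or d < best[-1][0]:
                best.append((d, p))
                best.sort(key=lambda t: t[0])
                best = best[:k]
    return best
```

Because the bound is a true lower bound on the distance to any member of the cluster, the result is identical to a brute-force scan, only cheaper.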
> Another approach can be very successful if the user-attributes matrix is
> very sparse.  The idea is to exploit the additive nature of common
> similarity measures; I have used this for cosine similarity and Bayesian
> posteriors.  Here is the basic idea: a "User" index contains all the
> attributes for a given user (sorted by strength), and an "Attribute" index
> contains all the users with a given attribute (again, sorted by strength).
> To find the best k matches for user i, start with user i's strongest
> attribute and search the attribute index to find some users with high
> scores for that attribute.  Then move to user i's next best attribute.  If
> you proceed with a "diagonal frontier" (as you fetch candidates based on
> i's weaker attributes, also go back and fetch more, weaker candidates for
> i's strong attributes), you get a candidate pool and a bound on the
> highest possible score of any user not in the pool.  You just expand the
> candidate pool until the bound drops below the scores of the k best
> candidates.  In practice you'd want to limit the candidate pool size, but
> the result is almost always exact.
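[Editor's note: a hypothetical sketch of the "diagonal frontier" search, not from the thread. It assumes non-negative strengths and a plain dot-product similarity; the stopping bound sums, over the query's attributes, the score still visible at each list's frontier, which no unseen user can exceed.]

```python
from heapq import nlargest

def top_k_similar(q_user, user_index, attr_index, k, batch=2):
    """Threshold-style top-k search over sorted inverted lists.

    user_index: {user: {attr: strength}}
    attr_index: {attr: [(user, strength), ...]} sorted by strength desc.
    """
    q = user_index[q_user]
    attrs = sorted(q, key=q.get, reverse=True)   # query's attrs, strongest first
    pos = {a: 0 for a in attrs}                  # frontier position in each list
    scores = {}                                  # exact scores of candidates seen

    def full_score(u):
        v = user_index[u]
        return sum(q[a] * v.get(a, 0.0) for a in q)

    while True:
        progressed = False
        for a in attrs:                          # advance the frontier diagonally
            lst = attr_index[a]
            for u, _s in lst[pos[a]:pos[a] + batch]:
                if u != q_user and u not in scores:
                    scores[u] = full_score(u)    # fetch the candidate's full row
            if pos[a] < len(lst):
                pos[a] = min(pos[a] + batch, len(lst))
                progressed = True
        # An unseen user can score at most the current frontier value
        # in every list, so this bounds any candidate not yet fetched.
        bound = sum(q[a] * (attr_index[a][pos[a]][1]
                            if pos[a] < len(attr_index[a]) else 0.0)
                    for a in attrs)
        top = nlargest(k, scores.values())
        if not progressed or (len(top) == k and bound <= top[-1]):
            break
    return nlargest(k, scores.items(), key=lambda t: t[1])
```

This is essentially a threshold-algorithm-style search: the sparser the matrix and the steeper the strength distributions, the sooner the bound falls below the k-th best candidate.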
> For a million users, you should be able to distribute the things needed to
> make a recommendation (either the centroids or the attributes matrix), and
> just break up the work based on the users you want to generate
> recommendations for.  I hope this helps.
> Tom
> On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li <> wrote:
>> Hi Guillaume,
>> This sounds like a good idea to me. I am a newbie here. Could you explain
>> further how you will determine which clusters to keep? By the distance
>> between each element and each cluster center?
>> Will you keep several clusters for each element when searching for
>> nearest neighbours? Thanks.
>> Xiaoli
>> On Sat, Apr 12, 2014 at 9:46 AM, Guillaume Pitel <> wrote:
>>> Hi,
>>> I'm doing this here for multiple tens of millions of elements (and the
>>> goal is to reach multiple billions), on a relatively small cluster (7
>>> nodes, 4 cores and 32 GB RAM each). We use multiprobe KLSH. All you have
>>> to do is run a k-means on your data, then compute the distance between
>>> each element and each cluster center, keep a few clusters, and only look
>>> into these clusters for nearest neighbours.
>>> This method is known to perform very well and to vastly speed up the
>>> computation.
>>> The hardest part is deciding how many clusters to compute and how many
>>> to keep. As a rule of thumb, I generally aim for 300-10000 elements per
>>> cluster, and keep 5-20 clusters.
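[Editor's note: a minimal illustration of Guillaume's scheme, not from the thread. It assumes the centroids have already been computed (e.g. by a k-means run) and uses Euclidean distance; each element keeps its few nearest clusters, and neighbour search is restricted to elements sharing a kept cluster.]

```python
import numpy as np

def cluster_candidates(points, centroids, keep=3):
    """Keep the `keep` nearest cluster centers for every point; two points
    are candidate neighbours iff they share at least one kept cluster.
    Returns {point_index: set of candidate point indices}."""
    # Distance matrix of shape (n_points, n_clusters).
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    probes = np.argsort(d, axis=1)[:, :keep]      # the clusters each point keeps
    buckets = {}
    for i, cs in enumerate(probes):
        for c in cs:
            buckets.setdefault(int(c), set()).add(i)
    cands = {i: set() for i in range(len(points))}
    for members in buckets.values():
        for i in members:
            cands[i] |= members - {i}
    return cands

def top_k_neighbours(points, cands, i, k):
    """Exact distances, but only within point i's candidate set."""
    ds = sorted((np.linalg.norm(points[i] - points[j]), j) for j in cands[i])
    return [j for _, j in ds[:k]]
```

Keeping more than one cluster per element (the "multiprobe" part) is what recovers neighbours that fall just across a cluster boundary; `keep` trades recall against the size of each candidate set.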
>>> Guillaume
>>> I am implementing an algorithm using Spark. I have one million users,
>>> and I need to compute the similarity between each pair of users using
>>> some of the users' attributes. For each user, I need to get the top k
>>> most similar users. What is the best way to implement this?
>>> Thanks.
>>> --
>>>  *Guillaume PITEL, Président*
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>  eXenSa S.A.S. <>
>>>  41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
