Hi Tom,
Thank you very much for your detailed explanation. It is very
helpful to me.
On Sat, Apr 12, 2014 at 1:06 PM, Tom V <minnesota.cs@gmail.com> wrote:
> The last writer is suggesting using the triangle inequality to cut down
> the search space. If c is the centroid of cluster C, then the closest any
> point in C is to x is ||x - c|| - r(C), where r(C) is the (precomputed)
> radius of the cluster, i.e. the distance of the farthest point in C to c.
> Whether you want to use the bound for an exact algorithm or a search
> heuristic is up to you.
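>
> A quick sketch of that pruning test (plain Python + NumPy; the names are
> mine, not from any library):
>
> import numpy as np
>
> def clusters_to_scan(x, centroids, radii, best_dist):
>     # Triangle inequality: for any point p in cluster C with centroid c,
>     # ||x - p|| >= ||x - c|| - r(C). If that lower bound is already worse
>     # than the best distance found so far, the whole cluster can be skipped.
>     d = np.linalg.norm(centroids - x, axis=1)   # ||x - c|| for every center
>     return np.where(d - radii < best_dist)[0]   # only these need scanning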
>
> Another approach can be very successful if the user-attributes matrix is
> very sparse: The idea is to exploit the additive nature of common
> similarity measures. I have used this for cosine and Bayesian posterior.
> Here is the basic idea: A "User" index contains all the attributes for a
> certain user (sorted by strength), and an "Attribute" index contains all
> the users with a certain attribute. (Again, sorted by strength). To find
> the best k matches for user i, start with user i's strongest attribute and
> search the attribute index to find some users with high scores for that
> attribute. Move to user i's next best attribute. If you proceed with a
> "diagonal frontier" (as you fetch candidates bases on i's weaker
> attributes, also go back and fetch more weak candidates for i's strong
> attributes), you can get a candidate pool and a bound on the highest score
> of any user not in the candidate pool. You just expand the candidate pool
> until the bound drops below the scores of the k best candidates.
> Practically, you'd want to limit the candidate pool size, but it's almost
> always exact.
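>
> Here is a rough single-machine sketch of that search (plain Python; the
> index layout and names are my own, and it assumes dot-product scoring):
>
> import heapq
>
> def sparse_dot(attrs_a, attrs_b):
>     # attrs_* are lists of (attribute, weight); sparse dot product.
>     wb = dict(attrs_b)
>     return sum(w * wb[a] for a, w in attrs_a if a in wb)
>
> def top_k(i, user_idx, attr_idx, k, step=10):
>     # user_idx[u] = [(attr, weight)], sorted by weight descending
>     # attr_idx[a] = [(user, strength)], sorted by strength descending
>     my_attrs = user_idx[i]
>     seen, heap, depth = {i}, [], 0
>     while True:
>         bound = 0.0
>         for a, w in my_attrs:
>             postings = attr_idx[a]
>             # Fetch the next slice of candidates for this attribute.
>             for u, s in postings[depth:depth + step]:
>                 if u not in seen:
>                     seen.add(u)
>                     heapq.heappush(heap, (sparse_dot(my_attrs, user_idx[u]), u))
>                     if len(heap) > k:
>                         heapq.heappop(heap)   # keep only the k best so far
>             # An unseen user's strength in a is at most the frontier value.
>             if depth + step <= len(postings):
>                 bound += w * postings[depth + step - 1][1]
>         depth += step
>         # No unseen user can score above `bound`; stop once the k-th best
>         # candidate beats it (or everything has been scanned).
>         if bound == 0.0 or (len(heap) == k and heap[0][0] >= bound):
>             return sorted(heap, reverse=True)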
>
> For a million users, you should be able to distribute the things needed to
> make a recommendation (either the centroids or the attributes matrix), and
> just break up the work based on the users you want to generate
> recommendations for. I hope this helps.
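>
> For example, in PySpark (a sketch only; broadcast/parallelize/map are the
> real API, while centroids, radii, target_users, and top_k_for are
> placeholders you would supply):
>
> from pyspark import SparkContext
>
> sc = SparkContext(appName="topk-recs")
>
> # Ship the small shared state to every executor once, not with every task.
> shared = sc.broadcast((centroids, radii))       # precomputed on the driver
>
> def recommend(user):
>     cents, rads = shared.value
>     # ...prune clusters with the bound above, score the survivors...
>     return user, top_k_for(user, cents, rads)   # hypothetical scorer
>
> # Split the work by the users you want recommendations for.
> recs = sc.parallelize(target_users).map(recommend).collect()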
>
> Tom
>
>
> On Sat, Apr 12, 2014 at 11:35 AM, Xiaoli Li <lixiaolimail1@gmail.com> wrote:
>
>> Hi Guillaume,
>>
>> This sounds like a good idea to me. I am a newbie here. Could you explain
>> further how you would determine which clusters to keep? Based on the
>> distance between each element and each cluster center?
>> Would you keep several clusters for each element when searching for
>> nearest neighbours? Thanks.
>>
>> Xiaoli
>>
>>
>>
>>
>> On Sat, Apr 12, 2014 at 9:46 AM, Guillaume Pitel <
>> guillaume.pitel@exensa.com> wrote:
>>
>>> Hi,
>>>
>>> I'm doing this here for multiple tens of millions of elements (and the
>>> goal is to reach multiple billions), on a relatively small cluster (7
>>> nodes, 4 cores, 32GB RAM). We use multiprobe KLSH. All you have to do is
>>> run a k-means on your data, then compute the distance between each
>>> element and each cluster center, keep a few of the nearest clusters, and
>>> only look inside those clusters for nearest neighbours.
>>>
>>> This method is known to perform very well and to vastly speed up the
>>> computation.
>>>
>>> The hardest part is to decide how many clusters to compute, and how many
>>> to keep. As a rule of thumb, I generally want 300-10000 elements per
>>> cluster, and use 5-20 clusters.
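>>>
>>> A toy sketch of the whole pipeline (single machine, scikit-learn,
>>> Euclidean distance; sizes follow the rule of thumb above):
>>>
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>>
>>> X = np.random.rand(20000, 32)          # stand-in data, ~500 per cluster
>>> km = KMeans(n_clusters=40, n_init=4).fit(X)
>>>
>>> d = km.transform(X)                    # distance of each element to each center
>>> probes = np.argsort(d, axis=1)[:, :5]  # keep the 5 nearest clusters each
>>> members = {c: np.where(km.labels_ == c)[0] for c in range(km.n_clusters)}
>>>
>>> def neighbours(i, k=10):
>>>     # Only scan elements that live in one of i's probed clusters.
>>>     cand = np.concatenate([members[c] for c in probes[i]])
>>>     dist = np.linalg.norm(X[cand] - X[i], axis=1)
>>>     return cand[np.argsort(dist)[1:k + 1]]   # drop i itself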
>>>
>>> Guillaume
>>>
>>>
>>> I am implementing an algorithm using Spark. I have one million users.
>>> I need to compute the similarity between each pair of users based on
>>> some user attributes. For each user, I need to get the top k most
>>> similar users. What is the best way to implement this?
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>> 
>>> Guillaume PITEL, Président
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>> eXenSa S.A.S. <http://www.exensa.com/>
>>> 41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>>
>>
>>
>
