Sebastian: I had a look at the distributed Euclidean similarity and it
computes similarity as 1 - 1 / (1 + d). This is the wrong way around,
right? Higher distance moves the value toward 1.
For consistency, I'm looking to stick with a 1/(1+d) expression for
now (unless someone tells me that's just theoretically inferior for
sure).
I'm thinking of 1 / (1 + d/sqrt(n)) as a better attempt at normalizing
away the effect of more dimensions.
How's that sound, and shall I make the distributed version behave similarly?
On Wed, Oct 19, 2011 at 3:51 PM, Sean Owen <srowen@gmail.com> wrote:
> Interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that it
> 'penalizes' larger vectors, who have more dimensions along which to
> differ. This is bad when those vectors are the subsets of user pref
> data in which two users overlap: more overlap ought to mean higher
> similarity.
>
> I have an ancient, bad kludge in there that uses n / (1 + d), where n
> is the size of the two vectors. It's trying to normalize away the
> average distance between randomly chosen vectors in the space
> (remember that each dimension is bounded, between min and max rating).
> But that's not n.
>
> Is there a good formula or way of thinking about what that number
> should be? I can't find it on the internet.
>
