mahout-dev mailing list archives

From: Sean Owen <sro...@gmail.com>
Subject: Re: Average distance between two points in unit hypercube?
Date: Wed, 19 Oct 2011 19:28:27 GMT
Sebastian: I had a look at the distributed Euclidean similarity and it
computes similarity as

1 - 1 / (1 + d)

This is the wrong way around, right? A higher distance moves the value
toward 1, when it ought to move it toward 0.

For consistency, I'm looking to stick with a 1/(1+d) expression for
now (unless someone tells me that's just theoretically inferior for
sure).

I'm thinking of 1 / (1 + d/sqrt(n)) as a better attempt at normalizing
away the effect of more dimensions.
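
Concretely, something like the following is what I mean. This is only a
sketch with made-up names, not the actual similarity class or its
signature, and it assumes the two arrays already hold just the
overlapping dimensions (n >= 1):

  // Sketch only: illustrative class/method names, not the real Mahout API.
  public final class NormalizedEuclideanSimilaritySketch {

    // Returns 1 / (1 + d / sqrt(n)), where d is the Euclidean distance
    // between x and y over their n overlapping dimensions.
    // Assumes x and y have the same, nonzero length.
    public static double similarity(double[] x, double[] y) {
      int n = x.length;
      double sumOfSquares = 0.0;
      for (int i = 0; i < n; i++) {
        double diff = x[i] - y[i];
        sumOfSquares += diff * diff;
      }
      double d = Math.sqrt(sumOfSquares);
      return 1.0 / (1.0 + d / Math.sqrt(n));
    }
  }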

How's that sound, and shall I make the distributed version behave similarly?

On Wed, Oct 19, 2011 at 3:51 PM, Sean Owen <srowen@gmail.com> wrote:
> Interesting question came up recently about using the Euclidean
> distance d between two vectors as a notion of their similarity.
>
> You can use 1 / (1 + d), which mostly works, except that it
> 'penalizes' larger vectors, which have more dimensions along which to
> differ. This is bad when those vectors are the subsets of user pref
> data in which two users overlap: more overlap ought to mean higher
> similarity.
>
> I have an ancient, bad kludge in there that uses n / (1 + d), where n
> is the size of the two vectors. It's trying to normalize away the
> average distance between randomly-chosen vectors in the space
> (remember that each dimension is bounded, between min and max rating).
> But that average distance isn't n.
>
> Is there a good formula or way of thinking about what that number
> should be? I can't find it on the internet.
>
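
For the record, a back-of-the-envelope answer to the question quoted
above: if two vectors are drawn uniformly at random from a hypercube of
side w (here w would be max rating minus min rating), each dimension
contributes w^2/6 to the expected squared distance, so E[d^2] = n * w^2 / 6
and the root-mean-square distance is w * sqrt(n/6). In other words the
typical distance grows like sqrt(n), which is what dividing d by sqrt(n)
is meant to cancel. sqrt(n/6) slightly overestimates the mean distance
for small n, but the ratio tends to 1 as n grows. A quick throwaway
Monte Carlo check of that (not meant for the codebase):

  import java.util.Random;

  // Throwaway check: the mean distance between two uniformly random
  // points in [0,1]^n is close to sqrt(n / 6) (exactly the RMS
  // distance), and the agreement improves as n grows.
  public final class HypercubeDistanceCheck {
    public static void main(String[] args) {
      Random random = new Random(42);
      int samples = 100000;
      for (int n : new int[] {1, 2, 5, 10, 50, 100}) {
        double sum = 0.0;
        for (int s = 0; s < samples; s++) {
          double sumOfSquares = 0.0;
          for (int i = 0; i < n; i++) {
            double diff = random.nextDouble() - random.nextDouble();
            sumOfSquares += diff * diff;
          }
          sum += Math.sqrt(sumOfSquares);
        }
        System.out.printf("n=%3d  mean distance=%.4f  sqrt(n/6)=%.4f%n",
            n, sum / samples, Math.sqrt(n / 6.0));
      }
    }
  }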
