mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: why use the job 'itemIDIndex' to convert the itemid to index?
Date Tue, 20 Sep 2011 10:37:21 GMT
It is a problem -- but it should be rare. IDs are hashed to 31-bit
integers, so the probability that any given pair collides is small.
However, you don't have to have too many items before it's probable
that some two have collided. (IIRC, that's about 2^(31/2), or roughly
46,000 items.)
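To put rough numbers on that estimate, here is a minimal sketch of the standard birthday approximation for a 31-bit hash space. It assumes the hash spreads IDs uniformly, which is an idealization; the function names are made up for illustration.

```python
import math

HASH_BITS = 31
BUCKETS = 2 ** HASH_BITS  # number of distinct 31-bit hash values


def collision_probability(n_items: int, buckets: int = BUCKETS) -> float:
    """Approximate chance that at least two of n_items share a hash value."""
    # Birthday approximation: 1 - exp(-n(n-1) / (2m)), assuming uniform hashing.
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * buckets))


def items_for_probability(p: float, buckets: int = BUCKETS) -> int:
    """Approximate item count at which the collision chance reaches p."""
    return math.ceil(math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p))))


# The rule of thumb from the message: the square root of the hash space.
rough_threshold = 2 ** (HASH_BITS / 2)

if __name__ == "__main__":
    print(rough_threshold)
    print(collision_probability(46341))
    print(items_for_probability(0.5))
```

Under these assumptions, the chance of at least one collision is already close to 40% at about 46,000 items and passes 50% somewhere near 54,600, so 2^(31/2) is the right order of magnitude.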

In practice it doesn't hurt much. It just means that data from two
different items has been mixed up and treated as if it was all from
one item. That's not ideal, but has a tiny overall effect on
recommendations.

Another practical tip: if your item IDs all fit into a non-negative
32-bit int already (that is, into 31 bits), then the hash function
won't mix them up at all, as all of them will hash to themselves.
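As an illustration of why small IDs map to themselves, here is a sketch of one plausible 31-bit hash: fold the 64-bit ID by XOR-ing its high and low halves, then clear the top bit. This scheme is an assumption for the example and is not spelled out in this message; `id_to_index` is a hypothetical name.

```python
def id_to_index(item_id: int) -> int:
    """Hypothetical 31-bit hash of a signed 64-bit ID (assumed scheme)."""
    u = item_id & 0xFFFFFFFFFFFFFFFF        # view the ID as unsigned 64-bit
    folded = (u ^ (u >> 32)) & 0xFFFFFFFF   # XOR the high and low 32-bit halves
    return folded & 0x7FFFFFFF              # clear the top bit: 31-bit result


if __name__ == "__main__":
    # IDs from 0 through 2^31 - 1 have an empty high half and a clear top bit,
    # so they come back unchanged.
    print(id_to_index(12345))
    # Larger IDs get folded and can land on an index that a small ID already uses.
    print(id_to_index(2**32 + 5), id_to_index(4))
```

With a scheme like this, two distinct IDs below 2^31 can never collide, while IDs at or above 2^31 can land on an index already taken by a smaller one.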

2011/9/20 张玉东 <zhangyudong@vancl.cn>:
> I am having trouble with this problem: if two item IDs are mapped to the same index, then how
> do you compute the similarity between them?
