mahout-user mailing list archives

From 张玉东 <zhangyud...@vancl.cn>
Subject Re: why use the job 'itemIDIndex' to convert the itemid to index?
Date Tue, 20 Sep 2011 10:44:47 GMT
Yes, the probability of collision is quite small. But my point is that this step does not
seem necessary: I cannot see how it helps the computations that follow.

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com]
Sent: 20 September 2011 18:37
To: user@mahout.apache.org
Subject: Re: why use the job 'itemIDIndex' to convert the itemid to index?

It is a problem -- but should be rare. IDs are hashed to 31-bit
integers, so the probability that any two given items collide is small.
However, you don't have to have too many items before it's probable
that some two have collided. (IIRC, that's at about 2^(31/2) items,
i.e. roughly 46,000.)
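
A rough sketch of that estimate, using the standard birthday approximation. It assumes the 31-bit indices are uniformly distributed, which a real hash of real item IDs only approximates; the function names are illustrative and not part of Mahout.

```python
import math

HASH_BITS = 31            # index width stated in the message
SPACE = 2 ** HASH_BITS    # number of distinct index values


def collision_probability(n_items: int, space: int = SPACE) -> float:
    """Approximate probability that at least two of n_items share an index.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)),
    assuming indices are uniformly distributed over `space` values.
    """
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))


def items_for_probability(p: float, space: int = SPACE) -> int:
    """Approximate item count at which the collision probability reaches p."""
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))


if __name__ == "__main__":
    for n in (1_000, 46_341, 100_000, 1_000_000):
        print(f"{n:>9} items -> p(collision) ~= {collision_probability(n):.4f}")
    print("items for p = 0.5:", items_for_probability(0.5))
```

Under these assumptions, 2^(31/2) (about 46,341) items gives a collision probability of roughly 39%, and the 50% mark is reached at around 54,600 items, so the figure above is the right order of magnitude.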

In practice it doesn't hurt much. It just means that data from two
different items has been mixed up and treated as if it was all from
one item. That's not ideal, but has a tiny overall effect on
recommendations.

Another practical tip: if your item IDs all fit into a non-negative
32-bit int (that is, into 31 bits) already, then the hash function
won't mix them up at all as all of them will hash to themselves.
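
A minimal sketch of why small IDs map to themselves. The message does not show the hash itself, so this assumes a common scheme: fold the 64-bit ID by XORing its high and low 32-bit halves, then mask to 31 bits. `id_to_index` is a hypothetical stand-in, not a confirmed copy of Mahout's implementation.

```python
MASK_31 = 0x7FFFFFFF          # keep the low 31 bits
MASK_64 = 0xFFFFFFFFFFFFFFFF  # treat the ID as an unsigned 64-bit value


def id_to_index(item_id: int) -> int:
    """Hypothetical stand-in: fold a 64-bit ID into a 31-bit index."""
    u = item_id & MASK_64
    folded = (u ^ (u >> 32)) & 0xFFFFFFFF  # XOR of the high and low halves
    return folded & MASK_31


if __name__ == "__main__":
    # IDs below 2**31 have an all-zero high half, so they come back unchanged.
    for item_id in (0, 4, 12345, 2**31 - 1):
        print(item_id, "->", id_to_index(item_id))

    # Larger IDs can land on an index that a small ID already uses.
    print(2**32 + 5, "->", id_to_index(2**32 + 5))
```

With this assumed scheme, IDs 4 and 2^32 + 5 both map to index 4, so their preference data would be merged and treated as a single item, which is the mixing described above.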

2011/9/20 张玉东 <zhangyudong@vancl.cn>:
> I am having trouble with this problem: if two item IDs are mapped to the same index,
> how is the similarity between them computed?
>
>
>