Subject: Re: why use the job 'itemIDIndex' to convert the itemid to index?
From: Sean Owen
To: user@mahout.apache.org
Date: Tue, 20 Sep 2011 11:37:21 +0100

It is a problem, but it should be rare. IDs are hashed to 31-bit integers, so the probability that any given pair of items collides is small. However, you don't need very many items before it becomes probable that some two have collided. (IIRC, that's at about 2^(31/2) items, i.e. roughly 46,000.)

In practice it doesn't hurt much. It just means that data from two different items has been mixed up and treated as if it were all from one item. That's not ideal, but it has a tiny overall effect on recommendations.

Another practical tip: if your item IDs all already fit into a non-negative 32-bit int (that is, 31 bits), then the hash function won't mix them up at all, because each of them hashes to itself.

2011/9/20 张玉东:
> I am trouble with this problem, if two itemids are mapped to the same index, then how to compute the similarity between them?
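
For anyone who wants to check the numbers in the reply above, here is a rough sketch in Python. It rests on assumptions that are not stated in the message: that the 64-bit ID is folded by XOR-ing its high and low 32-bit halves and masking off the sign bit (the same folding Java uses for a long's hash code), and that the standard birthday approximation is good enough for the collision estimate. The names id_to_index and collision_probability are made up for this illustration and are not Mahout APIs.

```python
import math

INDEX_BITS = 31
INDEX_SPACE = 1 << INDEX_BITS  # 2^31 possible index values


def id_to_index(item_id: int) -> int:
    """Fold a 64-bit ID into a 31-bit index (assumed scheme, see note above)."""
    u = item_id & 0xFFFFFFFFFFFFFFFF       # view the ID as an unsigned 64-bit value
    return (u ^ (u >> 32)) & 0x7FFFFFFF    # XOR high and low halves, keep 31 bits


def collision_probability(n_items: int, space: int = INDEX_SPACE) -> float:
    """Birthday approximation: chance that at least two of n_items share an index."""
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))


if __name__ == "__main__":
    # IDs that already fit into 31 bits map to themselves.
    print(id_to_index(42), id_to_index(2**31 - 1))

    # Two different IDs that land on the same index under this scheme.
    print(id_to_index(5), id_to_index(5 + 2**31))

    # Collision odds for a few catalogue sizes, including round(2^(31/2)).
    for n in (1_000, 10_000, 46_341, 100_000):
        print(n, round(collision_probability(n), 4))
```

Under these assumptions the chance of at least one collision is already substantial at about 2^(31/2) items and passes one half a little above that, which is consistent with the estimate in the reply. A collision only merges the data of the two affected items; every other item is untouched.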