Date: Wed, 30 Mar 2011 10:55:29 -0400
Subject: Re: new distance metric
From: Daniel McEnnis
To: dev@mahout.apache.org, ssc@apache.org

Sebastian,

It will be in the next patch. Thanks for the heads up.

Daniel.

On Wed, Mar 30, 2011 at 1:35 AM, Sebastian Schelter wrote:
> Hi Daniel,
>
> We would also need a "distributed" implementation of this new metric. Could
> you do that too?
>
> Shouldn't be too hard, just have a look at the other implementations in
> org.apache.mahout.math.hadoop.similarity.vector.
>
> --sebastian
>
>
> On 30.03.2011 00:40, Sean Owen wrote:
>>
>> Great, the best place for this would be a JIRA issue:
>> https://issues.apache.org/jira/browse/MAHOUT
>> I think it needs a bit of style work. For example, it ought to be very
>> much like TanimotoCoefficientSimilarity. If you copied that and edited
>> a few key methods, you'd be a lot closer I think.
>> I guess I find the core computation a little quirky:
>>
>>     double distance = preferring1+preferring2 - 2*intersection;
>>     if(distance < 1.0){
>>         distance=1.0-distance;
>>     }else{
>>         distance = -1.0 + 1.0 / distance;
>>     }
>>
>> distance is an int, so I think it's
>>
>>     int distance = preferring1+preferring2 - 2*intersection;
>>     if(distance == 0){
>>         distance=1;
>>     }else{
>>         distance = -1.0 + 1.0 / distance;
>>     }
>>
>> The resulting values are a little odd then -- it can return values in
>> [-1,0], or 1.
>>
>> By default I'd go with something more like "1.0 / (1.0 + distance)" I
>> suppose, though that's not somehow the one right way to map a distance
>> to a similarity -- though it would be consistent with
>> EuclideanDistanceSimilarity.
>>
>>
>> I'd actually welcome you to expand this idea and not just make a
>> "boolean pref" version of this but one that computes an actual
>> city-block distance for prefs with ratings too, for completeness.
>>
>>
>> I know this as "Manhattan distance". Is that an Americanism or is that
>> actually the more common name to anyone?
>>
>>
>>
>> On Tue, Mar 29, 2011 at 10:16 PM, Daniel McEnnis wrote:
>>>
>>> Dear,
>>>
>>> Here is a patch of a new distance metric for the collaborative
>>> filtering modules - CityBlockDistance.  With the 0 - 1 binary split on
>>> preference. KLDistance, AHDistance, and Symmetric KLDistance don't
>>> make sense.
>>>
>>> Daniel McEnnis.
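
For reference, the "1.0 / (1.0 + distance)" mapping suggested in the quoted message, and the extension to rated preferences, could be sketched roughly as follows. This is only an illustration and not the patch under discussion: the class CityBlockSketch and its method names are hypothetical, it does not use any Mahout API, and treating a missing rating as 0.0 is an assumption made here for simplicity.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Mahout.
final class CityBlockSketch {

    // Boolean preferences: count the items that exactly one of the two users prefers.
    static int booleanDistance(int preferring1, int preferring2, int intersection) {
        return preferring1 + preferring2 - 2 * intersection;
    }

    // Maps a non-negative distance into (0, 1]; a distance of 0 gives similarity 1.0.
    static double toSimilarity(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Rated preferences: sum of absolute rating differences over all items.
    // Assumption: an item rated by only one user counts as 0.0 for the other.
    static double ratedDistance(Map<Long, Double> prefs1, Map<Long, Double> prefs2) {
        Set<Long> items = new HashSet<>(prefs1.keySet());
        items.addAll(prefs2.keySet());
        double sum = 0.0;
        for (Long item : items) {
            double r1 = prefs1.getOrDefault(item, 0.0);
            double r2 = prefs2.getOrDefault(item, 0.0);
            sum += Math.abs(r1 - r2);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two users preferring 3 and 4 items, with 2 items in common.
        int d = booleanDistance(3, 4, 2);
        System.out.println(d + " -> " + toSimilarity(d));
    }
}
```

Unlike the branch on distance == 0 in the quoted code, this mapping needs no special case and decreases smoothly as the distance grows.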