From: Raviv Pavel
Date: Fri, 13 Jan 2012 18:57:22 +0200
Subject: Re: Clustering user profiles
To: user@mahout.apache.org

True. That's why I think I need a different distance measure for each
attribute of the user. The distance between coordinates on Earth is
different from the distance between ages, which in turn is different from
the distance between two sets of values.

I think the only solution would be to develop a custom distance measure
that's aware of the "meaning" of each dimension and returns the distance
accordingly, unless there is a way to vectorize user profiles that would
allow me to use one of the built-in distance measures.

--Raviv

On Fri, Jan 13, 2012 at 6:44 PM, Jeff Eastman wrote:

> Just remember that Longitude is a spherical coordinate and +179 is closer
> to -179 than their numeric difference. Latitude is spherical too but +89 is
> indeed quite far from -89.
>
> On 1/13/12 4:36 AM, StreetCat wrote:
>
>> The raw data had location expressed as strings such as "Paris, France" and
>> I translated them into coordinates, so measuring the distance between two
>> users' locations would be trivial.
>>
>> On Fri, Jan 13, 2012 at 1:19 PM, Dan Brickley wrote:
>>
>>> On 13 January 2012 12:02, Robert Stewart wrote:
>>>
>>>> Rather than using Gender as a single dimension, why not make Male and
>>>> Female as separate dimensions, with values 0 or 1 if True or False?
>>>
>>>>>> d[1] = 15.5 (latitude)
>>>>>> d[2] = 50.5 (longitude)
>>>
>>> Raw lat/long can be rather cryptic. The Geonames folk have Web
>>> services (and/or downloadable data) that map these to more socially
>>> relevant entities.
>>>
>>> See http://www.geonames.org/export/web-services.html#findNearby
>>> e.g.
>>> http://api.geonames.org/extendedFindNearby?lat=47.3&lng=9&username=demo
>>>
>>> There's also a lat/long to Wikipedia entry service, see
>>>
>>> http://www.geonames.org/export/wikipedia-webservice.html#findNearbyWikipedia
>>> ...which will get you entities known to DBpedia, Freebase etc.,
>>> allowing more national or regional features to be folded in if needed.
>>>
>>> Why have the machine learning layers re-learn stuff that can just be
>>> looked up in a free encyclopaedia? Better to enrich than
>>> rediscover...?
>>>
>>> cheers,
>>>
>>> Dan
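
The per-attribute distance idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout code and not anyone's actual implementation. The profile keys (lat, lon, age, interests), the function names, the equal default weights, and the 100-year age cap are all assumptions made for the example. Location uses the haversine great-circle formula, which handles the +179/-179 longitude wraparound; age uses a capped absolute difference; sets of values use the Jaccard distance. Each component is scaled to [0, 1] before the weighted average so that no attribute dominates purely because of its units.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def profile_distance(p, q, weights=(1.0, 1.0, 1.0), max_age_gap=100.0):
    """Weighted average of per-attribute distances, each scaled to [0, 1].

    p and q are dicts with the hypothetical keys lat, lon, age, interests.
    """
    # Half the Earth's circumference is the largest possible surface distance.
    geo = (haversine_km(p["lat"], p["lon"], q["lat"], q["lon"])
           / (math.pi * EARTH_RADIUS_KM))
    # Age gaps above max_age_gap (an assumed cap) count as maximally distant.
    age = min(abs(p["age"] - q["age"]) / max_age_gap, 1.0)
    sets = jaccard_distance(p["interests"], q["interests"])
    w_geo, w_age, w_sets = weights
    return (w_geo * geo + w_age * age + w_sets * sets) / (w_geo + w_age + w_sets)


if __name__ == "__main__":
    # Made-up profiles, for illustration only.
    alice = {"lat": 48.86, "lon": 2.35, "age": 31, "interests": {"music", "film"}}
    bob = {"lat": 35.68, "lon": 139.69, "age": 45, "interests": {"film", "golf"}}
    print(profile_distance(alice, bob))
```

The weights are the main tuning knob here: raising the first one makes geography matter more to the clustering, and so on. Whether such a combined measure can be plugged into a given clustering algorithm depends on what that algorithm assumes about the distance (for example, k-means centroids are not meaningful for set-valued attributes), so treat this as a starting point only.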