Date: Wed, 4 Jul 2012 10:42:16 +0300
Subject: Re: Approaches for combining multiple types of item data for user-user similarity
From: Sean Owen
To: user@mahout.apache.org

The best default answer is to put them all in one model. The math doesn't care what the things are. Unless you have a strong reason to weight one data set more heavily, I wouldn't. If you do, then two models is best. It is hard to weight a subset of the data within most similarity functions. I don't think it would work in Pearson, for instance, but it could work in Tanimoto.

On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler wrote:
> Hi all,
>
> I'm curious what approaches are recommended for generating user-user similarity, when I've got two (or more) distinct types of item data, both of which are fairly large.
>
> E.g. let's say I had a set of users where I knew both (a) what books they had bought on Amazon, and (b) what YouTube videos they had watched.
>
> For each user, I want to find the 10 most similar other users.
>
> - I could create two separate models, find the nearest 30 users for each user, and combine (maybe with weighting) the results.
> - I could toss all of the data into one model - and I could use a value of < 1.0 for whichever type of preference is less important.
>
> Any other suggestions? Input on the above two approaches?
>
> Thanks!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
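
To make the two options discussed in this thread concrete, here is a rough sketch in plain Python. It is not Mahout code and does not use any Mahout API. The function names (`tanimoto`, `top_neighbors`, `one_model`, `two_models`), the weights, and the toy users and items are all hypothetical, chosen only for illustration. It assumes boolean preferences (a user either has an item or not) and uses the Tanimoto coefficient for both options.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two item sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_neighbors(user, prefs, n):
    """Score every other user against `user` and keep the n most similar."""
    mine = prefs.get(user, set())  # a user may be missing from one data set
    scores = {other: tanimoto(mine, items)
              for other, items in prefs.items() if other != user}
    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def one_model(books, videos):
    """Option 1: merge both data sets into a single model.

    Item IDs are tagged with their type so a book and a video
    with the same raw ID cannot collide.
    """
    merged = defaultdict(set)
    for user, items in books.items():
        merged[user] |= {("book", i) for i in items}
    for user, items in videos.items():
        merged[user] |= {("video", i) for i in items}
    return dict(merged)


def two_models(user, books, videos, w_books, w_videos, n, k):
    """Option 2: take k candidates from each model, sum the weighted
    scores, and keep the n best. The weights are assumptions to tune."""
    combined = defaultdict(float)
    for other, score in top_neighbors(user, books, k).items():
        combined[other] += w_books * score
    for other, score in top_neighbors(user, videos, k).items():
        combined[other] += w_videos * score
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Hypothetical toy data: user -> set of item IDs.
books = {"ann": {"b1", "b2"}, "bob": {"b1", "b2"}, "cat": {"b3"}}
videos = {"ann": {"v1", "v2"}, "bob": {"v3"}, "cat": {"v1", "v2"}}

if __name__ == "__main__":
    # One model: bob and cat each share two items with ann, so they tie.
    print(top_neighbors("ann", one_model(books, videos), 2))
    # Two models, books weighted twice as heavily: bob ranks above cat.
    print(two_models("ann", books, videos, 1.0, 0.5, 2, 2))
```

In the merged model the two data sets contribute equally per item, which matches the "put them all in one model" default. The weighting only enters in `two_models`, where it is applied to whole similarity scores rather than inside the similarity function itself.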