In-Reply-To: <F3257A66-A941-4C55-AFF5-EC1CC31DFF31@transpac.com>
References: <F3257A66-A941-4C55-AFF5-EC1CC31DFF31@transpac.com>
Date: Wed, 4 Jul 2012 10:42:16 +0300
Message-ID: 
 <CAEccTyyx7s-NLgQcZoSmmhLfBdmA1B0cT24TXE3sA+kRe1ZbVA@mail.gmail.com>
Subject: Re: Approaches for combining multiple types of item data for
 user-user similarity
From: Sean Owen <srowen@gmail.com>
To: user@mahout.apache.org
Content-Type: text/plain; charset=UTF-8

The best default answer is to put them all in one model. The math
doesn't care what the things are. Unless you have a strong reason to
weight one data set more heavily than the other, I wouldn't. If you
do, then two models is best, because it is hard to weight a subset of
the data within most similarity functions. I don't think weighting
would work in Pearson, for instance, but it could work in Tanimoto.
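
To make the two options concrete, here is a minimal sketch in plain
Python rather than Mahout's API. Everything in it is an assumption for
illustration: the toy data, the function names (one_model, two_models,
most_similar), the 0.7/0.3 weights, and the use of a simple set-based
Tanimoto coefficient on boolean "bought/watched" data.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two sets of item IDs."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical toy data: user -> set of item IDs, one dict per data type.
books = {
    "u1": {"b1", "b2", "b3"},
    "u2": {"b1", "b2"},
    "u3": {"b9"},
}
videos = {
    "u1": {"v1", "v2"},
    "u2": {"v7"},
    "u3": {"v1", "v2"},
}

def one_model(user, other):
    """Single model: pool both item types, tagged so IDs cannot collide."""
    def pooled(u):
        return ({("book", i) for i in books.get(u, set())}
                | {("video", i) for i in videos.get(u, set())})
    return tanimoto(pooled(user), pooled(other))

def two_models(user, other, w_books=0.7, w_videos=0.3):
    """Two models: score each data type separately, then blend with weights."""
    s_b = tanimoto(books.get(user, set()), books.get(other, set()))
    s_v = tanimoto(videos.get(user, set()), videos.get(other, set()))
    return w_books * s_b + w_videos * s_v

def most_similar(user, score, n=10):
    """Rank all other users by the given similarity function (ties by name)."""
    users = (set(books) | set(videos)) - {user}
    return sorted(users, key=lambda o: (-score(user, o), o))[:n]

print(most_similar("u1", one_model))   # pooled model, no weighting
print(most_similar("u1", two_models))  # per-type scores, weighted blend
```

In this toy data the pooled model scores u2 and u3 identically against
u1, while the two-model blend ranks them according to whichever data
type carries the larger weight. That is the only thing the second
approach buys you, so it is worth the extra machinery only if you
really do want that weighting.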

On Wed, Jul 4, 2012 at 1:20 AM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> Hi all,
>
> I'm curious what approaches are recommended for generating user-user similarity, when I've got two (or more) distinct types of item data, both of which are fairly large.
>
> E.g. let's say I had a set of users where I knew both (a) what books they had bought on Amazon, and (b) what YouTube videos they had watched.
>
> For each user, I want to find the 10 most similar other users.
>
>  - I could create two separate models, find the nearest 30 users for each user, and combine (maybe with weighting) the results.
>  - I could toss all of the data into one model - and I could use a value of < 1.0 for whichever type of preference is less important.
>
> Any other suggestions? Input on the above two approaches?
>
> Thanks!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr