mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Similarity between users' groups
Date Sat, 02 Jul 2011 21:22:47 GMT
"Reservoir sampling" lets you build good per-user sample sets. This issue has
code demonstrating the approach:

https://issues.apache.org/jira/browse/MAHOUT-676

How to do this in an efficient way? No idea.
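
For reference, here is a minimal single-machine sketch of what per-user
reservoir sampling could look like. It is not the code from MAHOUT-676; the
function name cap_per_user, the (user, item) record layout and the
max_per_user parameter are assumptions made for illustration. It caps every
user at a fixed number of items instead of sampling at a constant rate, and it
does not address the question of doing this efficiently on a cluster.

```python
import random
from collections import defaultdict


def cap_per_user(records, max_per_user, seed=None):
    """Keep at most max_per_user items per user from a stream of (user, item) pairs.

    Hypothetical helper: one reservoir per user, filled with Algorithm R,
    so each of a user's items is equally likely to survive.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # user -> items currently kept
    seen = defaultdict(int)         # user -> number of items seen so far

    for user, item in records:
        seen[user] += 1
        reservoir = reservoirs[user]
        if len(reservoir) < max_per_user:
            # Reservoir not full yet: always keep the item.
            reservoir.append(item)
        else:
            # Draw a slot in [0, seen); the new item is kept only if the
            # slot falls inside the reservoir, i.e. with probability
            # max_per_user / seen[user].
            slot = rng.randrange(seen[user])
            if slot < max_per_user:
                reservoir[slot] = item

    return dict(reservoirs)


if __name__ == "__main__":
    # One heavy user and one light user; only the heavy one is downsampled.
    stream = [("heavy", i) for i in range(1000)] + [("light", i) for i in range(3)]
    sample = cap_per_user(stream, max_per_user=10, seed=42)
    print({user: len(items) for user, items in sample.items()})
```

Users with fewer than max_per_user items are kept whole, so small users lose
nothing while heavy users are bounded. Memory grows with the number of users
times max_per_user, which is one reason this stays a sketch.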

On Sat, Jul 2, 2011 at 9:18 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Don't sample at a constant rate.
>
> Either downsample user ratings so that no user has more than a reasonable
> number of ratings or downsample users so that no thing has more than a
> reasonable number of users rating it.
>
> I generally prefer the former, but either should be fine.
>
> On Sat, Jul 2, 2011 at 3:47 AM, Radek Maciaszek <radek@maciaszek.co.uk> wrote:
>
>> Hello,
>>
>> This project was put on hold for a while, so I only recently had time to look
>> into it. I have been thinking about down-sampling and different sampling
>> strategies.
>>
>> What would be the minimum rate for sampling the users? Right now I sample 1
>> in 256 users. But if there are only 400 users in a group, I will not get
>> as good an estimate as if there were 10k users. I am trying to work out
>> a strategy for downsampling.
>>
>> I was hoping there would be some statistical way of estimating the sampling
>> ratio?
>>
>> Cheers,
>> Radek
>>
>> On 18 February 2011 18:04, Sebastian Schelter <ssc@apache.org> wrote:
>>
>> > This shouldn't be too difficult and would maybe make a good newcomer or
>> > student project.
>> >
>> > --sebastian
>> >
>> > Am 18.02.2011 18:19, schrieb Ted Dunning:
>> > > A better way to sample is to find groups with a very large number of users
>> > > and downsample the number of users to a maximum of about 1000 (or even 200
>> > > if you want to be more aggressive).  Do the same with users.
>> > >
>> > > That won't delete a whole lot of data volume, but it will make most
>> > > recommendation algorithms go much faster.  The idea is that after you have
>> > > 200 or more users in a group, you aren't learning anything new anyway.
>> > >
>> > > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
>> > > <radek.maciaszek@gmail.com> wrote:
>> > >
>> > >>  Each user can belong to
>> > >> many groups so the number of combinations is rather big. In fact this number
>> > >> of combinations is so large that I am considering sampling the users and only
>> > >> analysing 1 in about 256 users. So essentially I would have about 1000+ groups
>> > >> and about 150k users. Since one user can potentially belong to many dozens
>> > >> of groups this will easily go into millions of records anyway but perhaps
>> > >> will be lower than the 100M margin you mentioned.
>> > >>
>> > >
>> >
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com
