mahout-user mailing list archives

From Radek Maciaszek <ra...@maciaszek.co.uk>
Subject Re: Similarity between users' groups
Date Sat, 02 Jul 2011 10:47:14 GMT
Hello,

This project was put on hold for a while, so I only had time to look into
it recently. I have been thinking about down-sampling and different
sampling strategies.

What would be the minimum rate for sampling the users? Right now I sample 1
in 256 users. But if there are only 400 users in a group, I will not get
as good an estimate as I would with 10k users. I am trying to work out
the right strategy for downsampling.

I was hoping there would be some statistical way of estimating the sampling
ratio?
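
One standard starting point, sketched here under assumptions that are not
stated anywhere in this thread, is the textbook sample-size formula for
estimating a proportion (Cochran's formula) with a finite population
correction. The function name, the 5% margin of error, the 95% confidence
level (z = 1.96) and the worst-case proportion p = 0.5 are all illustrative
choices, and whether a per-group proportion is the right quantity to estimate
depends on the similarity measure being used.

```python
import math


def required_sample_size(group_size, margin=0.05, z=1.96, p=0.5):
    """Users to sample from one group to estimate a proportion.

    Hypothetical helper: margin, z and p are assumed defaults,
    not values taken from the thread.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks n0 for small groups.
    n = n0 / (1 + (n0 - 1) / group_size)
    # Never ask for more users than the group contains.
    return min(group_size, math.ceil(n))


def sampling_rate(group_size, **kwargs):
    """Per-group sampling rate implied by the required sample size."""
    return required_sample_size(group_size, **kwargs) / group_size


if __name__ == "__main__":
    for size in (400, 10_000):
        n = required_sample_size(size)
        print(size, n, round(sampling_rate(size), 4))
```

Under these assumptions the required sample grows very slowly with group
size, so the sampling rate would be chosen per group rather than fixed at
1 in 256 for everyone.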

Cheers,
Radek

On 18 February 2011 18:04, Sebastian Schelter <ssc@apache.org> wrote:

> This shouldn't be too difficult and would maybe make a good newcomer or
> student project.
>
> --sebastian
>
> Am 18.02.2011 18:19, schrieb Ted Dunning:
> > A better way to sample is to find groups with a very large number of
> > users and downsample the number of users to a maximum of about 1000 (or
> > even 200 if you want to be more aggressive).  Do the same with users.
> >
> > That won't delete a whole lot of data volume, but it will make most
> > recommendation algorithms go much faster.  The idea is that after you
> > have 200 or more users in a group, you aren't learning anything new
> > anyway.
> >
> > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
> > <radek.maciaszek@gmail.com> wrote:
> >
> >>  Each user can belong to many groups so the number of combinations is
> >> rather big. In fact this number of combinations is so large that I am
> >> considering sampling the users and only analysing 1 in about 256 users.
> >> So essentially I would have about 1000+ groups and about 150k users.
> >> Since one user can potentially belong to many dozens of groups this
> >> will easily go into millions of records anyway, but perhaps will be
> >> lower than the 100M margin you mentioned.
> >>
> >
>
>
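
The capped downsampling suggested in the quoted reply (keep every user in
small groups, and at most about 1000 users in very large ones) can be
sketched as follows. This is only an illustration: the function name, the
(user_id, group_id) pair layout and the fixed random seed are assumptions,
not something specified in the thread, and it covers only the per-group cap.

```python
import random
from collections import defaultdict


def cap_group_sizes(memberships, max_users=1000, seed=42):
    """Downsample each group to at most max_users members.

    memberships: iterable of (user_id, group_id) pairs (assumed layout).
    Returns a list of the retained (user_id, group_id) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable

    # Collect the members of every group.
    users_by_group = defaultdict(list)
    for user_id, group_id in memberships:
        users_by_group[group_id].append(user_id)

    kept = []
    for group_id, users in users_by_group.items():
        # Only groups above the cap are sampled; small groups stay whole.
        if len(users) > max_users:
            users = rng.sample(users, max_users)
        kept.extend((user_id, group_id) for user_id in users)
    return kept


if __name__ == "__main__":
    data = [(u, "big") for u in range(5000)] + [(u, "small") for u in range(400)]
    kept = cap_group_sizes(data)
    print(sum(1 for _, g in kept if g == "big"))
    print(sum(1 for _, g in kept if g == "small"))
```

Unlike a flat 1 in 256 rate, a cap of this kind leaves a 400-user group
untouched while still bounding the cost of the largest groups.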
