mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Similarity between users' groups
Date Sat, 02 Jul 2011 16:18:30 GMT
Don't sample at a constant rate.

Either downsample user ratings so that no user has more than a reasonable
number of ratings or downsample users so that no thing has more than a
reasonable number of users rating it.

I generally prefer the former, but either should be fine.
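
A minimal sketch of that capping idea, in case it helps. It assumes the data
is a list of (user, item) pairs held in memory; the function name
cap_per_key and the cap of 1000 are illustrative choices, not anything from
Mahout.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key_index, max_per_key, seed=0):
    """Keep at most max_per_key randomly chosen pairs for each key.

    key_index=0 caps the number of ratings per user;
    key_index=1 caps the number of users per item.
    Keys that are already under the cap are left untouched.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_key = defaultdict(list)
    for pair in pairs:
        by_key[pair[key_index]].append(pair)

    kept = []
    for items in by_key.values():
        if len(items) > max_per_key:
            # Only heavy keys are downsampled, so the effective sampling
            # rate varies per key instead of being constant.
            items = rng.sample(items, max_per_key)
        kept.extend(items)
    return kept


# Hypothetical usage: cap each user at 1000 ratings.
# capped = cap_per_key(ratings, key_index=0, max_per_key=1000)
```

Unlike a flat 1-in-256 sample, a group with only 400 users keeps all of
them, and only the very large groups (or very active users) get cut down.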

On Sat, Jul 2, 2011 at 3:47 AM, Radek Maciaszek <radek@maciaszek.co.uk> wrote:

> Hello,
>
> This project was put on hold for a while, so I only had time to look into
> it recently. I have been thinking about down-sampling and different
> sampling strategies.
>
> What would be the minimum rate for sampling the users? Right now I sample 1
> in 256 users. But if there are only 400 users in a group, I will not get
> as good an estimate as if there were 10k users. I am trying to work out
> a strategy for downsampling.
>
> I was hoping there would be some statistical way of estimating the sampling
> ratio?
>
> Cheers,
> Radek
>
> On 18 February 2011 18:04, Sebastian Schelter <ssc@apache.org> wrote:
>
> > This shouldn't be too difficult and would maybe make a good newcomer or
> > student project.
> >
> > --sebastian
> >
> > On 18.02.2011 18:19, Ted Dunning wrote:
> > > A better way to sample is to find groups with a very large number of
> > > users and downsample the number of users to a maximum of about 1000
> > > (or even 200 if you want to be more aggressive).  Do the same with users.
> > >
> > > That won't remove a whole lot of data volume, but it will make most
> > > recommendation algorithms go much faster.  The idea is that after you
> > > have 200 or more users in a group, you aren't learning anything new anyway.
> > >
> > > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
> > > <radek.maciaszek@gmail.com> wrote:
> > >
> > >> Each user can belong to many groups, so the number of combinations is
> > >> rather big. In fact, this number of combinations is so large that I am
> > >> considering sampling the users and only analysing 1 in about 256 users.
> > >> So essentially I would have about 1000+ groups and about 150k users.
> > >> Since one user can potentially belong to many dozens of groups, this
> > >> will easily go into millions of records anyway, but perhaps it will be
> > >> lower than the 100M margin you mentioned.
> > >>
> > >
> >
> >
>
