From: Radek Maciaszek
Date: Sat, 2 Jul 2011 11:47:14 +0100
Subject: Re: Similarity between users' groups
To: user@mahout.apache.org, ssc@apache.org

Hello,

This project was put on hold for a while, so I only had time to look into it
recently. I have been thinking about down-sampling and different sampling
strategies. What is the minimum rate at which I should sample the users?
Right now I sample 1 in 256 users. But a group with only 400 users keeps just
one or two users on average at that rate, so I will not get as good an
estimate for it as I would for a group with 10k users.

I am trying to work out a strategy for down-sampling. Is there a statistical
way of estimating the sampling ratio?

Cheers,
Radek

On 18 February 2011 18:04, Sebastian Schelter wrote:

> This shouldn't be too difficult and would maybe make a good newcomer or
> student project.
>
> --sebastian
>
> On 18.02.2011 18:19, Ted Dunning wrote:
> > A better way to sample is to find groups with a very large number of users
> > and downsample the number of users to a maximum of about 1000 (or even 200
> > if you want to be more aggressive). Do the same with users.
> >
> > That won't delete a whole lot of data volume, but it will make most
> > recommendation algorithms go much faster. The idea is that after you have
> > 200 or more users in a group, you aren't learning anything new anyway.
> >
> > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek wrote:
> >
> >> Each user can belong to many groups so the number of combinations is
> >> rather big. In fact this number of combinations is so large that I am
> >> considering sampling the users and only analysing 1 in about 256 users.
> >> So essentially I would have about 1000+ groups and about 150k users.
> >> Since one user can potentially belong to many dozens of groups this
> >> will easily go into millions of records anyway, but perhaps will be
> >> lower than the 100M margin you mentioned.
> >>
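
For comparison, here is a minimal Python sketch of the per-group cap Ted Dunning describes in the quoted message, set against uniform 1-in-256 sampling. It assumes memberships are available as (user_id, group_id) pairs. The function names, the pair layout, and the seed parameter are hypothetical illustrations and are not part of Mahout.

```python
import random
from collections import defaultdict


def cap_group_sizes(memberships, max_users=200, seed=0):
    """Keep at most max_users randomly chosen users per group.

    memberships: iterable of (user_id, group_id) pairs (assumed layout).
    Returns a dict mapping group_id -> sorted list of kept user_ids.
    """
    rng = random.Random(seed)

    # Collect the distinct users of every group.
    groups = defaultdict(set)
    for user_id, group_id in memberships:
        groups[group_id].add(user_id)

    sampled = {}
    for group_id, users in groups.items():
        users = sorted(users)  # fixed order so the seed gives repeatable output
        if len(users) > max_users:
            # Only oversized groups are thinned; small groups stay intact.
            users = sorted(rng.sample(users, max_users))
        sampled[group_id] = users
    return sampled


def expected_uniform_sample(group_size, rate=1 / 256):
    """Expected number of users left in a group after uniform sampling."""
    return group_size * rate
```

With a cap of 200, a 400-user group and a 10k-user group both keep 200 users, whereas uniform 1-in-256 sampling leaves an expected 400 / 256 ≈ 1.6 users in the smaller group. The cap value itself is the figure suggested in the thread, not a derived optimum.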