From: Hector Yee
Subject: Re: Measuring randomness
Date: Wed, 1 Jun 2011 09:10:00 -0700
To: "user@mahout.apache.org"
Message-Id: <2A3A1B09-75A4-4193-8731-C1600E0E34F8@gmail.com>

You'll probably see more difference if the data was ordered no? For the
reason you said below about R sampling the entire set and BR sampling the
first X

Sent from my iPad

On Jun 1, 2011, at 12:31 AM, Lance Norskog wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) v.s. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
>
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0-999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
>
> Does this make sense? In this measurement is small or large deviation
> better? What is another way to measure it?
>
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> Really I used the OnlineSummarizer and did deviations of
> mean/median/25 percentile/75 percentile.
>
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
>
> Barbie says: "designing experiments is hard!"
>
>
> --
> Lance Norskog
> goksron@gmail.com
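
For anyone who wants to try the bake-off from the quoted message, here is a
minimal Python sketch of it. It is not the Mahout code and does not use
OnlineSummarizer. The function names (bernoulli_sample, reservoir_sample,
median_spread), the 10% / 100-item sample sizes and the fixed seed are all
assumptions made for illustration, and "take the median" is read here as the
median of each subsample. The reservoir version is the textbook "Algorithm R",
which may differ from the sampler under test.

```python
import random
import statistics


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]


def reservoir_sample(items, k, rng):
    """Textbook Algorithm R: keep the first k items, then replace at random."""
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def median_spread(sampler, trials=20, size=1000, seed=0):
    """Standard deviation of the subsample medians over `trials` runs."""
    rng = random.Random(seed)  # fixed seed (assumption) so runs are repeatable
    medians = []
    for _ in range(trials):
        data = list(range(size))
        rng.shuffle(data)  # "Permute this list"
        sample = sampler(data, rng)
        if sample:  # a Bernoulli sample can come back empty
            medians.append(statistics.median(sample))
    return statistics.stdev(medians)


if __name__ == "__main__":
    b = median_spread(lambda d, r: bernoulli_sample(d, 0.1, r))
    r = median_spread(lambda d, r: reservoir_sample(d, 100, r))
    print("Bernoulli spread:", b)
    print("Reservoir spread:", r)
```

To try the ordered-data idea from the reply, one option is to drop the
rng.shuffle(data) call so both samplers see the list in sorted order.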