From: Ted Dunning
Date: Wed, 1 Jun 2011 07:33:32 -0700
Subject: Re: Measuring randomness
To: user@mahout.apache.org

On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen wrote:

> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.

And how that number is specified. And whether order is preserved. And
whether you get samples along the way so that you can overlap computation
with I/O.

> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.

It is asymptotically normal under fairly broad assumptions. For a normal
underlying distribution, it converges very quickly. For a wacky underlying
distribution like the Cauchy, it converges less quickly.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598

> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?

But this doesn't really answer an important question, because the
underlying data was sampled from the same distribution and a variety of
defective samplers would give similar results.

> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if
> N>2) as this one, when clearly it has a problem.

And I think that this sort of thing is the key question. Make sure that you
use sorted data as one test input. Compute a full (exact) median of the
samples, because OnlineSummarizer doesn't like ordered data.

> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?

Since the samplers are equal in quality by design, the only problem I can
imagine is a coding error.
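
To make the point about defective samplers concrete, here is a minimal sketch in Python (not Mahout code) of the kind of check suggested above. Everything in it is an assumption made for illustration: the function names, the population of 1000 sorted integers, the sample size of 100, the 1000 trials, and the choice of a plain reservoir sampler and a per-element coin flip as the two "good" samplers. The defective sampler is a hypothetical bug that can never pick the first or last element. For each sampler the sketch records how often every element is chosen and compares that count with its binomial expectation, and it also computes an exact median of every sample.

```python
import random
from collections import Counter
from statistics import mean, median


def bernoulli_sample(items, p, rng):
    # Keep each item independently with probability p: order is preserved,
    # but the sample size is random.
    return [x for x in items if rng.random() < p]


def reservoir_sample(items, k, rng):
    # Algorithm R: exactly k items, each included with probability k/len(items).
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = x
    return reservoir


def defective_sample(items, k, rng):
    # Deliberately broken (hypothetical bug): the first and last items
    # can never be chosen.
    return reservoir_sample(items[1:-1], k, rng)


def check(sampler, items, p, trials, rng):
    # Returns (largest |z| of any item's inclusion count, mean of exact medians).
    counts = Counter()
    medians = []
    for _ in range(trials):
        sample = sampler(items, rng)
        counts.update(sample)
        medians.append(median(sample))
    expected = trials * p
    sd = (trials * p * (1 - p)) ** 0.5
    worst_z = max(abs(counts[x] - expected) / sd for x in items)
    return worst_z, mean(medians)


def run_checks(seed=42, n=1000, k=100, trials=1000):
    rng = random.Random(seed)
    items = list(range(n))  # sorted input, as one of the test inputs
    p = k / n
    samplers = {
        "bernoulli": lambda xs, r: bernoulli_sample(xs, p, r),
        "reservoir": lambda xs, r: reservoir_sample(xs, k, r),
        "defective": lambda xs, r: defective_sample(xs, k, r),
    }
    return {name: check(s, items, p, trials, rng) for name, s in samplers.items()}


if __name__ == "__main__":
    for name, (worst_z, mean_median) in run_checks().items():
        print(f"{name:10s} worst |z| = {worst_z:5.1f}   mean median = {mean_median:6.1f}")
```

With these settings the defective sampler's endpoints are never chosen, so their z-scores are about 10.5 by construction (an expected count of 100 with a standard deviation of sqrt(90)), while the mean of the medians should sit near 499.5 for all three samplers. That is the sense in which a median-based test alone would not separate them, and a per-element inclusion count on sorted input would.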