MIME-Version: 1.0
In-Reply-To: <BANLkTimwAoFD+A_t-fFd16DpHGejy1uxBg@mail.gmail.com>
References: <BANLkTi=zn3Y3CiZ7emhBC_6qy_hJ1JH2Kw@mail.gmail.com>
 <BANLkTimwAoFD+A_t-fFd16DpHGejy1uxBg@mail.gmail.com>
From: Ted Dunning <ted.dunning@gmail.com>
Date: Wed, 1 Jun 2011 07:33:32 -0700
Message-ID: <BANLkTinnVoQbUKos_8ZBBQfNydY_5dvwqA@mail.gmail.com>
Subject: Re: Measuring randomness
To: user@mahout.apache.org
Content-Type: multipart/alternative; boundary=bcaec5016221d3b34604a4a768a9

--bcaec5016221d3b34604a4a768a9
Content-Type: text/plain; charset=UTF-8

On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen <srowen@gmail.com> wrote:

> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.
>

And how that number is specified.  And whether order is preserved.  And
whether you get samples along the way so that you can overlap computation
with I/O.
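
To make those differences concrete, here is a rough sketch in Python. It
assumes the two samplers being compared are a per-element (Bernoulli) sampler
and a fixed-size reservoir sampler; the thread does not spell that out, so
treat the function names and the choice of algorithms as illustrative only.

```python
import random


def bernoulli_sample(items, p, rng):
    """Yield each item independently with probability p.

    Results stream out in input order as the input is read, so downstream
    computation can overlap with I/O. The sample size is random.
    """
    for item in items:
        if rng.random() < p:
            yield item


def reservoir_sample(items, n, rng):
    """Return exactly n items (fewer if the input is shorter), Algorithm R.

    Nothing is final until the whole input has been read, and the output
    is generally not in input order.
    """
    reservoir = []
    for i, item in enumerate(items):
        if i < n:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir


rng = random.Random(42)
data = range(1000)

streamed = list(bernoulli_sample(data, 20 / 1000, rng))
fixed = reservoir_sample(data, 20, rng)

print("bernoulli size:", len(streamed), "ordered:", streamed == sorted(streamed))
print("reservoir size:", len(fixed), "ordered:", fixed == sorted(fixed))
```

In the first sampler you specify a probability and get a variable count; in
the second you specify the count directly and give up ordering and streaming.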

> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.


It is asymptotically normal under pretty broad assumptions.  For a normal
underlying distribution, it converges very quickly.  For a wacky underlying
distribution like the Cauchy, less quickly.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598
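
A quick way to see this numerically is to simulate sample medians and compare
their spread with the standard asymptotic value 1 / (2 f(m) sqrt(n)), where
f(m) is the density at the true median. The sketch below does that for a
standard normal and a standard Cauchy. The sample size, trial count, and
helper names are my own choices, and it only checks the spread, not the full
shape of the distribution.

```python
import math
import random
import statistics


def median_spread(draw, n, trials, rng):
    """Empirical standard deviation of the sample median over many trials."""
    medians = [
        statistics.median(draw(rng) for _ in range(n)) for _ in range(trials)
    ]
    return statistics.pstdev(medians)


def asymptotic_sd(density_at_median, n):
    """Asymptotic standard deviation of the sample median: 1 / (2 f(m) sqrt(n))."""
    return 1.0 / (2.0 * density_at_median * math.sqrt(n))


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_cauchy(rng):
    # Inverse-CDF sampling for the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(1)
n, trials = 101, 2000

normal_sd = median_spread(draw_normal, n, trials, rng)
cauchy_sd = median_spread(draw_cauchy, n, trials, rng)

# Density at the median: 1/sqrt(2*pi) for the normal, 1/pi for the Cauchy.
normal_theory = asymptotic_sd(1.0 / math.sqrt(2.0 * math.pi), n)
cauchy_theory = asymptotic_sd(1.0 / math.pi, n)

print("normal: empirical %.4f, asymptotic %.4f" % (normal_sd, normal_theory))
print("cauchy: empirical %.4f, asymptotic %.4f" % (cauchy_sd, cauchy_theory))
```

Rerunning with a smaller n should show the Cauchy case drifting further from
its asymptotic value than the normal case does.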


> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?
>

But this doesn't really answer an important question because the underlying
data was sampled from the same distribution and a variety of defective
samplers would give similar results.


> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if N>2)
> as this one, when clearly it has a problem.
>

And I think that this sort of thing is the key question.

Make sure that you use sorted data as one test input.  Compute the exact
median of the samples rather than relying on OnlineSummarizer, which doesn't
handle ordered data well.
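
Here is a sketch of the kind of test I have in mind, in Python rather than
against the Mahout classes. It feeds sorted input to a correct sampler and to
a hypothetical defective one that never picks the first or last element, then
looks at both the exact median of each sample and the per-element inclusion
counts. Both samplers and the helper are made up for illustration.

```python
import random
import statistics


def good_sampler(items, p, rng):
    """Pick each element independently with probability p."""
    return [x for x in items if rng.random() < p]


def defective_sampler(items, p, rng):
    """Hypothetical bug: the first and last elements can never be picked."""
    return [x for x in items[1:-1] if rng.random() < p]


def profile(sampler, items, p, runs, rng):
    """Run the sampler repeatedly; return inclusion counts and exact medians.

    Assumes items are the integers 0..len(items)-1, so a value doubles as
    its own index into the counts list.
    """
    counts = [0] * len(items)
    medians = []
    for _ in range(runs):
        sample = sampler(items, p, rng)
        for x in sample:
            counts[x] += 1
        if sample:  # an empty sample has no median
            medians.append(statistics.median(sample))
    return counts, medians


rng = random.Random(7)
data = list(range(1000))  # sorted input
p, runs = 20 / 1000, 2000

good_counts, good_medians = profile(good_sampler, data, p, runs, rng)
bad_counts, bad_medians = profile(defective_sampler, data, p, runs, rng)

print("mean median, good:", statistics.mean(good_medians))
print("mean median, bad: ", statistics.mean(bad_medians))
print("times 0 and 999 picked, good:", good_counts[0], good_counts[-1])
print("times 0 and 999 picked, bad: ", bad_counts[0], bad_counts[-1])
```

The medians of the two samplers should be practically indistinguishable,
while the inclusion counts at the ends expose the defect immediately.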


> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?
>

Since the samplers are equal in quality by design, the only problem I can
imagine is code error.

--bcaec5016221d3b34604a4a768a9--