On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen wrote:
> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and
> they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.
>
And how that number is specified. And whether order is preserved. And
whether you get samples along the way so that you can overlap computation
with I/O.
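For concreteness, here is roughly the kind of streaming sampler I mean
(just a sketch; StreamingSampler is a made-up name, not Mahout's actual
API). Each element is kept independently with probability p = n/1000, in
input order, and kept elements are handed to the caller as soon as they
are seen, so downstream work can proceed while the input is still being
read:

    import java.util.Iterator;
    import java.util.NoSuchElementException;
    import java.util.Random;

    // Bernoulli sampler that emits kept elements immediately, preserving
    // input order and letting computation overlap with I/O.
    public class StreamingSampler<T> implements Iterator<T> {
      private final Iterator<T> input;
      private final double p;      // per-element keep probability, e.g. n / 1000.0
      private final Random rng;
      private T next;

      public StreamingSampler(Iterator<T> input, double p, Random rng) {
        this.input = input;
        this.p = p;
        this.rng = rng;
        advance();
      }

      private void advance() {
        next = null;
        while (input.hasNext()) {
          T candidate = input.next();
          if (rng.nextDouble() < p) {  // keep with probability p
            next = candidate;
            return;
          }
        }
      }

      public boolean hasNext() { return next != null; }

      public T next() {
        if (next == null) {
          throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
      }
    }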
> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.
It is asymptotically normal under pretty broad assumptions. For a normal
underlying distribution, it converges very quickly. For a wacky underlying
distribution like the Cauchy, less quickly.
http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598
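As a quick illustration (my own simulation sketch, not part of the proposed
test): for the uniform distribution on [0, 1000), the true median is 500
and the density there is 1/1000, so the asymptotic standard deviation of
the sample median, 1/(2 sqrt(n) f(m)), works out to 500/sqrt(n). A few
thousand simulated medians match that closely even for modest n:

    import java.util.Arrays;
    import java.util.Random;

    public class MedianDemo {
      public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 101;       // draws per sample; odd, so the median is one order statistic
        int reps = 10000;  // number of medians to collect
        double[] medians = new double[reps];
        for (int r = 0; r < reps; r++) {
          double[] x = new double[n];
          for (int i = 0; i < n; i++) {
            x[i] = 1000 * rng.nextDouble();  // uniform on [0, 1000)
          }
          Arrays.sort(x);
          medians[r] = x[n / 2];
        }
        double mean = Arrays.stream(medians).average().orElse(0);
        double var = Arrays.stream(medians)
            .map(m -> (m - mean) * (m - mean)).average().orElse(0);
        System.out.printf("empirical sd = %.2f, asymptotic sd = %.2f%n",
            Math.sqrt(var), 500 / Math.sqrt(n));
      }
    }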
> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?
>
But this doesn't really answer the important question: the sampled values
come from the same distribution as the underlying data, so a variety of
defective samplers would give similar results.
> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if
> N>2)
> as this one, when clearly it has a problem.
>
And I think that this sort of thing is the key question.
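Something along these lines catches that class of bug directly (a sketch;
the Bernoulli sample() below is only a stand-in for whatever sampler is
under test): run the sampler many times, count how often each of the 1000
elements is chosen, and flag any element whose count is far from the
binomial expectation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class CoverageTest {
      public static void main(String[] args) {
        int size = 1000;   // elements 0..999
        int n = 20;        // target sample size, so p = n / 1000
        int trials = 100000;
        double p = (double) n / size;
        int[] counts = new int[size];
        Random rng = new Random(1);
        for (int t = 0; t < trials; t++) {
          for (int element : sample(size, p, rng)) {
            counts[element]++;
          }
        }
        double expected = trials * p;                 // about 2000 here
        double sd = Math.sqrt(trials * p * (1 - p));  // about 44 here
        for (int i = 0; i < size; i++) {
          // A sampler that never picks 0 or 999 fails here immediately.
          if (Math.abs(counts[i] - expected) > 5 * sd) {
            System.out.println("element " + i + " chosen " + counts[i]
                + " times, expected about " + (int) expected);
          }
        }
      }

      // Trivial reference sampler, only here to make the sketch runnable.
      static List<Integer> sample(int size, double p, Random rng) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
          if (rng.nextDouble() < p) {
            result.add(i);
          }
        }
        return result;
      }
    }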
Make sure that you use sorted data as one test input. Do a full median of
the samples because OnlineSummarizer doesn't like ordered data.
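Concretely, that means sorting the sampled values and taking the middle
element outright (a sketch; exactMedian is my name, not an existing
utility), instead of pushing them through OnlineSummarizer, whose
incremental estimate degrades on ordered input:

    import java.util.Arrays;

    class MedianUtil {
      // Exact median by sorting a copy; no streaming estimate involved.
      static double exactMedian(double[] values) {
        double[] copy = values.clone();
        Arrays.sort(copy);
        int mid = copy.length / 2;
        return copy.length % 2 == 1
            ? copy[mid]
            : 0.5 * (copy[mid - 1] + copy[mid]);
      }
    }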
> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?
>
Since the samplers are equal in quality by design, the only problem I can
imagine is code error.