On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen wrote:

> In both cases, every element is picked with probability N/1000. That is
> the purest sense in which these processes can be wrong or right, to me,
> and they are both exactly as good as the underlying pseudo-random number
> generator. The difference is not their quality, but the number of
> elements that are chosen.

And how that number is specified. And whether order is preserved. And
whether you get samples along the way so that you can overlap computation
with I/O.

> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.

It is asymptotically normal, under pretty broad assumptions. For a normal
underlying distribution it converges very quickly; for a whacky underlying
distribution like the Cauchy, less quickly.

http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598

> But that would be your question then -- how likely is it that the 20
> observed values are generated by this distribution?

But this doesn't really answer an important question, because the
underlying data was sampled from the same distribution, and a variety of
defective samplers would give similar results.

> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if
> N > 2) as this one, when clearly it has a problem.

And I think that this sort of thing is the key question. Make sure that you
use sorted data as one test input. Compute a full, exact median of the
samples, because OnlineSummarizer doesn't like ordered data.

> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?

Since the samplers are equal in quality by design, the only problem I can
imagine is a code error.
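
To make the kind of test I have in mind concrete, here is a rough sketch in
plain Java. It is not the actual Mahout sampler API; the Sampler interface,
the Bernoulli stand-in sampler, the seed, and the sample size are all
placeholders for whatever is actually under test. The statistical check it
uses is the standard one: for a density f with median m, the median of n
draws is approximately normal with mean m and variance 1/(4 n f(m)^2), which
for uniform data on [0, 1000) gives a standard deviation of about
1000 / (2 sqrt(n)), roughly 112 when n = 20. The sketch feeds the sampler
sorted data, takes an exact median of each sample rather than using
OnlineSummarizer, and also checks that 0 and 999 do show up across runs.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class SamplerMedianCheck {

      // Placeholder for whatever sampler is under test: given the full input,
      // return the sampled subset. This is NOT the Mahout API, just a stand-in.
      interface Sampler {
        List<Integer> sample(List<Integer> input);
      }

      // Exact median of the sample, deliberately not OnlineSummarizer.
      static double exactMedian(List<Integer> sample) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
            ? sorted.get(n / 2)
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
      }

      public static void main(String[] args) {
        // Sorted input on purpose: a sampler whose behavior depends on input
        // order should fail this test.
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
          data.add(i);
        }

        int n = 20;              // target sample size, as in the example above
        Random rng = new Random(42);

        // Stand-in sampler: pick each element with probability n / 1000.
        Sampler sampler = input -> {
          List<Integer> out = new ArrayList<>();
          for (int x : input) {
            if (rng.nextDouble() < n / 1000.0) {
              out.add(x);
            }
          }
          return out;
        };

        boolean sawZero = false;
        boolean sawMax = false;
        double sum = 0;
        int runs = 0;
        for (int run = 0; run < 1000; run++) {
          List<Integer> sample = sampler.sample(data);
          if (sample.isEmpty()) {
            continue;            // with p = 0.02 this essentially never happens
          }
          sum += exactMedian(sample);
          runs++;
          sawZero |= sample.contains(0);
          sawMax |= sample.contains(999);
        }

        // The median of ~20 uniform draws from [0, 1000) is roughly normal
        // around 500 with standard deviation about 1000 / (2 * sqrt(20)) ~ 112,
        // so the mean over 1000 runs should land within a few units of 500.
        System.out.printf("mean of medians = %.1f (expect about 500)%n", sum / runs);
        System.out.println("saw 0: " + sawZero + ", saw 999: " + sawMax);
      }
    }

A broken sampler that silently reorders, truncates, or never reaches the ends
of the input should show up either in the spread of the medians or in the 0/999
check, which is about all the test needs to do given that the two samplers are
equal in quality by design.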