In both cases, every element is picked with probability N/1000. That is the
purest sense in which these processes can be wrong or right, to me, and they
are both exactly as good as the underlying pseudorandom number generator.
The difference is not their quality, but the number of elements that are
chosen.
I am not sure what the distribution the median of the N values should follow
in theory. I doubt it's Gaussian. But that would be your question then 
how likely is it that the 20 observed values are generated by this
distribution?
This test would not prove all aspects of the sampler work. For example, a
sampler that never picked 0 or 999 would have the same result (well, if N>2)
as this one, when clearly it has a problem.
But I think this is probably a more complicated question than you need ask
in practice: what is the phenomenon you are worried will happen or not
happen here?
On Wed, Jun 1, 2011 at 8:31 AM, Lance Norskog <goksron@gmail.com> wrote:
> I'm trying to do a bakeoff between Bernoulli (B) sampling (drop if
> random > percentage) v.s. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
>
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
>
> Does this make sense? In this measurement is small or large deviation
> better? What is another way to measure it?
>
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> Really I used the OnlineSummarizer and did deviations of
> mean/median/25 percentile/75 percentile.
>
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
>
> Barbie says: "designing experiments is hard!"
>
>
> 
> Lance Norskog
> goksron@gmail.com
>
