I'm trying to do a bakeoff between Bernoulli (B) sampling (drop an item if
random() > percentage) vs. Reservoir (R) sampling (maintain a fixed-size pool
of randomly chosen samples).
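For concreteness, here is a sketch of the two samplers as I understand them.
This is Python, the function names are my own, and the reservoir version is
the standard Algorithm R, not necessarily the exact implementation under test:

```python
import random

def bernoulli_sample(stream, fraction, rng=random):
    """Bernoulli sampling: keep each item independently with
    probability `fraction`. Output size is random (binomial)."""
    return [x for x in stream if rng.random() < fraction]

def reservoir_sample(stream, n, rng=random):
    """Reservoir sampling (Algorithm R): a uniform sample of
    exactly n items from a stream of unknown length."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < n:
            reservoir.append(x)       # fill phase: keep the first n items
        else:
            j = rng.randrange(i + 1)  # replace phase: slot chosen uniformly
            if j < n:
                reservoir[j] = x      # item survives with probability n/(i+1)
    return reservoir
```

Note the two phases in the reservoir version: it keeps the first n items
verbatim, then replaces them with decreasing probability.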
Here is a test (simplified for explanatory purposes):
* Create a list of 1000 numbers, 0..999. Permute this list.
* Subsample N values
* Add them and take the median
* Do this 20 times and record the medians
* Calculate the standard deviation of the 20 median values
This last figure is my score for 'how good is the randomness of this sampler'.
Does this make sense? In this measurement, is a small or a large deviation
better? What is another way to measure it?
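The steps above can be sketched as follows. This is Python, `sampler` is a
hypothetical callable standing in for either sampler, and I'm reading "add
them and take the median" as taking the median of the subsampled values:

```python
import random
import statistics

def median_score(sampler, trials=20):
    """Permute 0..999, subsample with `sampler`, take the median of the
    sample; repeat `trials` times and return the stdev of the medians."""
    medians = []
    for _ in range(trials):
        data = list(range(1000))
        random.shuffle(data)
        sample = sampler(data)
        medians.append(statistics.median(sample))
    return statistics.stdev(medians)
```

For example, `median_score(lambda d: random.sample(d, 100))` scores a
plain fixed-size uniform subsample as a baseline.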
Notes: Bernoulli pulls X percent of the samples and ignores the rest.
Reservoir pulls in all of the samples and saves X of them. However, it
saves the first N samples verbatim and only slowly replaces them, which
suppresses the deviation for small samples. This realization came just
now; I'll cut that fill phase from the comparison.
In reality I used the OnlineSummarizer and took deviations of the
mean/median/25th percentile/75th percentile.
I had a more detailed report with numbers, but I just realized that,
given the above, I have to start over.
Barbie says: "designing experiments is hard!"

Lance Norskog
goksron@gmail.com
