You'll probably see more difference if the data were ordered, no? For the
reason you gave below: R samples the entire set and BR samples the first X.

On Jun 1, 2011, at 12:31 AM, Lance Norskog wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) vs. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
>
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0-999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
>
> Does this make sense? In this measurement is small or large deviation
> better? What is another way to measure it?
>
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> Really I used the OnlineSummarizer and did deviations of
> mean/median/25 percentile/75 percentile.
>
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
>
> Barbie says: "designing experiments is hard!"
>
> --
> Lance Norskog
> goksron@gmail.com
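
For reference, the test described in the quoted message could be sketched roughly as below. This is only an illustration in plain Python, not the code actually used (which relied on OnlineSummarizer). The function names, the seed, and the choice N = 100 are all assumptions, and the "add them and take the median" step is read here as "take the median of the sampled values". The reservoir sampler is the textbook Algorithm R, which may differ from the implementation being tested.

```python
import random
import statistics


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]


def reservoir_sample(items, k, rng):
    """Textbook Algorithm R: keep the first k items, then replace at random."""
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def score(sampler, trials=20, size=1000, seed=0):
    """Standard deviation of the sample medians over `trials` runs."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        data = list(range(size))
        rng.shuffle(data)  # permute the list 0..size-1
        sample = sampler(data, rng)
        if sample:  # a Bernoulli sample can in principle be empty
            medians.append(statistics.median(sample))
    return statistics.stdev(medians)


N = 100  # hypothetical sample size
bernoulli_score = score(lambda d, r: bernoulli_sample(d, N / 1000, r))
reservoir_score = score(lambda d, r: reservoir_sample(d, N, r))
print("Bernoulli stdev of medians:", bernoulli_score)
print("Reservoir stdev of medians:", reservoir_score)
```

One thing the sketch makes visible: the Bernoulli sample size varies from run to run around N, while the reservoir always returns exactly N values, so the two scores are not measured on identically sized samples.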