mahout-user mailing list archives

From Hector Yee <>
Subject Re: Measuring randomness
Date Wed, 01 Jun 2011 16:10:00 GMT
You'll probably see more of a difference if the data were ordered, no? For the reason you gave
below about R sampling the entire set and BR sampling the first X.

Sent from my iPad

On Jun 1, 2011, at 12:31 AM, Lance Norskog <> wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) vs. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0-999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
> Does this make sense? In this measurement, is a small or a large
> deviation better? What is another way to measure it?
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> Really I used the OnlineSummarizer and did deviations of
> mean/median/25 percentile/75 percentile.
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
> Barbie says: "designing experiments is hard!"
> -- 
> Lance Norskog
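
The test described in the quoted message could be sketched roughly as follows. This is a minimal illustration in Python, not the Mahout code: it assumes the reservoir sampler is the classic "Algorithm R" variant (the message does not say which one is used), it uses a 10% rate and a reservoir of 100 as arbitrary example sizes, it reshuffles the list on every trial, and it takes the median of each sample directly. The function names bernoulli_sample, reservoir_sample and median_spread are hypothetical.

```python
import random
import statistics

def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]

def reservoir_sample(items, k, rng):
    """Assumed variant: Algorithm R, a uniform sample of k items from a stream."""
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            # The first k items fill the reservoir unconditionally.
            reservoir.append(x)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir

def median_spread(sampler, trials=20, size=1000, seed=0):
    """Standard deviation of the sample medians over `trials` runs."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        data = list(range(size))
        rng.shuffle(data)  # permute the list 0..size-1
        sample = sampler(data, rng)
        if sample:  # a Bernoulli sample can, in principle, be empty
            medians.append(statistics.median(sample))
    return statistics.stdev(medians)

# Example sizes (assumptions): 10% Bernoulli rate, reservoir of 100.
bernoulli_score = median_spread(lambda d, r: bernoulli_sample(d, 0.1, r))
reservoir_score = median_spread(lambda d, r: reservoir_sample(d, 100, r))
print("Bernoulli:", bernoulli_score)
print("Reservoir:", reservoir_score)
```

Skipping the shuffle in median_spread would be one way to try the ordered-data case raised in the reply.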
