# mahout-user mailing list archives

From Hector Yee <hector....@gmail.com>
Subject Re: Measuring randomness
Date Wed, 01 Jun 2011 16:10:00 GMT
```You'd probably see more of a difference if the data were ordered, no? For the reason you gave below
about R sampling the entire set and BR sampling the first X.

On Jun 1, 2011, at 12:31 AM, Lance Norskog <goksron@gmail.com> wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop an item
> if a random draw exceeds the percentage) vs. Reservoir (R) sampling
> (maintain a box of randomly chosen samples).
>
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0-999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
>
> Does this make sense? In this measurement, is a small or a large
> deviation better? What is another way to measure it?
>
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> In practice I used the OnlineSummarizer and computed deviations of the
> mean, median, 25th percentile, and 75th percentile.
>
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
>
> Barbie says: "designing experiments is hard!"
>
>
> --
> Lance Norskog
> goksron@gmail.com

```
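
The test described in the quoted message can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout's samplers or its OnlineSummarizer; the function names (`bernoulli_sample`, `reservoir_sample`, `median_spread`), the 10% rate, and the reservoir size of 100 are all assumptions made for the example. The reservoir sampler here is the textbook Algorithm R.

```python
import random
import statistics


def bernoulli_sample(items, p, rng):
    """Keep each item independently with probability p (sample size varies)."""
    return [x for x in items if rng.random() < p]


def reservoir_sample(items, k, rng):
    """Algorithm R: keep a uniform sample of exactly k items from a stream."""
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(x)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def median_spread(sampler, trials=20, n=1000, seed=0):
    """Standard deviation of the sample median over repeated trials."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        data = list(range(n))
        rng.shuffle(data)  # permute the list 0..n-1
        sample = sampler(data, rng)
        if sample:  # a Bernoulli sample can, in principle, be empty
            medians.append(statistics.median(sample))
    return statistics.stdev(medians)


# Hypothetical settings: 10% Bernoulli rate vs. a reservoir of 100 items.
b_score = median_spread(lambda d, r: bernoulli_sample(d, 0.1, r))
r_score = median_spread(lambda d, r: reservoir_sample(d, 100, r))
print("Bernoulli spread:", b_score)
print("Reservoir spread:", r_score)
```

Both samplers draw uniformly, so on permuted input with comparable sample sizes the two spreads should be similar, and the score mostly reflects sample size. Feeding ordered input (skipping the shuffle) is one way to probe the difference suggested in the reply.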