# mahout-user mailing list archives

##### Site index · List index
Message view
Top
From Sean Owen <sro...@gmail.com>
Subject Re: Measuring randomness
Date Wed, 01 Jun 2011 08:17:05 GMT
```In both cases, every element is picked with probability N/1000. That is the
purest sense in which these processes can be wrong or right, to me, and they
are both exactly as good as the underlying pseudo-random number generator.
The difference is not their quality, but the number of elements that are
chosen.

I am not sure what the distribution the median of the N values should follow
in theory. I doubt it's Gaussian. But that would be your question then --
how likely is it that the 20 observed values are generated by this
distribution?

This test would not prove all aspects of the sampler work. For example, a
sampler that never picked 0 or 999 would have the same result (well, if N>2)
as this one, when clearly it has a problem.

But I think this is probably a more complicated question than you need ask
in practice: what is the phenomenon you are worried will happen or not
happen here?

On Wed, Jun 1, 2011 at 8:31 AM, Lance Norskog <goksron@gmail.com> wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) v.s. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
>
> Here is a test (simplified for explanatory purposes):
> * Create a list of 1000 numbers, 0-999. Permute this list.
> * Subsample N values
> * Add them and take the median
> * Do this 20 times and record the medians
> * Calculate the standard deviation of the 20 median values
> This last is my score for 'how good is the randomness of this sampler'.
>
> Does this make sense? In this measurement is small or large deviation
> better? What is another way to measure it?
>
> Notes: Bernoulli pulls X percent of the samples and ignores the rest.
> Reservoir pulls all of the samples and saves X of them. However, it
> saves the first N samples and slowly replaces them. This suppresses
> the deviation for small samples. This realization came just now; I'll
> cut that phase.
> Really I used the OnlineSummarizer and did deviations of
> mean/median/25 percentile/75 percentile.
>
> I had a more detailed report with numbers, but just realized that
> given the above I have to start over.
>
> Barbie says: "designing experiments is hard!"
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message