mahout-user mailing list archives

From Ted Dunning <>
Subject Re: Measuring randomness
Date Wed, 01 Jun 2011 14:33:32 GMT
On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen <> wrote:

> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.

And how that number is specified.  And whether order is preserved.  And
whether you get samples along the way so that you can overlap computation
with I/O.
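
To make those differences concrete, here is a minimal sketch of two samplers over the indices 0..999. It assumes the two processes being compared are a per-element (Bernoulli) sampler and a fixed-size reservoir sampler; the thread does not show the actual code, and the class and method names below are hypothetical, not Mahout APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the samplers discussed in the thread.
class SamplerSketch {
    // Bernoulli sampling: keep each index independently with probability p.
    // Order is preserved and items are emitted as the stream goes by, but the
    // sample size is random rather than fixed.
    public static List<Integer> bernoulliSample(int population, double p, Random rng) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < population; i++) {
            if (rng.nextDouble() < p) {
                out.add(i);
            }
        }
        return out;
    }

    // Reservoir sampling (Algorithm R): exactly n indices, each included with
    // probability n/population. The result is only final once the whole
    // stream has been seen, and it is not in input order.
    // Assumes n <= population.
    public static int[] reservoirSample(int population, int n, Random rng) {
        int[] reservoir = new int[n];
        for (int i = 0; i < population; i++) {
            if (i < n) {
                reservoir[i] = i;
            } else {
                int j = rng.nextInt(i + 1); // uniform in [0, i]
                if (j < n) {
                    reservoir[j] = i;
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        System.out.println("Bernoulli sample size: " + bernoulliSample(1000, 0.02, rng).size());
        System.out.println("Reservoir sample size: " + reservoirSample(1000, 20, rng).length);
    }
}
```

Both give every element the same inclusion probability (20/1000 here), which is Sean's point. They differ in exactly the ways listed above: how the count is specified, whether order survives, and whether output is available before the input is exhausted.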

> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.

It is asymptotically normal under pretty broad assumptions.  For a normal
underlying distribution, it converges very quickly.  For a wacky underlying
distribution like the Cauchy, less quickly.
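
For reference, the standard textbook form of that limit, stated for i.i.d. draws (an assumption; sampling without replacement from 1000 items only approximates it). The symbols are introduced here for illustration: f is the underlying density, m its median, and \hat{m}_N the median of the N sampled values.

```latex
% Assumes f is continuous and strictly positive at m.
\sqrt{N}\,\bigl(\hat{m}_N - m\bigr) \;\xrightarrow{d}\;
  \mathcal{N}\!\left(0,\ \frac{1}{4\,f(m)^{2}}\right)

% Standard normal: f(0) = 1/\sqrt{2\pi}, so the limiting variance is \pi/2.
% Standard Cauchy: f(0) = 1/\pi,         so the limiting variance is \pi^{2}/4.
```

This gives only the limiting distribution; it says nothing about how fast the convergence is for a given underlying distribution.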

> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?

But this doesn't really answer an important question because the underlying
data was sampled from the same distribution and a variety of defective
samplers would give similar results.

> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if N>2)
> as this one, when clearly it has a problem.

And I think that this sort of thing is the key question.
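
One way to catch that kind of defect directly is to count how often each index is selected over many runs and compare against the expected N/1000 rate. The sketch below is an illustration under stated assumptions, not a Mahout test: the sampler is passed in as a Supplier<int[]>, the "defective" sampler is simulated by drawing only from 1..998, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of Mahout.
class InclusionCheck {
    // Draws n distinct indices uniformly from [lo, hi) with a partial
    // Fisher-Yates shuffle. Assumes n <= hi - lo.
    public static int[] sampleWithoutReplacement(int lo, int hi, int n, Random rng) {
        int size = hi - lo;
        int[] pool = new int[size];
        for (int i = 0; i < size; i++) {
            pool[i] = lo + i;
        }
        for (int i = 0; i < n; i++) {
            int j = i + rng.nextInt(size - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        return Arrays.copyOf(pool, n);
    }

    // Counts how often each index in [0, population) is selected over all trials.
    public static long[] inclusionCounts(Supplier<int[]> sampler, int population, int trials) {
        long[] counts = new long[population];
        for (int t = 0; t < trials; t++) {
            for (int idx : sampler.get()) {
                counts[idx]++;
            }
        }
        return counts;
    }

    // Largest absolute z-score of any index's count, treating each count as
    // Binomial(trials, p). An index that is never picked stands out sharply.
    public static double maxAbsZ(long[] counts, int trials, double p) {
        double mean = trials * p;
        double sd = Math.sqrt(trials * p * (1.0 - p));
        double worst = 0.0;
        for (long c : counts) {
            worst = Math.max(worst, Math.abs(c - mean) / sd);
        }
        return worst;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        int population = 1000, n = 20, trials = 20000;
        double p = (double) n / population;
        long[] good = inclusionCounts(() -> sampleWithoutReplacement(0, 1000, n, rng), population, trials);
        // Simulated defect: indices 0 and 999 can never be chosen.
        long[] bad = inclusionCounts(() -> sampleWithoutReplacement(1, 999, n, rng), population, trials);
        System.out.println("correct sampler,  max |z| = " + maxAbsZ(good, trials, p));
        System.out.println("defective sampler, max |z| = " + maxAbsZ(bad, trials, p));
    }
}
```

With these settings an index that is never picked has a z-score of about 20 in magnitude, far outside anything a correct sampler should produce, while a median-based test would see essentially no difference between the two.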

Make sure that you use sorted data as one test input.  Compute an exact median
of the samples (sort them and take the middle value) rather than using
OnlineSummarizer, because OnlineSummarizer doesn't like ordered data.
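
A full median is cheap at these sample sizes. A minimal sketch, using only the JDK (the class and method names are hypothetical):

```java
import java.util.Arrays;

// Hypothetical helper: exact median by sorting a copy of the samples.
class ExactMedian {
    public static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of empty sample is undefined");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        // Odd length: the middle element. Even length: mean of the two middle elements.
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(median(new double[] {3.0, 1.0, 2.0}));      // prints 2.0
        System.out.println(median(new double[] {4.0, 1.0, 3.0, 2.0})); // prints 2.5
    }
}
```

Because it sorts a copy, the result does not depend on the order in which the samples arrive, so it is safe to use with the sorted test input.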

> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?

Since the samplers are equal in quality by design, the only problem I can
imagine is code error.
