mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: randomseedgenerator
Date Wed, 01 Jul 2009 20:17:25 GMT
On Wed, Jul 1, 2009 at 1:03 PM, Sean Owen <srowen@gmail.com> wrote:

> Sorry I meant Mahout's class by this name.


Oops.  My error, really.


> Just provides a factory method which will provide a seeded or non-seeded
> java.util.Random
> depending on whether it's being run in the context of tests or not.


That sounds fine.
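
A minimal sketch of the kind of factory described above, for illustration only. The class name `RandomFactory`, the `useTestSeed()` switch, and the seed value are all assumptions made for this example and are not taken from Mahout's actual class.

```java
import java.util.Random;

/** Hypothetical factory; the names and the seed value are illustrative assumptions. */
final class RandomFactory {
    // Arbitrary fixed seed used only when tests ask for repeatable output.
    private static final long TEST_SEED = 42L;

    // False by default, so production code gets a non-seeded generator.
    private static volatile boolean testMode = false;

    private RandomFactory() {}

    /** Called from test set-up to make every later generator repeatable. */
    static void useTestSeed() {
        testMode = true;
    }

    /** Restores the default, non-seeded behaviour. */
    static void useRandomSeed() {
        testMode = false;
    }

    /** Returns a seeded generator in test mode and a non-seeded one otherwise. */
    static Random getRandom() {
        return testMode ? new Random(TEST_SEED) : new Random();
    }
}
```

With `useTestSeed()` called in test set-up, every generator handed out starts from the same state, so two runs of the same test see the same numbers.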


> I imagine all PRNGs need to be injected for us to confidently run
> tests deterministically?


Yes.  It is occasionally nice to run tests non-deterministically for
exploratory purposes, but for automated unit tests, things really should be
locked down.
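
To make the injection point concrete, here is a small sketch in which the generator is passed in through the constructor, so a test can lock it down with a fixed seed while exploratory runs pass a non-seeded one. `PointSampler` and its `sample` method are hypothetical names invented for this example; they are not existing Mahout code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical consumer of randomness that never creates its own generator. */
final class PointSampler {
    private final Random random;

    /** The caller decides which generator (seeded or not) is used. */
    PointSampler(Random random) {
        this.random = random;
    }

    /** Returns up to k points chosen at random, without replacement. */
    <T> List<T> sample(List<T> points, int k) {
        List<T> copy = new ArrayList<>(points);   // leave the input untouched
        Collections.shuffle(copy, random);        // all randomness comes from the injected generator
        return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
    }
}
```

A unit test would construct `new PointSampler(new Random(1234L))` and get the same 20 points on every run; an exploratory run would pass `new Random()` instead.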

> Nothing in my part of the code particularly cares about the PRNG being
> hard-core random. Random is good for me.


I don't think we have anything that does just yet, but things like Gibbs
sampling may start to expose the defects of a weak generator.  Selecting 20
points at random is NOT going to be a problem.


> Why does some of the code need the Mersenne twister PRNG... or does it?


I don't necessarily think that Mersenne twister is the only one that we
should consider, but any application that consumes vats of random numbers
and then draws conclusions from them is at risk from weak generators.
Parallel systems are at higher risk.

Consider a program that runs about 10,000 threads (8 cores on 1200 machines,
say).  Then assume that each thread consumes on the order of a hundred
million (roughly 2^26) random numbers.  Java's default generator has 48 bits
or less of state, so its cycle holds at most about 2^22 (2^48 / 2^26)
sequences that do not overlap.  But we are running about 2^13 threads of
control, and by the birthday argument (2^13 squared is 2^26, well above 2^22)
it is almost certain that some threads will be going through exactly the same
sequence of random numbers as other threads.  This may not be fatally bad,
but it is definitely not good.
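
The overlap risk can be estimated with a rough back-of-the-envelope calculation. The sketch below assumes that each thread starts at an independent, uniformly random position on a single cycle of length `period`, and that two threads overlap when their starting positions are within `drawsPerThread` steps of each other. Both are simplifying assumptions, and the class and method names are made up for this example.

```java
/** Rough estimate of stream overlap; a simplified model, not an exact analysis. */
final class OverlapEstimate {
    private OverlapEstimate() {}

    /** Expected number of thread pairs whose streams share part of the cycle. */
    static double expectedOverlappingPairs(double threads, double drawsPerThread, double period) {
        double pairs = threads * (threads - 1) / 2.0;
        // A pair overlaps if one start lies within drawsPerThread steps of the other, on either side.
        double pPair = Math.min(1.0, 2.0 * drawsPerThread / period);
        return pairs * pPair;
    }

    /** Poisson approximation to the chance that at least one pair overlaps. */
    static double probabilityOfAnyOverlap(double threads, double drawsPerThread, double period) {
        return 1.0 - Math.exp(-expectedOverlappingPairs(threads, drawsPerThread, period));
    }

    public static void main(String[] args) {
        double threads = Math.pow(2, 13);
        double draws = Math.pow(2, 26);
        System.out.println("48-bit state:  " + probabilityOfAnyOverlap(threads, draws, Math.pow(2, 48)));
        System.out.println("128-bit state: " + probabilityOfAnyOverlap(threads, draws, Math.pow(2, 128)));
    }
}
```

Under these assumptions, 2^13 threads drawing 2^26 numbers each from a 2^48 cycle give about 16 expected overlapping pairs, so at least one overlap is practically certain. With a cycle of 2^128 (assuming a generator whose period matches 128 bits of state) the expected count is vanishingly small.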

In contrast, if we use a generator with as little as 128 bits of entropy and
seed it with the hash of the task id, machine name and starting millisecond,
then there is essentially no chance of duplicated numbers.
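
One way to derive such a seed is sketched below, assuming SHA-256 as the hash and keeping the first 16 bytes (128 bits) of the digest. The class name `TaskSeed`, the choice of hash, and the way the three fields are joined are assumptions made for this example; which generator then consumes the seed is left open.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper that turns task identity into a 128-bit seed. */
final class TaskSeed {
    private TaskSeed() {}

    /** Hashes task id, machine name and start time, and keeps 16 bytes of the digest. */
    static byte[] seed128(String taskId, String machineName, long startMillis) {
        // Newlines separate the fields so that ("ab", "c") and ("a", "bc") hash differently.
        String material = taskId + '\n' + machineName + '\n' + startMillis;
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(material.getBytes(StandardCharsets.UTF_8));
            return Arrays.copyOf(digest, 16); // first 128 of the 256 digest bits
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The same three inputs always give the same seed, which keeps a task reproducible, while tasks that differ in any field get unrelated seeds.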


> I prefer keeping it standard, then simple, if possible.


I totally agree.


> Any PRNG is cool by me, just want to pick one solution unless there is a
> compelling reason not to. Fine with one solution now, and a different
> one solution later.


We have no compelling reason at this time, but we likely will later.  If our
generator is injectable, we should be reasonably future-proof.
