commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <>
Subject Re: [math] random boolean arrays
Date Mon, 13 Jul 2015 23:13:28 GMT
On 7/13/15 1:19 PM, Otmar Ertl wrote:
> On Mon, Jul 13, 2015 at 9:59 PM, Phil Steitz <> wrote:
>> On 7/13/15 12:55 PM, Phil Steitz wrote:
>>> On 7/13/15 12:16 PM, Otmar Ertl wrote:
>>>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <>
>>>> wrote:
>>>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote:
>>>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote:
>>>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <>
>>>>>>> wrote:
>>>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote:
>>>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote:
>>>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>>>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>>>>>>>>>>> The code implemented in MATH-1242
to improve performance of KS
>>>>>>>>>>>>>> monteCarloP in-lines efficient generation
of random boolean
>>>>>>>>>>>>>> arrays.
>>>>>>>>>>>>>> Unfortunately, I think the implementation
is not quite random
>>>>>>>>>>>>>> (see
>>>>>>>>>>>>>> comments on the ticket). To verify
it, we need to be able to test
>>>>>>>>>>>>>> the random boolean array generation
directly. To do that, we have
>>>>>>>>>>>>>> to either expose the method (at least
as protected) in the KS
>>>>>>>>>>>>>> class
>>>>>>>>>>>>>> or add it somewhere else. I propose
the latter but am not sure
>>>>>>>>>>>>>> where to put it. For speed, we need
to avoid array copies, so the
>>>>>>>>>>>>>> API will have to be something like
randomize(boolean[], nTrue).
>>>>>>>>>>>>>> It
>>>>>>>>>>>>>> could go in the swelling MathArrays
class, or
>>>> RandomDataGenerator.
>>>>>>>>>>>>>> The latter probably makes more sense,
but the API does not fit
>>>> too
>>>>>>>>>>>>>> well. Any ideas?
>>>>>>>>>>>>> If it is just for testing purposes, you
can also access the method
>>>>>>>>>>>>> in
>>>>>>>>>>>>> question via reflection, see an example
>>>>>>>>>>>> Do you think it *should* be a private method
of the K-S class?
>>>>>>>>>>> Right now, I do not see much uses outside the
class, but if we
>>>> decide
>>>>>>>>>>> to
>>>>>>>>>>> make it public then I would prefer a special
util class in the
>>>> random
>>>>>>>>>>> package to avoid cluttering the MathArrays class.
>>>>>>>>>> OK, for now we can make it private and use the reflection
>>>>>>>>>> above to test it.
>>>>>>>>> ok, but I guess it is also fine to make it package private
as sebb
>>>>>>>>> suggested. We did something similar recently for some
of the improved
>>>>>>>>> sampling methods provided by Otmar.
>>>>>>>>>>> Regarding the RandomDataGenerator: I think this
class should be
>>>>>>>>>>> deprecated and replaced by a Sampler interface
as proposed by
>>>> Gilles.
>>>>>>>>>> Please consider keeping this class. Consider this
a user request.
>>>>>>>>>> I have quite a few applications that use this class
for two reasons:
>>>>>>>>> ok, the reason why I thought the class should be deprecated
is because
>>>>>>>>> it was not kept up-to-date with all the new discrete
and continuous
>>>>>>>>> distributions that we added in the last 2-3 years. If
you think it is
>>>>>>>>> useful, then we can keep it of course.
>>>>>>>>>> 1. One object instance tied to one PRNG that generates
data from
>>>>>>>>>> multiple different distributions. This is convenient.
Sure, I
>>>>>>>>>> could refactor all of these apps to instantiate new
objects for each
>>>>>>>>>> type of generated data and hopefully still be able
to peg them to
>>>>>>>>>> one PRNG; but that is needless work that also complicates
the code.
>>>>>>>>>> 2. There are quite a few methods in this class that
have nothing to
>>>>>>>>>> do with sampling (nextPermutation, nextHexString,
>>>>>>>>>> etc) but which conveniently share the RandomGenerator.
I guess the
>>>>>>>>>> utility methods get moved out somewhere else. Again,
I end up
>>>>>>>>>> having to refactor all of my code that uses them
and when I want
>>>>>>>>>> simulations to be based on a single PRNG, I have
to find a way to
>>>>>>>>>> pass the RandomGenerator around to them.
>>>>>>>>>> I don't yet see the need to refactor the sampling
support in the
>>>>>>>>>> distributions package; but as my own apps are not
impacted by this,
>>>>>>>>>> if everyone else sees the user impact of the refactoring
>>>>>>>>>> outweighed by the benefit, I won't stand in the way.
Please lets
>>>>>>>>>> just keep the RandomDataGenerator convenience class
in the random
>>>>>>>>>> package in any case. I will take care of whatever
adjustments are
>>>>>>>>>> needed to adapt to whatever we settle on for sampling
in the
>>>>>>>>>> distributions package.
>>>>>>>>> Well, it is not really necessary to do everything together
>>>> refactor
>>>>>>>>> the distributions.
>>>>>>>>> Probably it is better to start the other way round, and
describe what
>>>> I
>>>>>>>>> want to add, and see how other things fit in:
>>>>>>>>> * I want a generic Sampler interface, i.e. something
like this:
>>>>>>>>> ** nextSample()
>>>>>>>>> ** nextSamples(int size)
>>>>>>>>> ** nextSamples(double[] samples)
>>>>>>>> +1
>>>> Do we really need all those 3 methods? If the functionality provided by the
>>>> two latter methods is essential for convenience reasons, we could also
>>>> offer utility functions that are able to fill an array with random numbers
>>>> from a given sampler, e.g. MathArrays.fill(array, sampler) or
>>>> array=MathArrays.generate(sampler, size).
>>> In some cases, there may be distribution-specific algorithms for
>>> generating sequences of values that are more efficient than just
>>> calling sample() repeatedly.  So while providing a default impl in
>>> an abstract sampler that just calls sample() repeatedly would make
>>> sense; it might also make sense to allow the distribution to
>>> override with something more efficient.   An example (not exposed
>>> now) is sampling from a Gaussian distribution.  You can see that
>>> under the covers, the BitStreamGenerator generates these in pairs.
> The simultaneous generation of multiple random values, as done by some
> algorithms (I only know the Box-Muller method), does not justify the
> extra interface methods. With negligible overhead the extra values can
> be cached and reused for subsequent calls. I prefer to keep interfaces
> as simple as possible.

+1 if its actually an interface we are talking about.  I was
thinking samplers would be abstract classes and in that case default
implementations could be provided for the array methods. 

> Otmar
>>>>>>>>> there could be a DiscreteSampler and ContinuousSampler
interface to
>>>>>>>>> handle the cases for int / double values.
>>>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler,
>>>>>>>> accomodate
>>>>>>>> future needs (LongSampler, BooleanSampler (?)).
>>>>>>> I would prefer consistent names for distributions and corresponding
>>>>>>> samplers. The support of distributions must be of the same data
>>>>>>> as the return values of the corresponding sampler. Therefore,
I would
>>>>>>> call the samplers for RealDistribution and IntegerDistribution
>>>>>>> RealSampler and IntegerSampler, respectively.
>>>>> Ideally yes. But there is always the lingering question of what "Real"
>>>>> means: the mathematical abstraction or the numerical representation?
>>>>> The same "real" distributions could be implemented with "float"s.
>>>>> If we ever need implementing a discrete distribution that is able to
>>>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly
>>>>> refer to the numerical type, not the mathematical abstraction of an
>>>>> "integer" number...
>>>>> Gilles
>>>> The underlying data type of RealDistribution implementations is implicitly
>>>> defined to be "double". It would not make sense that methods such as
>>>> getSupportLowerBound() return a different type than the associated sample
>>>> method. Therefore, in the context of RealDistribution "real" means "double"
>>>> at the moment. However, I am open for renaming RealDistribution as
>>>> DoubleDistribution with a sampler named DoubleSampler. If a float-based
>>>> implementation is really ever needed, interfaces named FloatDistribution
>>>> and FloatSampler could be introduced.
>>> Let's not change the names just to change them.  There is discussion
>>> in the archives that led us to call what is traditionally referred
>>> to as continuous, "Real" - mostly because we support some not
>>> absolutely continuous distribution functions among the non-discrete
>>> distributions.  We call discrete "Integer" because we force all
>>> discrete domains to map into the integers.  In the 10+ years we have
>>> offered both, we have never had a request to either support floats
>>> as representations of reals or any discrete distribution that
>>> requires the full Long value set.
>> More precisely the full Long value set as actual parameters to
>> distribution functions.  We already support distributions with
>> infinite value sets.
>> Phil
>>> Phil
>>>>>> +1
>>>>>> Phil
>>>>>>>>> The distributions could either be changed to return such
a sampler as
>>>>>>>>> Gilles proposed (with the advantage that no random instance
is tied to
>>>>>>>>> the distribution itself), or implement the interface
directly (with
>>>> the
>>>>>>>>> advantage that we would not need to refactor too much).
>>>>>>>> Unless I'm missing something, the refactoring would be fairly
the same:
>>>>>>>> The latter case needs implementing 3 methods (2 new ones,
one with a
>>>>>>>> name
>>>>>>>> change).
>>>>>>>> The former needs implementing the factory method proposed
in MATH-1158,
>>>>>>>> plus the same methods as above (wrapped in the object returned
by the
>>>>>>>> factory method).
>>>>>>>> Gilles
>>>>>>>>>>> One can then create a sampler for any distribution
or from other
>>>>>>>>>>> sources, e.g. when needing a fast and efficient
sampler without
>>>>>>>>>>> replacement (see MATH-1239).
>>>>>>>>>> +1 for sequential sampling. I don't follow exactly
why that
>>>>>>>>>> requires refactoring the distributions; but if it
helps in a way I
>>>>>>>>>> don't yet understand, that will help convince me
that refactoring
>>>>>>>>>> sampling in the distributions package is worth the
user pain.
>>>>>>>>> as I said above, I wanted to combine two things in one
step, maybe it
>>>>>>>>> is
>>>>>>>>> better to go step by step.
>>>>>>>>> Thomas
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail:
>>>>>>>> For additional commands, e-mail:
>>>>>>> Otmar
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail:
>>>>>>> For additional commands, e-mail:
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail:
>>>>>> For additional commands, e-mail:
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
>>>> Otmar
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message