commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles <>
Subject Re: [math] random boolean arrays
Date Mon, 13 Jul 2015 13:51:36 GMT
On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote:
> On 7/12/15 9:45 PM, Otmar Ertl wrote:
>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles 
>> <> wrote:
>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote:
>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote:
>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>>>>>> The code implemented in MATH-1242 to improve performance
>>>>>>>>> KS
>>>>>>>>> monteCarloP in-lines efficient generation of random boolean

>>>>>>>>> arrays.
>>>>>>>>>   Unfortunately, I think the implementation is not quite

>>>>>>>>> random (see
>>>>>>>>> comments on the ticket).  To verify it, we need to be
able to 
>>>>>>>>> test
>>>>>>>>> the random boolean array generation directly.  To do
that, we 
>>>>>>>>> have
>>>>>>>>> to either expose the method (at least as protected) in
the KS 
>>>>>>>>> class
>>>>>>>>> or add it somewhere else.  I propose the latter but am
>>>>>>>>> sure
>>>>>>>>> where to put it.  For speed, we need to avoid array copies,

>>>>>>>>> so the
>>>>>>>>> API will have to be something like randomize(boolean[],

>>>>>>>>> nTrue).  It
>>>>>>>>> could go in the swelling MathArrays class, or 
>>>>>>>>> RandomDataGenerator.
>>>>>>>>> The latter probably makes more sense, but the API does
>>>>>>>>> fit too
>>>>>>>>> well.  Any ideas?
>>>>>>>> If it is just for testing purposes, you can also access the

>>>>>>>> method in
>>>>>>>> question via reflection, see an example here:
>>>>>>> Do you think it *should* be a private method of the K-S class?
>>>>>> Right now, I do not see much uses outside the class, but if we 
>>>>>> decide to
>>>>>> make it public then I would prefer a special util class in the 
>>>>>> random
>>>>>> package to avoid cluttering the MathArrays class.
>>>>> OK, for now we can make it private and use the reflection method
>>>>> above to test it.
>>>> ok, but I guess it is also fine to make it package private as sebb
>>>> suggested. We did something similar recently for some of the 
>>>> improved
>>>> sampling methods provided by Otmar.
>>>>>> Regarding the RandomDataGenerator: I think this class should be
>>>>>> deprecated and replaced by a Sampler interface as proposed by 
>>>>>> Gilles.
>>>>> Please consider keeping this class.  Consider this a user 
>>>>> request.
>>>>> I have quite a few applications that use this class for two 
>>>>> reasons:
>>>> ok, the reason why I thought the class should be deprecated is 
>>>> because
>>>> it was not kept up-to-date with all the new discrete and 
>>>> continuous
>>>> distributions that we added in the last 2-3 years. If you think it 
>>>> is
>>>> useful, then we can keep it of course.
>>>>> 1.  One object instance tied to one PRNG that generates data from
>>>>> multiple different distributions.   This is convenient.   Sure, I
>>>>> could refactor all of these apps to instantiate new objects for 
>>>>> each
>>>>> type of generated data and hopefully still be able to peg them to
>>>>> one PRNG; but that is needless work that also complicates the 
>>>>> code.
>>>>> 2.  There are quite a few methods in this class that have nothing 
>>>>> to
>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx,
>>>>> etc) but which conveniently share the RandomGenerator.  I guess 
>>>>> the
>>>>> utility methods get moved out somewhere else.  Again, I end up
>>>>> having to refactor all of my code that uses them and when I want
>>>>> simulations to be based on a single PRNG, I have to find a way to
>>>>> pass the RandomGenerator around to them.
>>>>> I don't yet see the need to refactor the sampling support in the
>>>>> distributions package; but as my own apps are not impacted by 
>>>>> this,
>>>>> if everyone else sees the user impact of the refactoring as
>>>>> outweighed by the benefit, I won't stand in the way.   Please 
>>>>> lets
>>>>> just keep the RandomDataGenerator convenience class in the random
>>>>> package in any case.  I will take care of whatever adjustments 
>>>>> are
>>>>> needed to adapt to whatever we settle on for sampling in the
>>>>> distributions package.
>>>> Well, it is not really necessary to do everything together and 
>>>> refactor
>>>> the distributions.
>>>> Probably it is better to start the other way round, and describe 
>>>> what I
>>>> want to add, and see how other things fit in:
>>>>  * I want a generic Sampler interface, i.e. something like this:
>>>>  ** nextSample()
>>>>  ** nextSamples(int size)
>>>>  ** nextSamples(double[] samples)
>>> +1
>>>> there could be a DiscreteSampler and ContinuousSampler interface 
>>>> to
>>>> handle the cases for int / double values.
>>> Perhaps the name should be IntegerSampler and DoubleSampler, to 
>>> accomodate
>>> future needs (LongSampler, BooleanSampler (?)).
>> I would prefer consistent names for distributions and corresponding
>> samplers. The support of distributions must be of the same data type
>> as the return values of the corresponding sampler. Therefore, I 
>> would
>> call the samplers for RealDistribution and IntegerDistribution
>> RealSampler and IntegerSampler, respectively.

Ideally yes.  But there is always the lingering question of what "Real"
means: the mathematical abstraction or the numerical representation?
The same "real" distributions could be implemented with "float"s.
If we ever need implementing a discrete distribution that is able to
use the "long" range, the "Integer" in "IntegerSampler" will clearly
refer to the numerical type, not the mathematical abstraction of an
"integer" number...


> +1
> Phil
>>>> The distributions could either be changed to return such a sampler 
>>>> as
>>>> Gilles proposed (with the advantage that no random instance is 
>>>> tied to
>>>> the distribution itself), or implement the interface directly 
>>>> (with the
>>>> advantage that we would not need to refactor too much).
>>> Unless I'm missing something, the refactoring would be fairly the 
>>> same:
>>> The latter case needs implementing 3 methods (2 new ones, one with 
>>> a name
>>> change).
>>> The former needs implementing the factory method proposed in 
>>> MATH-1158,
>>> plus the same methods as above (wrapped in the object returned by 
>>> the
>>> factory method).
>>> Gilles
>>>>>> One can then create a sampler for any distribution or from other
>>>>>> sources, e.g. when needing a fast and efficient sampler without
>>>>>> replacement (see MATH-1239).
>>>>> +1 for sequential sampling.  I don't follow exactly why that
>>>>> requires refactoring the distributions; but if it helps in a way 
>>>>> I
>>>>> don't yet understand, that will help convince me that refactoring
>>>>> sampling in the distributions package is worth the user pain.
>>>> as I said above, I wanted to combine two things in one step, maybe 
>>>> it is
>>>> better to go step by step.
>>>> Thomas
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> Otmar
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message