commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Neidhart (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MATH-1220) More efficient sample() method for ZipfDistribution
Date Thu, 30 Apr 2015 05:38:06 GMT

    [ https://issues.apache.org/jira/browse/MATH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520897#comment-14520897
] 

Thomas Neidhart edited comment on MATH-1220 at 4/30/15 5:37 AM:
----------------------------------------------------------------

{quote}
Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to
positive values.
{quote}

I am not sure if we should do this. Normally in literature and implementations, the exponent
is even further restricted (exp > 1), and if you look at the pmf and cdf functions for
exponents < 1 you can see that the resulting distribution is quite distorted. I guess the
typical implementation uses the Riemann zeta function which does not converge for exponents
< 1.

btw. there is something in your implementation that I do not understand:

{code}
     * An instrumental distribution g(k) is used to generate random values by rejection sampling.
     * g(k) is defined as g(1):= 1 and g(k) := I(-s,k-1/2,k+1/2) for k larger than 1,
     * where s denotes the exponent of the Zipf distribution and I(r,a,b) is the integral
of x^r for x from a to b.
     * Since 1^x^s is a convex function, Jensens's inequality gives
     * I(-s,k-1/2,k+1/2) >= 1/k^s for all positive k and non-negative s.
     * In order to limit the rejection rate for large exponents s,
     * the instrumental distribution weight is differently defined for value 1.
     */
    @Override
    public int sample() {
        if (Double.isNaN(instrumentalDistributionTailWeight)) {
            instrumentalDistributionTailWeight = integratePowerFunction(-exponent, 1.5, numberOfElements+0.5);
        }
{code}

talks about integrating the power function in the range [k-0.5, k+0.5] but in fact uses 1.5
and k+0.5 as bounds. Where does the 1.5 come from?


was (Author: tn):
{comment}
Furthermore, I would allow the exponent to be non-negative. Currently, it is restricted to
positive values.
{comment}

I am not sure if we should do this. Normally in literature and implementations, the exponent
is even further restricted (exp > 1), and if you look at the pmf and cdf functions for
exponents < 1 you can see that the resulting distribution is quite distorted. I guess the
typical implementation uses the Riemann zeta function which does not converge for exponents
< 1.

btw. there is something in your implementation that I do not understand:

{code}
     * An instrumental distribution g(k) is used to generate random values by rejection sampling.
     * g(k) is defined as g(1):= 1 and g(k) := I(-s,k-1/2,k+1/2) for k larger than 1,
     * where s denotes the exponent of the Zipf distribution and I(r,a,b) is the integral
of x^r for x from a to b.
     * Since 1^x^s is a convex function, Jensens's inequality gives
     * I(-s,k-1/2,k+1/2) >= 1/k^s for all positive k and non-negative s.
     * In order to limit the rejection rate for large exponents s,
     * the instrumental distribution weight is differently defined for value 1.
     */
    @Override
    public int sample() {
        if (Double.isNaN(instrumentalDistributionTailWeight)) {
            instrumentalDistributionTailWeight = integratePowerFunction(-exponent, 1.5, numberOfElements+0.5);
        }
{code}

talks about integrating the power function in the range [k-0.5, k+0.5] but in fact uses 1.5
and k+0.5 as bounds. Where does the 1.5 come from?

> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>
>                 Key: MATH-1220
>                 URL: https://issues.apache.org/jira/browse/MATH-1220
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Otmar Ertl
>         Attachments: patch_v1
>
>
> Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated
by inverting the CDF. However, the current implementation uses O(N) power function evaluations
to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.)
I propose to use rejection sampling instead, which allows the generation of a single random
value in constant time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message