commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Otmar Ertl (JIRA)" <>
Subject [jira] [Commented] (MATH-1220) More efficient sample() method for ZipfDistribution
Date Mon, 27 Apr 2015 15:10:40 GMT


Otmar Ertl commented on MATH-1220:

Do you mean using a pre-calculated CDF tanle and sampling using binary search? This would
have O(N) initialization costs, O(N) space costs, and sampling would require O(log(N)) time.
All that is constant for the method I have proposed. If initialization and memory costs are
not an issue, sampling using such a pre-calculated table for the CDF is probably faster for
smaller N. 

The proposed rejection sampling algorithm divides the support into two parts. The head which
currently only consists of the first point (equal to 1) and the tail which consists of all
remaining points. Rejection sampling is only used to sample points from the tail. Instead,
it would be possible to take the first X (<=N) points as the head and to use such a pre-calculated
CDF table in combination with binary serach for sampling points from the head. For the tail
still the rejection method can be applied. When developing that algorithm I had that in mind,
but finally decided to set X = 1 to minimize the initialization costs. The optimal value for
X is likely to be use case dependent and difficult to determine. I do not think that a pure
CDF table based approach (X=N) is feasible for all N, because initialization and memory costs
could get very large.

> More efficient sample() method for ZipfDistribution
> ---------------------------------------------------
>                 Key: MATH-1220
>                 URL:
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Otmar Ertl
>         Attachments: patch_v1
> Currently, sampling from a ZipfDistribution is very inefficient. Random values are generated
by inverting the CDF. However, the current implementation uses O(N) power function evaluations
to calculate the CDF for some point. (Here N is the number of points of the Zipf distribution.)
I propose to use rejection sampling instead, which allows the generation of a single random
value in constant time.

This message was sent by Atlassian JIRA

View raw message