datafu-dev mailing list archives

From "Xiangrui Meng (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-21) Probability weighted sampling without reservoir
Date Wed, 09 Apr 2014 02:50:14 GMT
```
[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963731#comment-13963731
]

Xiangrui Meng commented on DATAFU-21:
-------------------------------------

Jian, the main problem is not solving the equation but the size of the equation. The left-hand
side of the equation is monotone, so the simplest way to solve it is bisection. However,
the number of weights equals the number of records, which may be very large. This
is why I said in my previous comment that you need to discretize the weights and compress
the data. For example, assume that 10000 weights in the same partition fall in [0.6, 0.600001].
You can treat all of them as 0.6 and remember the count 10000. This compresses the data from
10000 values to 2 (the weight and its count), so you can solve the equation on a single reducer.

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue tracks the investigation into a weighted sampler that does not use an
> internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection sampling
> algorithm for probability random sampling. The weighted sampler could reuse the simple
> random sample with a slight modification.
> One such modification: the present simple random sample generates a uniform random
> number in (0, 1) as the random variable used to accept or reject an item. The weighted
> sampler could instead generate this random variable based on the item's weight, such that
> it still lies in (0, 1) and each item's random variable remains independent of the
> others.
> The correctness of this solution, and how to implement it efficiently, need further
> thought and experimentation.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

```
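The comment's suggestion (discretize the weights, keep one count per bucket, then solve a monotone equation by bisection) can be sketched as follows. The equation itself is not stated in this message, so the form used here, sum over buckets of count * min(1, lam * weight) = target, is only an assumed stand-in for a monotone left-hand side. The names compress_weights, expected_size, solve_scale, and bin_width are hypothetical and are not part of DataFu.

```python
from collections import Counter


def compress_weights(weights, bin_width=1e-6):
    """Round each weight to the nearest multiple of bin_width and count
    how many weights land in each bucket. Returns (weight, count) pairs."""
    counts = Counter(round(w / bin_width) for w in weights)
    return [(b * bin_width, c) for b, c in sorted(counts.items())]


def expected_size(compressed, lam):
    """Assumed monotone left-hand side: expected sample size when each
    item is kept with probability min(1, lam * weight)."""
    return sum(c * min(1.0, lam * w) for w, c in compressed)


def solve_scale(compressed, target, tol=1e-9, max_iter=200):
    """Bisection on lam so that expected_size(compressed, lam) ~= target."""
    if any(w <= 0 for w, _ in compressed):
        raise ValueError("this sketch assumes strictly positive weights")
    total = sum(c for _, c in compressed)
    if not 0 < target <= total:
        raise ValueError("target must lie in (0, total count]")
    lo = 0.0                                   # expected_size is 0 here
    hi = 1.0 / min(w for w, _ in compressed)   # expected_size is total here
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if expected_size(compressed, mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return hi


if __name__ == "__main__":
    # 10000 nearly identical weights collapse into one (weight, count) pair.
    weights = [0.6 + i * 1e-8 for i in range(10000)]
    compressed = compress_weights(weights, bin_width=0.001)
    print(len(compressed), "bucket(s)")
    print("lam =", solve_scale(compressed, target=100))
```

Each bisection step costs one pass over the buckets rather than over the records, which is what would let the solve fit on a single reducer. The bucket width trades accuracy for size and would need to be chosen for the actual weight distribution.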