datafu-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "jian wang (JIRA)" <>
Subject [jira] [Commented] (DATAFU-21) Probability weighted sampling without reservoir
Date Sat, 19 Apr 2014 09:55:14 GMT


jian wang commented on DATAFU-21:

Hi,sorry for the late response. 

So you mean we put the input w1, w2, ....wN into different bins and compress the original
input into {<average weight in the bin, number of elements in the bin>}, right?  There
are some questions I would like to clarify:

      (1) Do we need a separate map-reduce job to calculate the q1 and q2 and use the q1 and
q2 as parameters to our UDF in subsequent jobs, or we follow the SimpleRandomSample to try
to calculate q1, q2 on the fly to see if we could accept or reject items as they stream in?
I do not quite catch what you mean by solving equation in a single reducer or on a single
node. This way could we suffer from scalability problem?

      (2) If we would like to discretize the weights, we may need to consider the width of
bin and the level of accuracy we would like to achieve. Do we need to expose parameter for
users to control these, though as a customer, I do not quite like this, or we leverage some
rules, such as defined in

> Probability weighted sampling without reservoir
> -----------------------------------------------
>                 Key: DATAFU-21
>                 URL:
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
> This issue is used to track investigation on finding a weighted sampler without using
internal reservoir. 
> At present, the SimpleRandomSample has implemented a good acceptance-rejection sampling
algo on probability random sampling. The weighted sampler could utilize the simple random
sample with slight modification.
> One slight modification is:  the present simple random sample generates a uniform random
number lies between (0, 1) as the random variable to accept or reject an item. The weighted
sample may generate this random variable based on the item's weight and this random number
still lies between (0, 1) and each item's random variable remain independent between each
> Need further think and experiment the correctness of this solution and how to implement
it in an effective way.

This message was sent by Atlassian JIRA

View raw message