# datafu-dev mailing list archives

##### Site index · List index
Message view
Top
From "jian wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DATAFU-21) Probability weighted sampling without reservoir
Date Tue, 22 Apr 2014 01:25:15 GMT
```
[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944325#comment-13944325
]

jian wang edited comment on DATAFU-21 at 4/22/14 1:23 AM:
----------------------------------------------------------

Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf,
I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random variable
between (0,1). Then follow the thought of Random Sort, we sort the items in ascending order
based on X(j) and select the smallest k = p * n items.

Also as simple random sampling algorithm, we could also consider the possibility of rejecting
items applying Maurer's lemma and accepting items applying Bernstein's lemma.

Apply Maurer's lemma:

we would like to find 0<q1<1, so that we reject items whose  key is greater than q1.

let Y(j) = 1 if (X(j) < q1)
=  0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )

set Y = sum(Y(j), j = 1 to n),

Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1

apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) * n - Q1,
since t > 0, Q1 < (1 - p) * n.  Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err)
)    (1)

we could get q1 by solving (1)

Apply Berstein's lemma:

similar to applying Maurer's lemma, we could get a q2 so that we could accept item whose key
is smaller than q2, 0 <= q2 <= 1.

let  Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) *  0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )

set Z = sum(Z(j),  j = 1 to n)

Q2 = sum( pow(1 - q2, w(j) ) )

E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2

apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -  (1 - p) *
n, I get
abs(Q2  - n * (1 - p) - 2 / 3 * log(err) ) <=  2 / 3 * sqrt( log(err) * log(err) -
9 * n * p / 2 * log(err) )     (2)

we could get q2 by solving (2)

Questions:

(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would like to seek

Some thoughts on how to resolve this, eg: solve (1)

abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) )  = F(n,
p, err)     (1)

Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n),  0 < q1 < 1

Q1' = sum( -1 * w(j) * (1 - q1) ^ (w(j) - 1) ), 0 < q1 < 1,  Q1' means the 1st derivative
of Q1.

Remove the less-than inequality of (1), our target is to get an approximate q1 that makes
abs( Q1 - (1 - p) * n - log(err) )  close to F(n, p, err), we name it as q1_t.

Set function f(q1) = abs( Q1 - (1 - p) * n - log(err) )  - F(n, p, err). Use Newton–Raphson
method to get q1_t so that f(q1_t) is close to 0.

was (Author: king821221):

Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf,
I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random variable
between (0,1). Then follow the thought of Random Sort, we sort the items in ascending order
based on X(j) and select the smallest k = p * n items.

Also as simple random sampling algorithm, we could also consider the possibility of rejecting
items applying Maurer's lemma and accepting items applying Bernstein's lemma.

Apply Maurer's lemma:

we would like to find 0<q1<1, so that we reject items whose  key is greater than q1.

let Y(j) = 1 if (X(j) < q1)
=  0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )

set Y = sum(Y(j), j = 1 to n),

Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1

apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) * n - Q1,
since t > 0, Q1 < (1 - p) * n.  Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err)
)    (1)

we could get q1 by solving (1)

Apply Berstein's lemma:

similar to applying Maurer's lemma, we could get a q2 so that we could accept item whose key
is smaller than q2, 0 <= q2 <= 1.

let  Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) *  0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )

set Z = sum(Z(j),  j = 1 to n)

Q2 = sum( pow(1 - q2, w(j) ) )

E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2

apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 -  (1 - p) *
n, I get
Q2  >= n * (1 - p) + 2 / 3 * log(err)  + 2 / 3 * sqrt( log(err) * (log(err) - 9 * n
* p / 2 ) )     (2)

we could get q2 by solving (2)

Questions:

(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would like to seek

Some thoughts on how to resolve this, eg: solve (1)

abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) )  = F(n,
p, err)     (1)

Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n),  0 < q1 < 1

Q1' = sum( -1 * w(j) * (1 - q1) ^ (w(j) - 1) ), 0 < q1 < 1,  Q1' means the 1st derivative
of Q1.

Remove the less-than inequality of (1), our target is to get an approximate q1 that makes
abs( Q1 - (1 - p) * n - log(err) )  close to F(n, p, err), we name it as q1_t.

Set function f(q1) = abs( Q1 - (1 - p) * n - log(err) )  - F(n, p, err). Use Newton–Raphson
method to get q1_t so that f(q1_t) is close to 0.

f'(q1) = Q1', f'(q1) is the 1st derivative of f(q1).

> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler without using
internal reservoir.
> At present, the SimpleRandomSample has implemented a good acceptance-rejection sampling
algo on probability random sampling. The weighted sampler could utilize the simple random
sample with slight modification.
> One slight modification is:  the present simple random sample generates a uniform random
number lies between (0, 1) as the random variable to accept or reject an item. The weighted
sample may generate this random variable based on the item's weight and this random number
still lies between (0, 1) and each item's random variable remain independent between each
other.
> Need further think and experiment the correctness of this solution and how to implement
it in an effective way.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

```
Mime
View raw message