[ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944325#comment-13944325 ]
jian wang edited comment on DATAFU-21 at 4/6/14 12:27 PM:
----------------------------------------------------------
Some investigation updates:
Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random variable between (0,1). Then follow the thought of Random Sort, we sort the items in ascending order based on X(j) and select the smallest k = p * n items.
Also as simple random sampling algorithm, we could also consider the possibility of rejecting items applying Maurer's lemma and accepting items applying Bernstein's lemma.
Apply Maurer's lemma:
we would like to find 0= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )
E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )
set Y = sum(Y(j), j = 1 to n),
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)
E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1
apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) * n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) (1)
we could get q1 by solving (1)
Apply Berstein's lemma:
similar to applying Maurer's lemma, we could get a q2 so that we could accept item whose key is smaller than q2, 0 <= q2 <= 1.
let Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2
{Z(j), j = 1 to n} are independent random variables.
E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )
E(Z(j) ^ 2) = E(Z(j))
Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M
theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )
set Z = sum(Z(j), j = 1 to n)
Q2 = sum( pow(1 - q2, w(j) ) )
E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2
apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 - (1 - p) * n, I get
Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err) - 9 * n * p / 2 ) ) (2)
we could get q2 by solving (2)
Questions:
(1) Please help comment on the above approach. Do they overall make sense?
(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would like to seek some advice on it.
Some thoughts on how to resolve this, eg: solve (1)
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) = F(n, p, err) (1)
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n), 0 < q1 < 1
Remove the less-than inequality of (1), our target is to get an approximate q1 that makes abs( Q1 - (1 - p) * n - log(err) ) close to F(n, p, err), we name it as q1_t.
was (Author: king821221):
Some investigation updates:
Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), U is a random variable between (0,1). Then follow the thought of Random Sort, we sort the items in ascending order based on X(j) and select the smallest k = p * n items.
Also as simple random sampling algorithm, we could also consider the possibility of rejecting items applying Maurer's lemma and accepting items applying Bernstein's lemma.
Apply Maurer's lemma:
we would like to find 0= q1) * 0
= Pr(1 - pow( U, 1/w(j) ) < q1)
= Pr(1 - q1 < pow( U, 1/w(j) ))
= Pr(pow(1 - q1, w(j) ) < U) = 1 - pow( 1 - q1, w(j) )
E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j) )
set Y = sum(Y(j), j = 1 to n),
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n)
E(Y) = sum(E(Y(j))) = n - sum( pow( 1-q1, w(j) ) ) = n - Q1
apply Maurer's lemma with t = (1 - p) * n - sum( pow(1 - q1, w(j) ) ) = (1 - p) * n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) )
Further assume Q1 < (1 - p) * n + log(err), which also satisfies t > 0, get
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) (1)
we could get q1 by solving (1)
Apply Berstein's lemma:
similar to applying Maurer's lemma, we could get a q2 so that we could accept item whose key is smaller than q2, 0 <= q2 <= 1.
let Z(j) = 1 if X(j) < q2,
= 0 if X(j) >= q2
{Z(j), j = 1 to n} are independent random variables.
E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
= Pr(1 - pow( U, 1/w(j) ) < q2)
= 1 - pow(1 - q2, w(j) )
E(Z(j) ^ 2) = E(Z(j))
Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j) ) <= 1 = M
theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j) )
set Z = sum(Z(j), j = 1 to n)
Q2 = sum( pow(1 - q2, w(j) ) )
E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum( pow(1 - q2, w(j) ) ) = n - Q2
apply Berstein's lemma with t = sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = Q2 - (1 - p) * n, I get
Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err) - 9 * n * p / 2 ) ) (2)
we could get q2 by solving (2)
Questions:
(1) Please help comment on the above approach. Do they overall make sense?
(2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Would like to seek some advice on it.
Some thoughts on how to resolve this, eg: solve (1)
Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) = F(n, p, err) (1)
Q1 = sum( pow( 1-q1, w(j) ) , j = 1 to n), 0 < q1 < 1
Remove the less-than inequality of (1), our target is to get an approximate q1 that makes Q1 close to F(n, p, err), we name it as q1_t.
we could observe that:
(1) the value of Q1 decreases with the increase of q1.
(2) F(n, p, err) >= Q1 >= sum( pow( 1-q1, wmax ) , j = 1 to n) = n * pow( 1 - q1, wmax), wmax is MAX( w(j), j = 1 to n ), we could get a lower bound of q1, q1 >= 1 - pow( F(n, p, err) / n, 1/wmax), this lower bound decreases when wmax and n increases. Then we start from the lower bound and try Newton–Raphson method to approach a better q1 that makes the value of Q1 close to F(n, p, err). After a certain number of iterations, we assign the final value of the predicted q1 to q1_t.
The newton code would be like:
/***
* 1 iteration of Newton–Raphson
* The real-valued function is: f(q) = (1 - q) ^ w(0) + (1 - q) ^ w(1) + ... + (1 - q) ^ w(n - 1) - F(n, p, err)
* the function's derivative is: f'(q) = -1 * [w(0) * (1 - q) ^ (w(0) - 1) + (1 - q) ^ (w(1) - 1) + ... (1 - q) ^ (w(n - 1) - 1)]
* given an initial value of q, calculate a better value of q' = q - f(q) / f'(q)
* param q: initial value of q
* param F: F(n, p, err)
* param weights: {w(j), j = 0 to n -1}
***/
static double newton(double q, double F, List weights)
{
double fq = 0;
double fdq = 0;
for (Double weight : weights) {
fq += Math.pow(1.0 - q, weight);
fdq += -1 * Math.pow(1.0 - q, weight - 1.0) * weight;
}
fq -= F;
return q - fq / fdq;
}
> Probability weighted sampling without reservoir
> -----------------------------------------------
>
> Key: DATAFU-21
> URL: https://issues.apache.org/jira/browse/DATAFU-21
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac OS, Linux
> Reporter: jian wang
> Assignee: jian wang
>
> This issue is used to track investigation on finding a weighted sampler without using internal reservoir.
> At present, the SimpleRandomSample has implemented a good acceptance-rejection sampling algo on probability random sampling. The weighted sampler could utilize the simple random sample with slight modification.
> One slight modification is: the present simple random sample generates a uniform random number lies between (0, 1) as the random variable to accept or reject an item. The weighted sample may generate this random variable based on the item's weight and this random number still lies between (0, 1) and each item's random variable remain independent between each other.
> Need further think and experiment the correctness of this solution and how to implement it in an effective way.
--
This message was sent by Atlassian JIRA
(v6.2#6252)