Return-Path: X-Original-To: apmail-datafu-dev-archive@minotaur.apache.org Delivered-To: apmail-datafu-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8AD8810E62 for ; Sun, 6 Apr 2014 12:29:48 +0000 (UTC) Received: (qmail 28191 invoked by uid 500); 6 Apr 2014 12:29:48 -0000 Delivered-To: apmail-datafu-dev-archive@datafu.apache.org Received: (qmail 28147 invoked by uid 500); 6 Apr 2014 12:29:44 -0000 Mailing-List: contact dev-help@datafu.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@datafu.incubator.apache.org Delivered-To: mailing list dev@datafu.incubator.apache.org Received: (qmail 28132 invoked by uid 99); 6 Apr 2014 12:29:40 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 06 Apr 2014 12:29:40 +0000 X-ASF-Spam-Status: No, hits=-2000.3 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Sun, 06 Apr 2014 12:29:38 +0000 Received: (qmail 28112 invoked by uid 99); 6 Apr 2014 12:29:15 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 06 Apr 2014 12:29:15 +0000 Date: Sun, 6 Apr 2014 12:29:14 +0000 (UTC) From: "jian wang (JIRA)" To: dev@datafu.incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Comment Edited] (DATAFU-21) Probability weighted sampling without reservoir MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/DATAFU-21?page=3Dcom.atlassian.= jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D13944= 325#comment-13944325 ]=20 jian wang edited comment on DATAFU-21 at 4/6/14 12:27 PM: ---------------------------------------------------------- Some investigation updates: Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/= data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) =3D 1 = - pow(U, 1/w(j)), U is a random variable between (0,1). Then follow the tho= ught of Random Sort, we sort the items in ascending order based on X(j) and= select the smallest k =3D p * n items. Also as simple random sampling algorithm, we could also consider the possib= ility of rejecting items applying Maurer's lemma and accepting items applyi= ng Bernstein's lemma. Apply Maurer's lemma: we would like to find 0=3D q1) * 0 =3D Pr(1 - pow( U, 1/w(j) ) < q1) =3D Pr(1 - q1 < pow( U, 1/w(j) )) =3D Pr(pow(1 - q1, w(j) ) < U) =3D 1 - pow( 1 - q1, w(j) ) E(Y(j) ^ 2) =3D E(Y(j)) =3D 1 - pow(1 - q1, w(j) ) set Y =3D sum(Y(j), j =3D 1 to n), =20 Q1 =3D sum( pow( 1-q1, w(j) ) , j =3D 1 to n) E(Y) =3D sum(E(Y(j))) =3D n - sum( pow( 1-q1, w(j) ) ) =3D n - Q1 apply Maurer's lemma with t =3D (1 - p) * n - sum( pow(1 - q1, w(j) ) ) =3D= (1 - p) * n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, = I get=20 abs( Q1 - (1 - p) * n - log(err) ) >=3D sqrt( log(err) ^ 2 - 2 * p = * n * log(err) ) (1) we could get q1 by solving (1) Apply Berstein's lemma: similar to applying Maurer's lemma, we could get a q2 so that we could acce= pt item whose key is smaller than q2, 0 <=3D q2 <=3D 1. let Z(j) =3D 1 if X(j) < q2,=20 =3D 0 if X(j) >=3D q2 {Z(j), j =3D 1 to n} are independent random variables. E(Z(j)) =3D Pr(X(j) < q2) * 1 + Pr(X(j) >=3D q2) * 0 =3D Pr(1 - pow( U, 1/w(j) ) < q2) =3D 1 - pow(1 - q2, w(j) ) E(Z(j) ^ 2) =3D E(Z(j)) Z(j) - E(Z(j)) <=3D 1 - E(Z(j)) =3D pow(1 - q2, w(j) ) <=3D 1 =3D M theta(j) ^ 2 =3D E(Z(j) ^ 2) - E(Z(j)) ^ 2 <=3D E(Z(j) ^ 2) =3D 1 - pow(1 -= q2, w(j) ) set Z =3D sum(Z(j), j =3D 1 to n) Q2 =3D sum( pow(1 - q2, w(j) ) ) E(Z) =3D sum(E(Z(j)), j =3D 1 to n) =3D n - sum( pow(1 - q2, w(j) ) )= =3D n - Q2 apply Berstein's lemma with t =3D sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = =3D Q2 - (1 - p) * n, I get =20 Q2 >=3D n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (lo= g(err) - 9 * n * p / 2 ) ) (2) we could get q2 by solving (2) Questions: (1) Please help comment on the above approach. Do they overall make sense?= =20 (2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Wo= uld like to seek some advice on it.=20 Some thoughts on how to resolve this, eg: solve (1) abs( Q1 - (1 - p) * n - log(err) ) >=3D sqrt( log(err) ^ 2 - 2 * p * n * l= og(err) ) =3D F(n, p, err) (1) Q1 =3D sum( pow( 1-q1, w(j) ) , j =3D 1 to n), 0 < q1 < 1 Remove the less-than inequality of (1), our target is to get an approximate= q1 that makes abs( Q1 - (1 - p) * n - log(err) ) close to F(n, p, err), w= e name it as q1_t.=20 was (Author: king821221): Some investigation updates: Based on the theories from paper: http://utopia.duth.gr/~pefraimi/research/= data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) =3D 1 = - pow(U, 1/w(j)), U is a random variable between (0,1). Then follow the tho= ught of Random Sort, we sort the items in ascending order based on X(j) and= select the smallest k =3D p * n items. Also as simple random sampling algorithm, we could also consider the possib= ility of rejecting items applying Maurer's lemma and accepting items applyi= ng Bernstein's lemma. Apply Maurer's lemma: we would like to find 0=3D q1) * 0 =3D Pr(1 - pow( U, 1/w(j) ) < q1) =3D Pr(1 - q1 < pow( U, 1/w(j) )) =3D Pr(pow(1 - q1, w(j) ) < U) =3D 1 - pow( 1 - q1, w(j) ) E(Y(j) ^ 2) =3D E(Y(j)) =3D 1 - pow(1 - q1, w(j) ) set Y =3D sum(Y(j), j =3D 1 to n), =20 Q1 =3D sum( pow( 1-q1, w(j) ) , j =3D 1 to n) E(Y) =3D sum(E(Y(j))) =3D n - sum( pow( 1-q1, w(j) ) ) =3D n - Q1 apply Maurer's lemma with t =3D (1 - p) * n - sum( pow(1 - q1, w(j) ) ) =3D= (1 - p) * n - Q1, since t > 0, Q1 < (1 - p) * n. Solving the inequality, = I get=20 abs( Q1 - (1 - p) * n - log(err) ) >=3D sqrt( log(err) ^ 2 - 2 * p = * n * log(err) ) Further assume Q1 < (1 - p) * n + log(err), which also satisfies t > 0, get Q1 <=3D (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * = log(err) ) (1) we could get q1 by solving (1) Apply Berstein's lemma: similar to applying Maurer's lemma, we could get a q2 so that we could acce= pt item whose key is smaller than q2, 0 <=3D q2 <=3D 1. let Z(j) =3D 1 if X(j) < q2,=20 =3D 0 if X(j) >=3D q2 {Z(j), j =3D 1 to n} are independent random variables. E(Z(j)) =3D Pr(X(j) < q2) * 1 + Pr(X(j) >=3D q2) * 0 =3D Pr(1 - pow( U, 1/w(j) ) < q2) =3D 1 - pow(1 - q2, w(j) ) E(Z(j) ^ 2) =3D E(Z(j)) Z(j) - E(Z(j)) <=3D 1 - E(Z(j)) =3D pow(1 - q2, w(j) ) <=3D 1 =3D M theta(j) ^ 2 =3D E(Z(j) ^ 2) - E(Z(j)) ^ 2 <=3D E(Z(j) ^ 2) =3D 1 - pow(1 -= q2, w(j) ) set Z =3D sum(Z(j), j =3D 1 to n) Q2 =3D sum( pow(1 - q2, w(j) ) ) E(Z) =3D sum(E(Z(j)), j =3D 1 to n) =3D n - sum( pow(1 - q2, w(j) ) )= =3D n - Q2 apply Berstein's lemma with t =3D sum( pow(1 - q2, w(j) ) ) - (1 - p) * n = =3D Q2 - (1 - p) * n, I get =20 Q2 >=3D n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (lo= g(err) - 9 * n * p / 2 ) ) (2) we could get q2 by solving (2) Questions: (1) Please help comment on the above approach. Do they overall make sense?= =20 (2) I am stuck in getting q1 and q2 by solving (1) and (2) respectively. Wo= uld like to seek some advice on it.=20 Some thoughts on how to resolve this, eg: solve (1) Q1 <=3D (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err= ) ) =3D F(n, p, err) (1) Q1 =3D sum( pow( 1-q1, w(j) ) , j =3D 1 to n), 0 < q1 < 1 Remove the less-than inequality of (1), our target is to get an approximate= q1 that makes Q1 close to F(n, p, err), we name it as q1_t.=20 we could observe that: (1) the value of Q1 decreases with the increase of q1. (2) F(n, p, err) >=3D Q1 >=3D sum( pow( 1-q1, wmax ) , j =3D 1 to n) = =3D n * pow( 1 - q1, wmax), wmax is MAX( w(j), j =3D 1 to n ), we could ge= t a lower bound of q1, q1 >=3D 1 - pow( F(n, p, err) / n, 1/wmax), this lo= wer bound decreases when wmax and n increases. Then we start from the lower= bound and try Newton=E2=80=93Raphson method to approach a better q1 that m= akes the value of Q1 close to F(n, p, err). After a certain number of itera= tions, we assign the final value of the predicted q1 to q1_t.=20 The newton code would be like: /*** * 1 iteration of Newton=E2=80=93Raphson * The real-valued function is: f(q) =3D (1 - q) ^ w(0) + (1 - q) ^ w(1) = + ... + (1 - q) ^ w(n - 1) - F(n, p, err) * the function's derivative is: f'(q) =3D -1 * [w(0) * (1 - q) ^ (w(0) = - 1) + (1 - q) ^ (w(1) - 1) + ... (1 - q) ^ (w(n - 1) - 1)] * given an initial value of q, calculate a better value of q' =3D q - f= (q) / f'(q) * param q: initial value of q * param F: F(n, p, err) * param weights: {w(j), j =3D 0 to n -1} ***/ static double newton(double q, double F, List weights) { double fq =3D 0; double fdq =3D 0; for (Double weight : weights) { fq +=3D Math.pow(1.0 - q, weight); fdq +=3D -1 * Math.pow(1.0 - q, weight - 1.0) * weight; } fq -=3D F; return q - fq / fdq; } > Probability weighted sampling without reservoir > ----------------------------------------------- > > Key: DATAFU-21 > URL: https://issues.apache.org/jira/browse/DATAFU-21 > Project: DataFu > Issue Type: New Feature > Environment: Mac OS, Linux > Reporter: jian wang > Assignee: jian wang > > This issue is used to track investigation on finding a weighted sampler w= ithout using internal reservoir.=20 > At present, the SimpleRandomSample has implemented a good acceptance-reje= ction sampling algo on probability random sampling. The weighted sampler co= uld utilize the simple random sample with slight modification. > One slight modification is: the present simple random sample generates a= uniform random number lies between (0, 1) as the random variable to accept= or reject an item. The weighted sample may generate this random variable b= ased on the item's weight and this random number still lies between (0, 1) = and each item's random variable remain independent between each other. > Need further think and experiment the correctness of this solution and ho= w to implement it in an effective way. -- This message was sent by Atlassian JIRA (v6.2#6252)