Mailing-List: contact dev-help@datafu.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@datafu.incubator.apache.org
Date: Sun, 6 Apr 2014 12:29:14 +0000 (UTC)
From: "jian wang (JIRA)" <jira@apache.org>
To: dev@datafu.incubator.apache.org
Message-ID: <JIRA.12691185.1390623924140.70966.1396787354633@arcas>
In-Reply-To: <JIRA.12691185.1390623924140@arcas>
References: <JIRA.12691185.1390623924140@arcas>
Subject: [jira] [Comment Edited] (DATAFU-21) Probability weighted sampling
 without reservoir
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/DATAFU-21?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944325#comment-13944325 ]

jian wang edited comment on DATAFU-21 at 4/6/14 12:27 PM:
----------------------------------------------------------

Some investigation updates:

Based on the theory in the paper http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), where U is a uniform random variable on (0,1) drawn independently for each item. Then, following the idea of Random Sort, we sort the items in ascending order of X(j) and select the smallest k = p * n items.
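
The key-and-sort idea above can be sketched as follows. This is only a minimal illustration, not the DataFu implementation: the class name WeightedKeySample and the methods key and sample are hypothetical, weights are assumed to be strictly positive, and the whole input is held in memory, which the final sampler is meant to avoid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class WeightedKeySample {
    /** Key X = 1 - U^(1/w) for a uniform draw u and a weight w > 0 (hypothetical helper). */
    static double key(double u, double weight) {
        return 1.0 - Math.pow(u, 1.0 / weight);
    }

    /** Returns the indices of the k items with the smallest keys. */
    static List<Integer> sample(double[] weights, int k, Random rng) {
        int n = weights.length;
        double[] keys = new double[n];
        List<Integer> order = new ArrayList<>();
        for (int j = 0; j < n; j++) {
            // nextDouble() is in [0, 1); u = 0 simply yields the largest possible key, 1.
            keys[j] = key(rng.nextDouble(), weights[j]);
            order.add(j);
        }
        // Ascending order of X(j): heavier items tend to get smaller keys.
        order.sort(Comparator.comparingDouble(j -> keys[j]));
        return new ArrayList<>(order.subList(0, Math.min(k, n)));
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.0, 5.0, 0.5};
        System.out.println(sample(weights, 2, new Random(42)));
    }
}
```

Sorting all n keys is what the acceptance and rejection thresholds discussed below are meant to cut down.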

Also, as in the simple random sampling algorithm, we could consider rejecting items by applying Maurer's lemma and accepting items by applying Bernstein's lemma.

Apply Maurer's lemma:

We would like to find 0 < q1 < 1 such that we reject items whose key is greater than q1.

let Y(j) = 1 if X(j) < q1
         = 0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
        = Pr(1 - pow(U, 1/w(j)) < q1)
        = Pr(1 - q1 < pow(U, 1/w(j)))
        = Pr(pow(1 - q1, w(j)) < U) = 1 - pow(1 - q1, w(j))

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j))

set Y = sum(Y(j), j = 1 to n),

    Q1 = sum(pow(1 - q1, w(j)), j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum(pow(1 - q1, w(j))) = n - Q1
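
As a quick sanity check of the expectation above, Q1 and E(Y) can be evaluated directly for a candidate threshold. A minimal sketch, with hypothetical names (ExpectedCount, sumPow, expectedBelow) and assuming strictly positive weights:

```java
public class ExpectedCount {
    /** Q(q) = sum over j of (1 - q)^w(j). */
    static double sumPow(double q, double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += Math.pow(1.0 - q, w);
        }
        return total;
    }

    /** E(Y) = n - Q(q): expected number of items whose key falls below q. */
    static double expectedBelow(double q, double[] weights) {
        return weights.length - sumPow(q, weights);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 1.0, 1.0, 1.0};
        // With unit weights the key is uniform, so E(Y) reduces to n * q.
        System.out.println(expectedBelow(0.25, weights));
    }
}
```

With all weights equal to 1 this reduces to E(Y) = n * q1, which is the unweighted simple random sample case.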

apply Maurer's lemma with t = (1 - p) * n - sum(pow(1 - q1, w(j))) = (1 - p) * n - Q1; since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
        abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) )    (1)

we could get q1 by solving (1)

Apply Bernstein's lemma:

Similar to applying Maurer's lemma, we could get a q2, 0 <= q2 <= 1, such that we accept items whose key is smaller than q2.

let Z(j) = 1 if X(j) < q2,
         = 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
        = Pr(1 - pow(U, 1/w(j)) < q2)
        = 1 - pow(1 - q2, w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j)) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j))

set Z = sum(Z(j), j = 1 to n)

    Q2 = sum(pow(1 - q2, w(j)), j = 1 to n)

    E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum(pow(1 - q2, w(j))) = n - Q2

apply Bernstein's lemma with t = sum(pow(1 - q2, w(j))) - (1 - p) * n = Q2 - (1 - p) * n, I get
    Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err) - 9 * n * p / 2) )     (2)

we could get q2 by solving (2)
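
The right-hand sides of (1) and (2) depend only on n, p and err, so they can be computed once before looking for q1 and q2. A minimal sketch that just evaluates the two expressions as written above; the names BoundTargets, maurerRhs and bernsteinRhs are hypothetical, and it assumes log is the natural logarithm and 0 < err <= 1:

```java
public class BoundTargets {
    /** Right-hand side of (1): sqrt( log(err)^2 - 2 * p * n * log(err) ). */
    static double maurerRhs(long n, double p, double err) {
        double l = Math.log(err); // natural log, <= 0 for err in (0, 1]
        return Math.sqrt(l * l - 2.0 * p * n * l);
    }

    /** Right-hand side of (2): n(1 - p) + 2/3 log(err) + 2/3 sqrt( log(err) (log(err) - 9np/2) ). */
    static double bernsteinRhs(long n, double p, double err) {
        double l = Math.log(err);
        return n * (1.0 - p) + 2.0 / 3.0 * l
                + 2.0 / 3.0 * Math.sqrt(l * (l - 9.0 * n * p / 2.0));
    }

    public static void main(String[] args) {
        System.out.println(maurerRhs(100, 0.1, 1e-4));
        System.out.println(bernsteinRhs(100, 0.1, 1e-4));
    }
}
```

For err = 1 the log terms vanish, so the first value is 0 and the second is n * (1 - p), which is a handy check on the signs.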

Questions:

(1) Please comment on the above approach. Does it make sense overall?

(2) I am stuck on getting q1 and q2 by solving (1) and (2) respectively, and would like some advice on it.

Some thoughts on how to resolve this, e.g. solving (1):

 abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) = F(n, p, err)     (1)

 Q1 = sum( pow(1 - q1, w(j)), j = 1 to n ),  0 < q1 < 1

Dropping the inequality in (1), our target is an approximate q1 that makes abs( Q1 - (1 - p) * n - log(err) ) close to F(n, p, err); we name it q1_t.
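
One possible way to get such a q1_t, once a target value for Q1 has been chosen, is plain bisection: for positive weights Q1 falls strictly from n at q1 = 0 to 0 at q1 = 1, so any target in (0, n) has exactly one solution. This is a swapped-in technique and only a sketch; the names ThresholdBisection, sumPow and solve are hypothetical, and the caller is assumed to have already turned (1) into a single target value for Q1.

```java
public class ThresholdBisection {
    /** Q(q) = sum over j of (1 - q)^w(j). */
    static double sumPow(double q, double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += Math.pow(1.0 - q, w);
        }
        return total;
    }

    /**
     * Finds q in (0, 1) with Q(q) approximately equal to target.
     * Assumes positive weights and 0 < target < weights.length.
     */
    static double solve(double target, double[] weights, double tol) {
        double lo = 0.0; // Q(lo) >= target
        double hi = 1.0; // Q(hi) <= target
        while (hi - lo > tol) {
            double mid = 0.5 * (lo + hi);
            if (sumPow(mid, weights) > target) {
                lo = mid; // Q still too large, the root lies to the right
            } else {
                hi = mid;
            }
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.0, 5.0, 0.5};
        System.out.println(solve(1.5, weights, 1e-9));
    }
}
```

Each bisection step costs one pass over the weights, so in practice the number of iterations would have to stay small.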


was (Author: king821221):
Some investigation updates:

Based on the theory in the paper http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf, I plan to associate each item with a key X(j) = 1 - pow(U, 1/w(j)), where U is a uniform random variable on (0,1) drawn independently for each item. Then, following the idea of Random Sort, we sort the items in ascending order of X(j) and select the smallest k = p * n items.

Also, as in the simple random sampling algorithm, we could consider rejecting items by applying Maurer's lemma and accepting items by applying Bernstein's lemma.

Apply Maurer's lemma:

We would like to find 0 < q1 < 1 such that we reject items whose key is greater than q1.

let Y(j) = 1 if X(j) < q1
         = 0 otherwise

{Y(j), j = 1 to n} are independent random variables.

E(Y(j)) = Pr(X(j) < q1) * 1 + Pr(X(j) >= q1) * 0
        = Pr(1 - pow(U, 1/w(j)) < q1)
        = Pr(1 - q1 < pow(U, 1/w(j)))
        = Pr(pow(1 - q1, w(j)) < U) = 1 - pow(1 - q1, w(j))

E(Y(j) ^ 2) = E(Y(j)) = 1 - pow(1 - q1, w(j))

set Y = sum(Y(j), j = 1 to n),

    Q1 = sum(pow(1 - q1, w(j)), j = 1 to n)

E(Y) = sum(E(Y(j))) = n - sum(pow(1 - q1, w(j))) = n - Q1

apply Maurer's lemma with t = (1 - p) * n - sum(pow(1 - q1, w(j))) = (1 - p) * n - Q1; since t > 0, Q1 < (1 - p) * n. Solving the inequality, I get
        abs( Q1 - (1 - p) * n - log(err) ) >= sqrt( log(err) ^ 2 - 2 * p * n * log(err) )
Further assuming Q1 < (1 - p) * n + log(err), which also satisfies t > 0, I get
        Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) )     (1)

we could get q1 by solving (1)

Apply Bernstein's lemma:

Similar to applying Maurer's lemma, we could get a q2, 0 <= q2 <= 1, such that we accept items whose key is smaller than q2.

let Z(j) = 1 if X(j) < q2,
         = 0 if X(j) >= q2

{Z(j), j = 1 to n} are independent random variables.

E(Z(j)) = Pr(X(j) < q2) * 1 + Pr(X(j) >= q2) * 0
        = Pr(1 - pow(U, 1/w(j)) < q2)
        = 1 - pow(1 - q2, w(j))

E(Z(j) ^ 2) = E(Z(j))

Z(j) - E(Z(j)) <= 1 - E(Z(j)) = pow(1 - q2, w(j)) <= 1 = M

theta(j) ^ 2 = E(Z(j) ^ 2) - E(Z(j)) ^ 2 <= E(Z(j) ^ 2) = 1 - pow(1 - q2, w(j))

set Z = sum(Z(j), j = 1 to n)

    Q2 = sum(pow(1 - q2, w(j)), j = 1 to n)

    E(Z) = sum(E(Z(j)), j = 1 to n) = n - sum(pow(1 - q2, w(j))) = n - Q2

apply Bernstein's lemma with t = sum(pow(1 - q2, w(j))) - (1 - p) * n = Q2 - (1 - p) * n, I get
    Q2 >= n * (1 - p) + 2 / 3 * log(err) + 2 / 3 * sqrt( log(err) * (log(err) - 9 * n * p / 2) )     (2)

we could get q2 by solving (2)

Questions:

(1) Please comment on the above approach. Does it make sense overall?

(2) I am stuck on getting q1 and q2 by solving (1) and (2) respectively, and would like some advice on it.

Some thoughts on how to resolve this, e.g. solving (1):

 Q1 <= (1 - p) * n + log(err) - sqrt( log(err) ^ 2 - 2 * p * n * log(err) ) = F(n, p, err)     (1)

 Q1 = sum( pow(1 - q1, w(j)), j = 1 to n ),  0 < q1 < 1

Dropping the inequality in (1), our target is an approximate q1 that makes Q1 close to F(n, p, err); we name it q1_t.

we could observe that:
     (1) the value of Q1 decreases as q1 increases.
     (2) F(n, p, err) >= Q1 >= sum( pow(1 - q1, wmax), j = 1 to n ) = n * pow(1 - q1, wmax), where wmax is MAX( w(j), j = 1 to n ), so we get a lower bound on q1: q1 >= 1 - pow( F(n, p, err) / n, 1/wmax ). This lower bound decreases when wmax and n increase. We then start from the lower bound and use the Newton–Raphson method to approach a better q1 that makes Q1 close to F(n, p, err). After a certain number of iterations, we assign the final value of the predicted q1 to q1_t.

The Newton code would look like:
import java.util.List;

/**
 * One iteration of Newton–Raphson.
 * The real-valued function is: f(q) = (1 - q) ^ w(0) + (1 - q) ^ w(1) + ... + (1 - q) ^ w(n - 1) - F(n, p, err)
 * Its derivative is: f'(q) = -1 * [w(0) * (1 - q) ^ (w(0) - 1) + w(1) * (1 - q) ^ (w(1) - 1) + ... + w(n - 1) * (1 - q) ^ (w(n - 1) - 1)]
 * Given an initial value q, calculate a better value q' = q - f(q) / f'(q).
 *
 * @param q       initial value of q
 * @param F       F(n, p, err)
 * @param weights {w(j), j = 0 to n - 1}
 */
static double newton(double q, double F, List<Double> weights)
{
    double fq = 0;   // accumulates sum( (1 - q) ^ w(j) ), F is subtracted after the loop
    double fdq = 0;  // accumulates f'(q)
    for (Double weight : weights) {
        fq += Math.pow(1.0 - q, weight);
        fdq += -1 * Math.pow(1.0 - q, weight - 1.0) * weight;
    }
    fq -= F;
    // For q in (0, 1) and a non-empty list of positive weights, fdq is strictly negative,
    // so the division below is safe.
    return q - fq / fdq;
}
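
The single step above could be driven from the lower bound roughly as follows. This is only a sketch under assumptions: the class name NewtonThreshold, the methods lowerBound and solve, and the maxIter / tol parameters are hypothetical, and the weights are assumed to be positive with 0 < F < n. When every weight is >= 1, f is convex and decreasing, so starting at the lower bound (where f >= 0) the iterates move towards the root from the left.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NewtonThreshold {
    /** One Newton-Raphson step for f(q) = sum( (1 - q)^w(j) ) - F. */
    static double newton(double q, double F, List<Double> weights) {
        double fq = 0;
        double fdq = 0;
        for (Double weight : weights) {
            fq += Math.pow(1.0 - q, weight);
            fdq += -1 * Math.pow(1.0 - q, weight - 1.0) * weight;
        }
        fq -= F;
        return q - fq / fdq;
    }

    /** Lower bound q >= 1 - (F / n)^(1 / wmax) from observation (2). */
    static double lowerBound(double F, List<Double> weights) {
        double wmax = Collections.max(weights);
        return 1.0 - Math.pow(F / weights.size(), 1.0 / wmax);
    }

    /** Iterates from the lower bound until the step is below tol or maxIter is reached. */
    static double solve(double F, List<Double> weights, int maxIter, double tol) {
        double q = lowerBound(F, weights);
        for (int i = 0; i < maxIter; i++) {
            double next = newton(q, F, weights);
            // Stop on a step that leaves (0, 1) and keep the last valid estimate.
            if (Double.isNaN(next) || next <= 0.0 || next >= 1.0) {
                break;
            }
            if (Math.abs(next - q) < tol) {
                return next;
            }
            q = next;
        }
        return q;
    }

    public static void main(String[] args) {
        List<Double> weights = Arrays.asList(1.0, 2.0);
        System.out.println(solve(0.75, weights, 50, 1e-12));
    }
}
```

Each iteration is a full pass over the weights, so the iteration cap matters as much as the tolerance.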


> Probability weighted sampling without reservoir
> -----------------------------------------------
>
>                 Key: DATAFU-21
>                 URL: https://issues.apache.org/jira/browse/DATAFU-21
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac OS, Linux
>            Reporter: jian wang
>            Assignee: jian wang
>
> This issue tracks the investigation into a weighted sampler that does not use an internal reservoir.
> At present, SimpleRandomSample implements a good acceptance-rejection sampling algorithm for probability random sampling. The weighted sampler could reuse the simple random sample with a slight modification.
> The modification: the present simple random sample generates a uniform random number in (0, 1) as the random variable used to accept or reject an item. The weighted sampler may instead generate this random variable based on the item's weight, such that it still lies in (0, 1) and the items' random variables remain independent of each other.
> Further thought and experiments are needed on the correctness of this solution and on how to implement it efficiently.

