pig-user mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: count distinct using pig?
Date Wed, 18 Mar 2009 21:56:14 GMT
A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
B = group A by $0;
C = foreach B {
        D1 = A.a2;
        D2 = distinct D1;
        E1 = A.a3;
        E2 = distinct E1;
        generate group, COUNT(D2), COUNT(E2);
}
store C into 'output';
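
As a rough cross-check of the grouping logic (plain Python, not Pig), the sketch below computes the distinct a2 and a3 counts per a1 on the sample rows from the question. It assumes blank fields are ordinary values that count toward the distinct total, which is what the desired output implies; how Pig itself treats them may differ. The helper name distinct_counts is made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question; empty strings stand in for the blank fields
# (an assumption: blanks are treated as ordinary values here).
rows = [
    ("x", "X", "a"),
    ("y", "Y", "b"),
    ("x", "XX", "b"),
    ("z", "Z", "c"),
    ("w", "X", ""),
    ("", "W", "d"),
    ("x", "", "b"),
]

def distinct_counts(rows):
    """Return {a1: (distinct a2 count, distinct a3 count)}."""
    a2_seen = defaultdict(set)
    a3_seen = defaultdict(set)
    for a1, a2, a3 in rows:
        # Sets drop duplicates, mirroring 'distinct' inside each group.
        a2_seen[a1].add(a2)
        a3_seen[a1].add(a3)
    return {key: (len(a2_seen[key]), len(a3_seen[key])) for key in a2_seen}

result = distinct_counts(rows)
for key in sorted(result):
    print(",".join([key, str(result[key][0]), str(result[key][1])]))
```

On the sample data this prints the same five lines as the desired output in the question.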

Alan.

On Mar 18, 2009, at 1:43 PM, Avram Aelony wrote:

> Hello Pig list,
>
> I have looked at the 'distinct' keyword, but it does not seem to
> operate on particular fields (columns).  I have a file with
> several categorical variables a1-a3 and am seeking to compute
> distinct counts of fields a2 and a3 by field a1.
>
> How can I get distinct counts?
>
> For example:
> A = load 'test.csv' using PigStorage(',') as (a1,a2,a3);
> /*
> dump A;
> (x, X, a)
> (y, Y, b)
> (x, XX, b)
> (z, Z, c)
> (w, X, )
> (, W, d)
> (x, , b)
> */
>
> B = group A by $0;
> /*
> dump B;
> (, {(, W, d)})
> (w, {(w, X, )})
> (x, {(x, X, a), (x, XX, b), (x, , b)})
> (y, {(y, Y, b)})
> (z, {(z, Z, c)})
> */
>
>
> # how do I get distinct counts by $0 ??
> #Desired output:
> ,1,1
> w,1,1
> x,3,2
> y,1,1
> z,1,1
>
>
> Many thanks,
> Avram
>

