pig-user mailing list archives

From Barclay Dunn <barclay.d...@gmail.com>
Subject Re: comparing two files using pig
Date Fri, 21 Jun 2013 13:44:26 GMT
The introductory theory of this is awesome. "A+++++ would read again" ;)

On 6/21/13 9:38 AM, Jacob Perkins wrote:
> Now here's where it gets fun :)
>
> First, I do want to show you that (given sufficient coffee) there is a set-theoretic
> approach to your first question that allows you to solve it with just one map-reduce job (a
> single cogroup) and not two (a cogroup followed by a group). Consider two sets, A and B, where
> |A| is the number of elements in A and |B| is the number of elements in B.
>
> Let |AUB| be the size of the set union of A and B. Note, Pig does not have a set union
> operator. The UNION operator in Pig is a misnomer (it concatenates relations without removing
> duplicates). Plus, you can't use it in a nested projection, which is frustrating...
> Let |A^B| be the size of the set intersection of A and B (the number of elements that
> are in BOTH A and B).
>
> What you're technically after is |A^B|. However, since Pig does not have a set intersection
> operator, and I'm assuming writing a UDF is out of the question for you, we can be a bit more
> clever. As it turns out, Pig has a DIFF function. It takes two bags (basically sets, although
> duplicate elements are allowed) and returns all the elements that are in either bag but NOT
> in both. Notice:
>
> |AUB| = |A^B| + |DIFF(A,B)| and
> |AUB| = |A| + |B| - |A^B| therefore
>
> |A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )
>
> All of which we can compute with native Pig :)
>
> So:
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by q, B by q) {
>             a_size     = COUNT(A);         -- |A|
>             b_size     = COUNT(B);         -- |B|
>             diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)|
>             match_size = (a_size + b_size - diff_size)/2L; -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
>             generate
>               group as q,
>               match_size;
>           };
>
> dump counts;
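
For anyone who wants to check the arithmetic without running a job, here is a minimal sketch in plain Python (not Pig) of what the script above computes per key. It assumes the rows of each group behave as sets (no duplicate rows within a file) and models DIFF as a symmetric difference. The function name and the inlined sample rows are only for illustration; file2 is the variant from this thread with the extra "q2 d4" row.

```python
from collections import defaultdict

def match_counts_by_q(rows_a, rows_b):
    """Per q, return |A ^ B| derived from |A|, |B| and |DIFF(A,B)|."""
    group_a, group_b = defaultdict(set), defaultdict(set)
    for q, d in rows_a:
        group_a[q].add((q, d))
    for q, d in rows_b:
        group_b[q].add((q, d))

    counts = {}
    # One pass over every key seen in either input, like a single cogroup.
    for q in sorted(set(group_a) | set(group_b)):
        a, b = group_a[q], group_b[q]
        diff = a.symmetric_difference(b)  # assumed stand-in for DIFF(A,B)
        counts[q] = (len(a) + len(b) - len(diff)) // 2
    return counts

file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]
print(match_counts_by_q(file1, file2))
```

For q2 all three rows land in the difference, so the count is (2 + 1 - 3) / 2 = 0 even though both files contain q2.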
>
>
>
> Alright, back to your other issue of adding the matching elements. Again, if you were
> up for it, you could simply write a set intersection UDF and be done with it. Otherwise, here's
> what I came up with:
>
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>              num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>              generate
>                flatten(group) as (q,d),
>                num_matches    as num_matches;
>            };
>
> all_matches = foreach (group counts by q) {
>                  match_set = filter counts by num_matches > 0;
>                  match_set = match_set.d;
>                  generate
>                    group as q,
>                    SUM(counts.num_matches) as total_matches,
>                    match_set as match_set;
>                };
>                
>
> dump all_matches;
>
> (q1,2,{(d1),(d2)})
> (q2,0,{})
> (q3,0,{})
>
> The empty curly braces indicate bags that contain no tuples.
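
The two-step logic above (minimum of the two counts per (q,d) pair, then a sum and a collected bag per q) can be sketched in plain Python as follows. This is only a model of the Pig script under the assumption that rows are (q, d) pairs; the function name and sample rows are illustrative and not part of the original scripts.

```python
from collections import Counter

def matches_per_q(rows_a, rows_b):
    """Map q -> (total_matches, list of matching d values)."""
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    result = {}
    # Visit every (q, d) pair seen in either input, as the cogroup does.
    for q, d in sorted(set(count_a) | set(count_b)):
        num_matches = min(count_a[(q, d)], count_b[(q, d)])
        total, match_set = result.get(q, (0, []))
        if num_matches > 0:
            match_set = match_set + [d]
        result[q] = (total + num_matches, match_set)
    return result

file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]
for q, (total, match_set) in matches_per_q(file1, file2).items():
    print(q, total, match_set)
```

A key that has no matching pair still appears in the result, with a total of 0 and an empty list, which mirrors the empty bags in the dump above.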
>
> --jacob
> @thedatachef
>
> On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:
>
>> Thanks a lot, the solution worked fine. Is it possible to also display the comma-separated
>> matching d's?
>>
>> For ex
>> (q1,2, {d1,d2})
>> (q2,0)
>> (q3,0)
>>
>> -----Original Message-----
>> From: Chris Hokamp [mailto:chris.hokamp@gmail.com]
>> Sent: Friday, June 21, 2013 1:52 AM
>> To: user@pig.apache.org; Barclay Dunn
>> Subject: Re: comparing two files using pig
>>
>> Z
>>
>>
>> Sent from Samsung Mobile
>>
>> -------- Original message --------
>> From: Jacob Perkins <jacob.a.perkins@gmail.com>
>> Date: 20/06/2013  20:30  (GMT+00:00)
>> To: Barclay Dunn <barclay.dunn@gmail.com>
>> Cc: user@pig.apache.org
>> Subject: Re: comparing two files using pig
>>
>> I did not read your original post clearly enough. I didn't realize both the d AND
>> the q had to match. It's only slightly more complex: just add the d column to the cogroup
>> statement and sum the number of matches:
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray);
>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>
>> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>>              num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>              generate
>>                flatten(group) as (q,d),
>>                num_matches    as num_matches;
>>            };
>>
>> all_matches = foreach (group counts by q) generate group as q,
>> SUM(counts.num_matches) as total_matches;
>>
>> dump all_matches;
>>
>> (q1,2)
>> (q2,0)
>> (q3,0)
>>
>> --jacob
>> @thedatachef
>>
>> On 06/20/2013 02:06 PM, Barclay Dunn wrote:
>>> Jacob,
>>>
>>> If I run that code with an added row in file2.txt, e.g.,
>>>
>>>    $ cat file2.txt
>>> q1 d1
>>> q1 d2
>>> q3 d3
>>> q2 d4
>>>
>>> This gives me mistaken results, i.e.,
>>>
>>> (q1,2)
>>> (q2,1)
>>> (q3,0)
>>>
>>>
>>> I am new at this so I apologize for the ponderous pace of the
>>> following. It can no doubt be shortened. But it gets the correct
>>> results with either data set.
>>>
>>> set io.sort.mb 10;         -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
>>>
>>> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as (aa:chararray, ab:chararray);
>>> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as (ba:chararray, bb:chararray);
>>>
>>> C = UNION A, B;
>>> D = COGROUP C by ($0, $1);
>>>
>>> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>>>
>>> G0 = FILTER F BY $2 > 1;   -- any that match
>>> G1 = FILTER F BY $2 < 2;   -- any that don't match
>>>
>>> H0 = GROUP G0 BY $0;
>>> H1 = GROUP G1 BY $0;
>>>
>>>
>>> J0 = FOREACH H0 GENERATE $0, COUNT($1);
>>> J1 = FOREACH H1 GENERATE $0, 0;
>>>
>>> K = UNION J0, J1;
>>>
>>> DUMP K;
>>> /*
>>> (q2,0)
>>> (q3,0)
>>> (q1,2)
>>> */
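
As a rough cross-check, the union-and-count script above can be modelled in plain Python. This is a sketch, not a translation verified against Pig: it assumes neither file repeats a (q, d) pair (otherwise a count above 1 no longer implies a match across files), and the function name and sample rows are made up for illustration.

```python
from collections import Counter

def union_group_counts(rows_a, rows_b):
    """Model of steps C..K: union, count per (q, d), split, regroup by q."""
    pair_counts = Counter(list(rows_a) + list(rows_b))                    # C, D, F
    matched = Counter(q for (q, d), n in pair_counts.items() if n > 1)    # G0, H0, J0
    unmatched_qs = {q for (q, d), n in pair_counts.items() if n < 2}      # G1, H1
    zero_rows = [(q, 0) for q in unmatched_qs]                            # J1
    return sorted(list(matched.items()) + zero_rows)                      # K

file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]
print(union_group_counts(file1, file2))

# A q with both matching and non-matching pairs shows up twice in K.
print(union_group_counts(file1, file2 + [("q2", "d3")]))
```

On the data from this thread the model gives q1 2, q2 0, q3 0. In the second call q2 yields both a (q2, 1) row and a (q2, 0) row, because the final union does not merge rows for the same key; if that case matters, the two branches would need to be combined per q before the output.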
>>>
>>>
>>> Barclay Dunn
>>>
>>>
>>> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>>>> Hi,
>>>>
>>>> This should just be a simple cogroup.
>>>>
>>>> A = load 'file1.txt' as (q:chararray, d:chararray);
>>>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>>>
>>>> counts = foreach (cogroup A by q, B by q) {
>>>>                    num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>>>                    generate
>>>>                      group       as q,
>>>>                      num_matches as num_matches;
>>>>                 };
>>>>
>>>> dump counts;
>>>>
>>>> (q1,2)
>>>> (q2,0)
>>>> (q3,0)
>>>>
>>>> --jacob
>>>> @thedatachef
>>>>
>>>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a problem statement wherein I have to compare two files and get
>>>>> the count of matching attributes.
>>>>>
>>>>> For ex:
>>>>> File 1:  file1.txt
>>>>>
>>>>> q1           d1
>>>>> q1           d2
>>>>> q2           d3
>>>>> q2           d1
>>>>>
>>>>> File 2: file2.txt
>>>>> q1           d1
>>>>> q1           d2
>>>>> q3           d3
>>>>>
>>>>> Now what I need is for each distinct q  the count of matching d's
>>>>>
>>>>> For ex, the output should be
>>>>> q1           2  (q1 d1 and q1 d2 are matching in both the files, hence count is 2)
>>>>> q2           0 (has no d's matching)
>>>>> q3           0
>>>>>
>>>>> Any idea how this can be achieved?
>>>>>
>>>>> Thnx in advance
>>>>>
>>>>> -Sid
>>>>>
>>>>>
>>>>>
>>>>> DISCLAIMER
>>>>> ==========
>>>>> This e-mail may contain privileged and confidential information which
is the property of Persistent Systems Ltd. It is intended only for the use of the individual
or entity to which it is addressed. If you are not the intended recipient, you are not authorized
to read, retain, copy, print, distribute or use this message. If you have received this communication
in error, please notify the sender and delete all copies of this message. Persistent Systems
Ltd. does not accept any liability for virus infected mails.

