Sorry Thejas, I didn't look into the jira properly earlier.
EMR pig-0.9.1 already has the patch for PIG-2339, so I did not hit that issue earlier (and I patched datanucleus). filter-union was a workaround I was using to avoid some of the thrift timeout problems. Thrift API calls time out on the client side after 20 sec by default (I found the config to change this later), so I used a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ... b = union b1, b2, ...; expecting these filters to be pushed separately to the loader. But that doesn't work in pig. (I can open a jira, but I haven't done enough investigation at the code level.) Thoughts?


The issue was not specific to filter-union.
The fix was to run PushUpFilter before PartitionFilterOptimizer.

As this is not an hcat issue, it should not matter if you have an older hcat version. FYI, this bug was not there in pig 0.8.x.
Was it pig 0.9.0 or 0.9.1 that you used?


Can you point me to the jira that fixes the filter-union problem (in pig)? I
haven't tried hcat-0.4 yet; good to know about that issue. I will keep a


   Are you using pig 0.9 or 0.9.1?
   If yes, can you try with pig 0.9.2?
   Wondering if you are also hitting the issue that Thomas mentioned.


       Something similar I have noticed is -

       A = load ...
       B1 = filter A by cond1;
       B2 = filter A by cond2;
       B3 = filter A by cond3;

       B = union B1, B2, B3; does not push projection.

       Is that expected?

       Ideally, we should have a "strict" mode under hcatalog that, when
       on, avoids executing pig queries on the full (partitioned) table.


           I am using HCatalog 0.4.

           With a simple PIG script it works great. HCatalog beautifully
           scans only the relevant information. However, a full scan
           happens only when we have a couple of additional joins and
           when we change the order (we also use "using skewed").

           Though we have looked into the debug logs, we saw the number
           of records scanned from the JobTracker's counters itself.
           Without pruning, the m/r job was pretty much scanning the
           entire set of rows.

           I am not sure if there is a corner case wherein the "skewed"
           join is trying to override the filtering.


               What version of HCatalog are you using? How do you know
               it is scanning all the partitions? Does it say so in the
               logs, or are you getting all the records back?

               And yes, HCat is supposed to do partition pruning so that it
               only scans the required partitions.
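
               For reference, a minimal sketch of the kind of script where
               pruning would be expected to apply. The table name 'mytable'
               is hypothetical, and the loader class is assumed to be the
               one shipped with HCatalog 0.4
               (org.apache.hcatalog.pig.HCatLoader). The filter on the
               partition column d comes directly after the load:

                   -- 'mytable' is a placeholder, not a name from this thread
                   a = load 'mytable' using org.apache.hcatalog.pig.HCatLoader();
                   -- d is the partition column from the original question
                   b = filter a by d == '20120415';
                   dump b;

               If the job counters still show every row being read for a
               script of this shape, that would point to pruning not being
               applied rather than to the joins.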


        > Hi All,
        > I have a hcatalog table "partitioned by (d string)".
        > I have a couple of days' worth of data, and when I run "show
        > partitions" it provides the correct data.
        > d=20111215
        > d=20111216
        > d=20111217
        > d=20111218
        > d=20111219
        > d=20111220
        > d=20111221
        > d=20111222
        > d=20111223
        > d=20111224
        > d=20111225
        > d=20120415
        > However, when I run PIG with "filter a by d == '20120415'",
        > it ends up scanning all data.
        > Is this a known bug/enhancement in HCatalog? Ideally,
        > shouldn't it scan only the d=20120415 directory?
        > Any pointers would be of great help.
