incubator-hcatalog-user mailing list archives

From Thejas Nair <the...@hortonworks.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Wed, 25 Apr 2012 02:00:48 GMT
The issue was not specific to filter-union:
https://issues.apache.org/jira/browse/PIG-2339
The fix was to run PushUpFilter before PartitionFilterOptimizer.

As this is not an hcat issue, it should not matter if you have an older
hcat version. FYI, this bug was not present in pig 0.8.x.
Was it pig 0.9.0 or 0.9.1 that you used?
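
On an affected version, one thing that may help is keeping the partition
filter directly after the load, ahead of any join. The following is an
untested sketch only: the table names (web_logs, users) and the user_id
column are made up for illustration, and web_logs is assumed to be
partitioned by d as in the original report.

```pig
-- Hypothetical tables; 'web_logs' is assumed to be partitioned by (d string).
logs   = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();
-- Partition filter placed immediately after the LOAD, before any join.
recent = FILTER logs BY d == '20120415';
users  = LOAD 'users' USING org.apache.hcatalog.pig.HCatLoader();
-- user_id is an assumed column present in both tables.
joined = JOIN recent BY user_id, users BY user_id;
```

With the filter in that position, the partition condition does not depend
on another rule first moving the filter up next to the load.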

Thanks,
Thejas


On 4/24/12 5:21 PM, Aniket Mokashi wrote:
> Hi Thejas,
>
> Can you point me to the jira that fixes the filter-union problem (in pig)? I
> haven't tried hcat-0.4 yet; good to know about that issue. I will keep a
> watcher.
>
> Thanks,
> Aniket
>
> On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>
>     Hi Aniket,
>     Are you using pig 0.9 or 0.9.1?
>     If yes, can you try with pig 0.9.2?
>     Wondering if you are also hitting the issue that Thomas mentioned.
>
>     Thanks,
>     Thejas
>
>
>
>
>     On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>
>         Something similar I have noticed is -
>
>         A = load ...
>         B1 = filter A by cond1;
>         B2 = filter A by cond2;
>         B3 = filter A by cond3;
>
>         B = union B1, B2, B3;
>
>         does not push projection.
>
>         Is that expected?
>
>         Ideally, we should have a "strict" mode under hcatalog that, when
>         turned on, will avoid executing pig queries on the full
>         (partitioned) table.
>
>         Thanks,
>         Aniket
>
>         On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan
>         <rajesh.balamohan@gmail.com> wrote:
>
>             Hi Alan,
>
>             Thanks for the quick response.
>
>             I am using HCatalog 0.4.
>
>             With a simple PIG script it works great. HCatalog beautifully
>             scans only the relevant information. However, a full scan
>             happens only when we have a couple of additional joins and when
>             we change the INNER JOIN order (we also use "using skewed").
>
>             Though we have looked into the debug logs, we saw the number of
>             records scanned from the JobTracker's counters itself. Without
>             pruning, the m/r job was pretty much scanning the entire set
>             of rows.
>
>             I am not sure if there is a corner case wherein the "skewed"
>             join is trying to override the filtering.
>
>             ~Rajesh.B
>
>
>
>             On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates
>             <gates@hortonworks.com> wrote:
>
>                 What version of HCatalog are you using?  How do you know
>                 it is scanning all the partitions? Does it say so in the
>                 logs, or are you getting all the records back?
>
>                 And yes, HCat is supposed to do partition pruning so that it
>                 only scans the required partitions.
>
>                 Alan.
>
>                 On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>
>          > Hi All,
>          >
>          > I have a hcatalog table "partitioned by (d string)".
>          >
>          > I have a couple of days' worth of data, and when I run
>          > "show partitions" it provides the correct data.
>          >
>          > d=20111215
>          > d=20111216
>          > d=20111217
>          > d=20111218
>          > d=20111219
>          > d=20111220
>          > d=20111221
>          > d=20111222
>          > d=20111223
>          > d=20111224
>          > d=20111225
>          > d=20120415
>          >
>          > However, when I run PIG with "filter a by d == '20120415'",
>          > it ends up scanning all data.
>          >
>          > Is this a known bug/enhancement in HCatalog? Ideally,
>          > shouldn't it scan only the d=20120415 directory?
>          >
>          > Any pointers would be of great help.
>          >
>          >
>          > --
>          > ~Rajesh.B
>
>
>
>
>             --
>             ~Rajesh.B
>
>
>
>
>         --
>         "...:::Aniket:::... Quetzalco@tl"
>
>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"

