incubator-hcatalog-user mailing list archives

From Thejas Nair <the...@hortonworks.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 23:51:16 GMT
Hi Aniket,
Are you using Pig 0.9 or 0.9.1?
If so, can you try with Pig 0.9.2?
I wonder whether you are also hitting the issue that Thomas mentioned.
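
For reference, here is a minimal sketch of the kind of script under discussion. The table name `mytable` is hypothetical; the partition column `d` and the value '20120415' come from the original report. As far as I know, the HCatalog documentation recommends placing the partition filter immediately after the load statement so that it can be pushed down to the loader. Treat this as an illustration, not a confirmed fix for the issue in this thread.

```pig
-- 'mytable' is a hypothetical table name; d is the partition column from the report.
a = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();

-- Filter on the partition column directly after the load, before any joins.
b = FILTER a BY d == '20120415';

DUMP b;
```

If the JobTracker counters still show the whole table being read for a script of this shape, that would point at the Pig version rather than at the script.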

Thanks,
Thejas



On 4/23/12 7:39 PM, Aniket Mokashi wrote:
> Something similar I have noticed is -
>
> A = load ...
> B1 = filter A by cond1;
> B2 = filter A by cond2;
> B3 = filter A by cond3;
>
> B = union B1, B2, B3;
>
> This does not push projection.
>
> Is that expected?
>
> Ideally, HCatalog should have a "strict" mode that, when turned on,
> prevents Pig queries from running on the full (partitioned) table.
>
> Thanks,
> Aniket
>
> On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan
> <rajesh.balamohan@gmail.com> wrote:
>
>     Hi Alan,
>
>     Thanks for the quick response.
>
>     I am using HCatalog 0.4.
>
>     With a simple Pig script it works great. HCatalog scans only the
>     relevant partitions. However, a full scan happens when we have a
>     couple of additional joins and when we change the INNER JOIN
>     order (we also use "using skewed").
>
>     We have looked into the debug logs, but what we actually observed
>     was the number of records scanned, taken from the JobTracker's
>     counters. Without pruning, the M/R job was scanning pretty much
>     the entire set of rows.
>
>     I am not sure whether there is a corner case in which the "skewed"
>     join overrides the filtering.
>
>     ~Rajesh.B
>
>
>
>     On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates
>     <gates@hortonworks.com> wrote:
>
>         What version of HCatalog are you using?  How do you know it is
>         scanning all the partitions? Does it say so in the logs, or are
>         you getting all the records back?
>
>         And yes, HCat is supposed to do partition pruning so that it
>         only scans the required partitions.
>
>         Alan.
>
>         On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>
>          > Hi All,
>          >
>          > I have an HCatalog table "partitioned by (d string)".
>          >
>          > I have a couple of days' worth of data, and when I run "show
>         partitions" it returns the correct data.
>          >
>          > d=20111215
>          > d=20111216
>          > d=20111217
>          > d=20111218
>          > d=20111219
>          > d=20111220
>          > d=20111221
>          > d=20111222
>          > d=20111223
>          > d=20111224
>          > d=20111225
>          > d=20120415
>          >
>          > However, when I run Pig with "filter a by d == '20120415'",
>         it ends up scanning all the data.
>          >
>          > Is this a known bug/enhancement in HCatalog? Ideally,
>         shouldn't it scan only the d=20120415 directory?
>          >
>          > Any pointers would be of great help.
>          >
>          >
>          > --
>          > ~Rajesh.B
>
>
>
>
>     --
>     ~Rajesh.B
>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"

