incubator-hcatalog-user mailing list archives

From Aniket Mokashi <aniket...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 02:39:06 GMT
Something similar I have noticed is -

A = load ...
B1 = filter A by cond1;
B2 = filter A by cond2;
B3 = filter A by cond3;

B = union B1, B2, B3; does not push projection.

Is that expected?

Ideally, we should have a "strict" mode in HCatalog that, when turned on,
prevents Pig queries from running on the full (partitioned) table.
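
For reference, here is a minimal sketch of the kind of query such a mode
would still allow: one that restricts the partition column right after the
load, before any join or union. The table name 'events' is hypothetical (the
thread only says the table is partitioned by (d string)), and the loader
class name assumes the incubator-era HCatalog 0.4 package naming.

```pig
-- Hypothetical table name; only the partition column d comes from the thread.
A = LOAD 'events' USING org.apache.hcatalog.pig.HCatLoader();

-- Filter on the partition column immediately after the load so the
-- loader has the chance to prune partitions before anything else runs.
A_day = FILTER A BY d == '20120415';

-- Inspect the plan for the filtered alias.
EXPLAIN A_day;
```

Whether the filter is actually pushed into the loader for a given script is
what this thread is trying to establish, so treat the sketch as a baseline to
compare against rather than a guarantee.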

Thanks,
Aniket

On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan <
rajesh.balamohan@gmail.com> wrote:

> Hi Alan,
>
> Thanks for the quick response.
>
> I am using HCatalog 0.4.
>
> With a simple Pig script it works great. HCatalog beautifully scans only the
> relevant data. However, a full scan happens only when we have a couple
> of additional joins and when we change the INNER JOIN order (we also use
> "using skewed").
>
> Though we have looked into the debug logs, we saw the number of records
> scanned in the JobTracker's counters themselves. Without pruning, the M/R
> job was scanning pretty much the entire set of rows.
>
> I am not sure whether there is a corner case in which the "skewed" join
> overrides the filtering.
>
> ~Rajesh.B
>
>
>
> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
>
>> What version of HCatalog are you using?  How do you know it is scanning
>> all the partitions, does it say so in the logs, or are you getting all the
>> records back?
>>
>> And yes, HCat is supposed to do partition pruning so that it only scans
>> the required partitions.
>>
>> Alan.
>>
>> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>> > Hi All,
>> >
>> > I have an HCatalog table "partitioned by (d string)".
>> >
>> > I have a couple of days' worth of data and when I run "show partitions" it
>> provides the correct data.
>> >
>> > d=20111215
>> > d=20111216
>> > d=20111217
>> > d=20111218
>> > d=20111219
>> > d=20111220
>> > d=20111221
>> > d=20111222
>> > d=20111223
>> > d=20111224
>> > d=20111225
>> > d=20120415
>> >
>> > However, when I run Pig with "filter a by d == '20120415'", it ends up
>> scanning all data.
>> >
>> > Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it
>> scan only the d=20120415 directory?
>> >
>> > Any pointers would be of great help.
>> >
>> >
>> > --
>> > ~Rajesh.B
>>
>>
>
>
> --
> ~Rajesh.B
>



-- 
"...:::Aniket:::... Quetzalco@tl"
