incubator-hcatalog-user mailing list archives

From Aniket Mokashi <aniket...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Wed, 25 Apr 2012 00:21:02 GMT
Hi Thejas,

Can you point me to the JIRA that fixes the filter-union problem (in Pig)? I
haven't tried hcat-0.4 yet, so it is good to know about that issue. I will add
myself as a watcher.

Thanks,
Aniket

On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <thejas@hortonworks.com> wrote:

> Hi Aniket,
> Are you using Pig 0.9 or 0.9.1?
> If yes, can you try with Pig 0.9.2?
> I am wondering whether you are also hitting the issue that Thomas mentioned.
>
> Thanks,
> Thejas
>
>
>
>
> On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>
>> Something similar I have noticed is -
>>
>> A = load ...
>> B1 = filter A by cond1;
>> B2 = filter A by cond2;
>> B3 = filter A by cond3;
>>
>> B = union B1, B2, B3;
>>
>> This does not push projection.
>>
>> Is that expected?
>>
>> Ideally, we should have a "strict" mode in HCatalog that, when turned
>> on, prevents Pig queries from running on the full (partitioned) table.
>>
>> Thanks,
>> Aniket
>>
>> On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan
>> <rajesh.balamohan@gmail.com> wrote:
>>
>>    Hi Alan,
>>
>>    Thanks for the quick response.
>>
>>    I am using HCatalog 0.4.
>>
>>    With a simple Pig script it works great. HCatalog beautifully scans
>>    only the relevant data. However, a full scan happens only when
>>    we have a couple of additional joins and when we change the INNER JOIN
>>    order (we also use "using skewed").
>>
>>    Although we have looked into the debug logs, we actually saw the
>>    number of records scanned in the JobTracker's counters. Without
>>    pruning, the M/R job was pretty much scanning the entire set of rows.
>>
>>    I am not sure whether there is a corner case in which the "skewed"
>>    join overrides the filtering.
>>
>>    ~Rajesh.B
>>
>>
>>
>>    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
>>
>>        What version of HCatalog are you using?  How do you know it is
>>        scanning all the partitions? Does it say so in the logs, or are
>>        you getting all the records back?
>>
>>        And yes, HCat is supposed to do partition pruning so that it
>>        only scans the required partitions.
>>
>>        Alan.
>>
>>        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>>         > Hi All,
>>         >
>>         > I have an HCatalog table "partitioned by (d string)".
>>         >
>>         > I have a couple of days' worth of data, and when I run "show
>>         > partitions" it provides the correct data.
>>         >
>>         > d=20111215
>>         > d=20111216
>>         > d=20111217
>>         > d=20111218
>>         > d=20111219
>>         > d=20111220
>>         > d=20111221
>>         > d=20111222
>>         > d=20111223
>>         > d=20111224
>>         > d=20111225
>>         > d=20120415
>>         >
>>         > However, when I run Pig with "filter a by d == '20120415'",
>>         > it ends up scanning all the data.
>>         >
>>         > Is this a known bug/enhancement in HCatalog? Ideally,
>>         > shouldn't it scan only the d=20120415 directory?
>>         >
>>         > Any pointers would be of great help.
>>         >
>>         >
>>         > --
>>         > ~Rajesh.B
>>
>>
>>
>>
>>    --
>>    ~Rajesh.B
>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"
