incubator-hcatalog-user mailing list archives

From Aniket Mokashi <aniket...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Wed, 25 Apr 2012 02:56:25 GMT
Sorry Thejas, I didn't read the JIRA properly earlier.
EMR pig-0.9.1 already has the patch for PIG-2339, so I did not hit
that issue earlier (and I patched datanucleus). The filter-union was a
workaround I was using to avoid some of the Thrift timeout problems.
Thrift API calls time out on the client side after 20 sec by default (I found
the config to change this later), so I used a = load 'table'; b1 =
filter a by cond1; b2 = filter a by cond2; ... b = union b1, b2, ...; expecting
these filters to be pushed separately to the loader. But that doesn't work in
Pig. (I can open a JIRA, but I haven't done enough investigation at the code
level.)
Thoughts?
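
For reference, the workaround described above can be sketched in Pig Latin roughly as follows. This is only an illustration: the table name, the two partition values standing in for cond1 and cond2, and the use of HCatLoader under its HCatalog 0.4 class name are assumptions, not details taken from the actual script.

```pig
-- Sketch only: 'table' and the two predicates are placeholders for the real
-- table name and conditions, which the thread does not give.
-- org.apache.hcatalog.pig.HCatLoader is the loader class name in HCatalog 0.4;
-- that this is the loader in use here is an assumption.
a  = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();

-- One filter per condition, each reading from the same load.
b1 = FILTER a BY d == '20111225';
b2 = FILTER a BY d == '20120415';

-- Union of the filtered branches. The hope was that each filter would be
-- pushed to the loader separately; as reported above, that does not happen.
b  = UNION b1, b2;
```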

Thanks,
Aniket

On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <thejas@hortonworks.com> wrote:

> The issue was not specific to filter-union:
> https://issues.apache.org/jira/browse/PIG-2339
> The fix was to run PushUpFilter before PartitionFilterOptimizer.
>
> As this is not an hcat issue, it should not matter if you have an older
> hcat version. FYI, this bug was not there in pig 0.8.x.
> Was it pig 0.9.0 or 0.9.1 that you used?
>
> Thanks,
> Thejas
>
>
>
> On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>
>> Hi Thejas,
>>
>> Can you point me to the jira that fixes the filter-union problem (in pig)? I
>> haven't tried hcat-0.4 yet; good to know about that issue. I will add myself
>> as a watcher.
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>>
>>    Hi Aniket,
>>    Are you using pig 0.9 or 0.9.1 ?
>>    If yes, can you try with pig 0.9.2 ?
>>    Wondering if you are also hitting the issue that Thomas mentioned .
>>
>>    Thanks,
>>    Thejas
>>
>>
>>
>>
>>    On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>>
>>        Something similar I have noticed is -
>>
>>        A = load ...
>>        B1 = filter A by cond1;
>>        B2 = filter A by cond2;
>>        B3 = filter A by cond3;
>>
>>        B = union B1, B2, B3; does not push projection.
>>
>>        Is that expected?
>>
>>        Ideally, we should have a "strict" mode under hcatalog that, when
>>        turned on, will avoid executing pig queries on the full
>>        (partitioned) table.
>>
>>        Thanks,
>>        Aniket
>>
>>        On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan
>>        <rajesh.balamohan@gmail.com> wrote:
>>
>>            Hi Alan,
>>
>>            Thanks for the quick response.
>>
>>            I am using HCatalog 0.4.
>>
>>            With a simple PIG script it works great. HCatalog beautifully
>>            scans only the relevant information. However, a full scan happens
>>            only when we have a couple of additional joins and when we change
>>            the INNER JOIN order (we also use "using skewed").
>>
>>            Though we have looked into the debug logs, we saw the number of
>>            records scanned from the JobTracker's counters itself. Without
>>            pruning, the m/r job was pretty much scanning the entire set of
>>            rows.
>>
>>            I am not sure if there is a corner case wherein the "skewed" join
>>            is trying to override the filtering.
>>
>>            ~Rajesh.B
>>
>>
>>
>>            On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates
>>            <gates@hortonworks.com> wrote:
>>
>>                What version of HCatalog are you using?  How do you know
>>                it is scanning all the partitions? Does it say so in the
>>                logs, or are you getting all the records back?
>>
>>                And yes, HCat is supposed to do partition pruning so that
>>                it only scans the required partitions.
>>
>>                Alan.
>>
>>                On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>
>>         > Hi All,
>>         >
>>         > I have a hcatalog table "partitioned by (d string)".
>>         >
>>         > I have a couple of days' worth of data, and when I run "show
>>         > partitions" it provides the correct data.
>>         >
>>         > d=20111215
>>         > d=20111216
>>         > d=20111217
>>         > d=20111218
>>         > d=20111219
>>         > d=20111220
>>         > d=20111221
>>         > d=20111222
>>         > d=20111223
>>         > d=20111224
>>         > d=20111225
>>         > d=20120415
>>         >
>>         > However, when I run PIG with "filter a by d == '20120415'",
>>         > it ends up scanning all data.
>>         >
>>         > Is this a known bug/enhancement in HCatalog? Ideally,
>>         > shouldn't it scan only the d=20120415 directory?
>>         >
>>         > Any pointers would be of great help.
>>         >
>>         >
>>         > --
>>         > ~Rajesh.B
>>
>>
>>
>>
>>            --
>>            ~Rajesh.B
>>
>>
>>
>>
>>        --
>>        "...:::Aniket:::... Quetzalco@tl"
>>
>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"
