incubator-hcatalog-user mailing list archives

From Aniket Mokashi <aniket...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Wed, 25 Apr 2012 21:50:34 GMT
Thanks Thejas!
https://issues.apache.org/jira/browse/PIG-2668

On Wed, Apr 25, 2012 at 2:04 PM, Thejas Nair <thejas@hortonworks.com> wrote:

> yes, please create one.
> Thanks,
> Thejas
>
>
> On 4/25/12 1:47 PM, Aniket Mokashi wrote:
>
>> Hi Dmitriy and Thejas,
>>
>> Should I open a jira for the same?
>>
>> Thanks,
>> Aniket
>>
>>
>> On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>>
>>    Yeah I think we just need to get projection pushdown to work through
>>    Split operators.
>>
>>    D
>>
>>    On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>>     > cc'ing dev@pig as this is a pig issue.
>>     >
>>     > Aniket, what you saw is not related to PIG-2339.
>>     >
>>     > In your example query, the logical plan will look like this -
>>     >
>>     > Load (A)
>>     > |
>>     > Split
>>     >  |
>>     > ---------------------------
>>     > |             |
>>     > Filter(B1)   Filter(B2) ...
>>     >
>>     > Because of the split operator introduced between the filter conditions
>>     > and load, the filter does not get pushed into the load function.
>>     >
>>     > A simple way to fix this in pig would be to not share the load across
>>     > the filter operators. Another option is to push the condition
>>     > (B1 or B2 or B3) into the Load operator and retain the rest of the
>>     > current plan (split and filters following the split).
>>     >
>>     > You can of course achieve the same effect by having a separate load
>>     > statement as input for each of the filters.
>>     >
>>     > I agree that we should make it possible to ask pig to throw a
>>     > warning/error if the query is going to result in a full table scan on
>>     > a partitioned table.
>>     >
>>     > Thanks,
>>     > Thejas
>>     >
>>     >
>>     >
>>     >
>>     > On 4/24/12 7:56 PM, Aniket Mokashi wrote:
>>     >>
>>     >> Sorry Thejas, I didn't look into the jira properly earlier.
>>     >> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
>>     >> hit that issue earlier (and I patched datanucleus). filter-union was a
>>     >> workaround I was using to avoid some of the thrift timeout problems
>>     >> earlier. Thrift APIs time out on the client side in 20 sec by default (I
>>     >> found the config to change this later), and hence I used
>>     >> a = load 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ... b = union b1, b2, ...;
>>     >> expecting these filters to be pushed separately to the loader. But that
>>     >> doesn't work in pig. (I can open a jira, but I haven't done enough
>>     >> investigation at the code level.) Thoughts?
>>     >>
>>     >> Thanks,
>>     >> Aniket
>>     >>
>>     >> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>>     >>
>>     >>    The issue was not specific to filter-union
>>     >>    - https://issues.apache.org/jira/browse/PIG-2339.
>>     >>    The fix was to apply PushUpFilter before PartitionFilterOptimizer.
>>     >>
>>     >>    As this is not an hcat issue, it should not matter if you have an
>>     >>    older hcat version. FYI, this bug was not there in pig 0.8.x.
>>     >>    Was it pig 0.9.0 or 0.9.1 that you used?
>>     >>
>>     >>    Thanks,
>>     >>    Thejas
>>     >>
>>     >>
>>     >>
>>     >>    On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>>     >>
>>     >>        Hi Thejas,
>>     >>
>>     >>        Can you point me to the jira that fixes the filter-union problem (in pig)?
>>     >>        I haven't tried hcat-0.4 yet; good to know about that issue. I will keep a
>>     >>        watcher.
>>     >>
>>     >>        Thanks,
>>     >>        Aniket
>>     >>
>>     >>        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>>     >>
>>     >>            Hi Aniket,
>>     >>            Are you using pig 0.9 or 0.9.1?
>>     >>            If yes, can you try with pig 0.9.2?
>>     >>            Wondering if you are also hitting the issue that Thomas mentioned.
>>     >>
>>     >>            Thanks,
>>     >>            Thejas
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>            On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>>     >>
>>     >>                Something similar I have noticed is -
>>     >>
>>     >>                A = load ...
>>     >>                B1 = filter A by cond1;
>>     >>                B2 = filter A by cond2;
>>     >>                B3 = filter A by cond3;
>>     >>
>>     >>                B = union B1, B2, B3; does not push projection.
>>     >>
>>     >>                Is that expected?
>>     >>
>>     >>                Ideally, we should have a "strict" mode under hcatalog that,
>>     >>                when turned on, will avoid executing pig queries on the full
>>     >>                (partitioned) table.
>>     >>
>>     >>                Thanks,
>>     >>                Aniket
>>     >>
>>     >>                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote:
>>     >>
>>     >>                    Hi Alan,
>>     >>
>>     >>                    Thanks for the quick response.
>>     >>
>>     >>                    I am using HCatalog 0.4.
>>     >>
>>     >>                    With a simple PIG script it works great. HCatalog beautifully
>>     >>                    scans only the relevant information. However, a full scan
>>     >>                    happens only when we have a couple of additional joins and
>>     >>                    when we change the INNER JOIN order (we also use "using skewed").
>>     >>
>>     >>                    Though we have looked into the debug logs, we saw the number of
>>     >>                    records scanned from the JobTracker's counters itself. Without
>>     >>                    pruning, the m/r job was pretty much scanning the entire set
>>     >>                    of rows.
>>     >>
>>     >>                    I am not sure if there is a corner case wherein the "skewed"
>>     >>                    join is trying to override the filtering.
>>     >>
>>     >>                    ~Rajesh.B
>>     >>
>>     >>
>>     >>
>>     >>                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
>>     >>
>>     >>                        What version of HCatalog are you using? How do you know
>>     >>                        it is scanning all the partitions: does it say so in
>>     >>                        the logs, or are you getting all the records back?
>>     >>
>>     >>                        And yes, HCat is supposed to do partition pruning so that
>>     >>                        it only scans the required partitions.
>>     >>
>>     >>                        Alan.
>>     >>
>>     >>                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>>     >>
>>     >> > Hi All,
>>     >> >
>>     >> > I have a hcatalog table "partitioned by (d string)".
>>     >> >
>>     >> > I have a couple of days' worth of data, and when I run "show
>>     >> > partitions" it provides the correct data.
>>     >> >
>>     >> > d=20111215
>>     >> > d=20111216
>>     >> > d=20111217
>>     >> > d=20111218
>>     >> > d=20111219
>>     >> > d=20111220
>>     >> > d=20111221
>>     >> > d=20111222
>>     >> > d=20111223
>>     >> > d=20111224
>>     >> > d=20111225
>>     >> > d=20120415
>>     >> >
>>     >> > However, when I run PIG with "filter a by d == '20120415'",
>>     >> > it ends up scanning all data.
>>     >> >
>>     >> > Is this a known bug/enhancement in HCatalog? Ideally,
>>     >> > shouldn't it scan only the d=20120415 directory?
>>     >> >
>>     >> > Any pointers would be of great help.
>>     >> >
>>     >> >
>>     >> > --
>>     >> > ~Rajesh.B
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>                    --
>>     >>                    ~Rajesh.B
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>                --
>>     >> "...:::Aniket:::... Quetzalco@tl"
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>        --
>>     >> "...:::Aniket:::... Quetzalco@tl"
>>     >>
>>     >>
>>     >>
>>     >>
>>     >>
>>     >> --
>>     >> "...:::Aniket:::... Quetzalco@tl"
>>     >
>>     >
>>
>>
>>
>>
>> --
>> "...:::Aniket:::... Quetzalco@tl"
>>
>
>


-- 
"...:::Aniket:::... Quetzalco@tl"

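The second option Thejas describes earlier in this thread (push the condition (B1 or B2 or B3) into the load, and keep the split and the per-branch filters) can be illustrated with a toy model. This is only a sketch in plain Python, not Pig's optimizer or the HCatalog API: `load`, `any_of`, `cond1` and `cond2` are hypothetical names made up for illustration, and the partition values are a subset of those listed in the original post.

```python
# Toy model only: this is NOT Pig's optimizer or HCatalog's loader API.
from typing import Callable, List

Predicate = Callable[[str], bool]

# A subset of the partition values (d=...) from the original post.
PARTITIONS = ["20111215", "20111216", "20111225", "20120415"]


def load(partitions: List[str], pushed: Predicate = lambda d: True) -> List[str]:
    """Return the partitions a loader would scan for a pushed-down predicate."""
    return [d for d in partitions if pushed(d)]


def any_of(preds: List[Predicate]) -> Predicate:
    """Combine the branch conditions into one disjunction (B1 or B2 or ...)."""
    return lambda d: any(p(d) for p in preds)


# Hypothetical branch conditions standing in for cond1 and cond2.
cond1: Predicate = lambda d: d == "20120415"
cond2: Predicate = lambda d: d == "20111225"

# Behaviour reported in the thread: nothing is pushed past the Split,
# so every partition is scanned.
scanned_without_pushdown = load(PARTITIONS)

# Proposed plan: push (cond1 or cond2) into the load, then keep the
# per-branch filters after the Split exactly as before.
scanned_with_pushdown = load(PARTITIONS, any_of([cond1, cond2]))
b1 = [d for d in scanned_with_pushdown if cond1(d)]
b2 = [d for d in scanned_with_pushdown if cond2(d)]
```

In this model the branch filters still run after the load, so each branch sees the same rows as before; only the set of partitions scanned shrinks.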