incubator-hcatalog-user mailing list archives

From Thejas Nair <the...@hortonworks.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Wed, 25 Apr 2012 21:04:48 GMT
Yes, please create one.
Thanks,
Thejas

On 4/25/12 1:47 PM, Aniket Mokashi wrote:
> Hi Dmitriy and Thejas,
>
> Should I open a jira for the same?
>
> Thanks,
> Aniket
>
>
> On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>
>     Yeah I think we just need to get projection pushdown to work through
>     Split operators.
>
>     D
>
>     On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>      > cc'ing dev@pig as this is a pig issue.
>      >
>      > Aniket, what you saw is not related to PIG-2339.
>      >
>      > In your example query, the logical plan will look like this -
>      >
>      > Load (A)
>      > |
>      > Split
>      >  |
>      > ---------------------------
>      > |             |
>      > Filter(B1)   Filter(B2) ...
>      >
>      > Because of the split operator introduced between the load and the
>      > filter conditions, the filters do not get pushed into the load function.
>      >
>      > A simple way to fix this in pig would be to not share the load across
>      > the filter operators. Another option is to push the condition (B1 or
>      > B2 or B3) into the Load operator and retain the rest of the current
>      > plan (split and filters following the split).
>      >
>      > You can of course achieve the same effect by having a separate load
>      > statement as input for each of the filters.
>      >
>      > I agree that we should make it possible to ask pig to throw a
>      > warning/error if the query is going to result in a full table scan on
>      > a partitioned table.
>      >
>      > Thanks,
>      > Thejas
>      >
>      >
>      >
>      >
>      > On 4/24/12 7:56 PM, Aniket Mokashi wrote:
>      >>
>      >> Sorry Thejas, I didn't look into the jira properly earlier.
>      >> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
>      >> hit that issue earlier (and I patched datanucleus). filter-union was a
>      >> workaround I was using to avoid some of the thrift timeout problems
>      >> earlier. Thrift APIs time out on the client side in 20 sec by default
>      >> (I found the config to change this later), and hence I used a = load
>      >> 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ... b = union
>      >> b1, b2, ...; expecting these filters to be pushed separately to the
>      >> loader. But that doesn't work in pig. (I can open a jira, but I haven't
>      >> done enough investigation at the code level.) Thoughts?
>      >>
>      >> Thanks,
>      >> Aniket
>      >>
>      >> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>      >>
>      >>    The issue was not specific to filter-union
>      >>    - https://issues.apache.org/jira/browse/PIG-2339
>      >>    The fix was to run PushUpFilter before PartitionFilterOptimizer.
>      >>
>      >>    As this is not an hcat issue, it should not matter if you have an
>      >>    older hcat version. FYI, this bug was not there in pig 0.8.x.
>      >>    Was it pig 0.9.0 or 0.9.1 that you used?
>      >>
>      >>    Thanks,
>      >>    Thejas
>      >>
>      >>
>      >>
>      >>    On 4/24/12 5:21 PM, Aniket Mokashi wrote:
>      >>
>      >>        Hi Thejas,
>      >>
>      >>        Can you point me to the jira that fixes the filter-union problem
>      >>        (in pig)? I haven't tried hcat-0.4 yet; good to know about that
>      >>        issue. I will keep a watch on it.
>      >>
>      >>        Thanks,
>      >>        Aniket
>      >>
>      >>        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <thejas@hortonworks.com> wrote:
>      >>
>      >>            Hi Aniket,
>      >>            Are you using pig 0.9 or 0.9.1?
>      >>            If yes, can you try with pig 0.9.2?
>      >>            Wondering if you are also hitting the issue that Thomas
>      >>            mentioned.
>      >>
>      >>            Thanks,
>      >>            Thejas
>      >>
>      >>
>      >>
>      >>
>      >>            On 4/23/12 7:39 PM, Aniket Mokashi wrote:
>      >>
>      >>                Something similar I have noticed is -
>      >>
>      >>                A = load ...
>      >>                B1 = filter A by cond1;
>      >>                B2 = filter A by cond2;
>      >>                B3 = filter A by cond3;
>      >>
>      >>                B = union B1, B2, B3; does not push projection.
>      >>
>      >>                Is that expected?
>      >>
>      >>                Ideally, we should have a "strict" mode under hcatalog
>      >>                that, when turned on, will avoid executing pig queries
>      >>                on the full (partitioned) table.
>      >>
>      >>                Thanks,
>      >>                Aniket
>      >>
>      >>                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote:
>      >>
>      >>                    Hi Alan,
>      >>
>      >>                    Thanks for the quick response.
>      >>
>      >>                    I am using HCatalog 0.4.
>      >>
>      >>                    With a simple PIG script it works great. HCatalog
>      >>                    beautifully scans only the relevant information.
>      >>                    However, a full scan happens only when we have a
>      >>                    couple of additional joins and when we change the
>      >>                    INNER JOIN order (we also use "using skewed").
>      >>
>      >>                    Though we have looked into the debug logs, we saw
>      >>                    the number of records scanned from the JobTracker's
>      >>                    counters itself. Without pruning, the m/r job was
>      >>                    pretty much scanning the entire set of rows.
>      >>
>      >>                    I am not sure if there is a corner case wherein the
>      >>                    "skewed" join is trying to override the filtering.
>      >>
>      >>                    ~Rajesh.B
>      >>
>      >>
>      >>
>      >>                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
>      >>
>      >>                        What version of HCatalog are you using? How do
>      >>                        you know it is scanning all the partitions? Does
>      >>                        it say so in the logs, or are you getting all the
>      >>                        records back?
>      >>
>      >>                        And yes, HCat is supposed to do partition pruning
>      >>                        so that it only scans the required partitions.
>      >>
>      >>                        Alan.
>      >>
>      >>                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>      >>
>      >> > Hi All,
>      >> >
>      >> > I have a hcatalog table "partitioned by (d string)".
>      >> >
>      >> > I have a couple of days' worth of data, and when I run "show
>      >> > partitions" it provides the correct data.
>      >> >
>      >> > d=20111215
>      >> > d=20111216
>      >> > d=20111217
>      >> > d=20111218
>      >> > d=20111219
>      >> > d=20111220
>      >> > d=20111221
>      >> > d=20111222
>      >> > d=20111223
>      >> > d=20111224
>      >> > d=20111225
>      >> > d=20120415
>      >> >
>      >> > However, when I run PIG with "filter a by d == '20120415'",
>      >>                        it ends up scanning all data.
>      >> >
>      >> > Is this a known bug/enhancement in HCatalog? Ideally,
>      >> > shouldn't it scan only the d=20120415 directory?
>      >> >
>      >> > Any pointers would be of great help.
>      >> >
>      >> >
>      >> > --
>      >> > ~Rajesh.B
>      >>
>      >>
>      >>
>      >>
>      >>                    --
>      >>                    ~Rajesh.B
>      >>
>      >>
>      >>
>      >>
>      >>                --
>      >> "...:::Aniket:::... Quetzalco@tl"
>      >>
>      >>
>      >>
>      >>
>      >>
>      >>        --
>      >> "...:::Aniket:::... Quetzalco@tl"
>      >>
>      >>
>      >>
>      >>
>      >>
>      >> --
>      >> "...:::Aniket:::... Quetzalco@tl"
>      >
>      >
>
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"

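The workaround Thejas mentions in this thread (a separate load statement as input for each filter, so that no Split operator sits between the load and the filter) might look roughly like the Pig Latin below. This is only a sketch under assumptions: the table name 'mytable' and the partition values are hypothetical, and the loader class path is the one shipped with HCatalog 0.4. Whether the partition filters actually reach the loader depends on the Pig version in use.

```pig
-- Hypothetical table name; each filter gets its own load, so the
-- logical plan has no Split between Load and Filter.
A1 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B1 = FILTER A1 BY d == '20111215';   -- assumed partition value

A2 = LOAD 'mytable' USING org.apache.hcatalog.pig.HCatLoader();
B2 = FILTER A2 BY d == '20120415';   -- assumed partition value

-- Combine the separately filtered inputs.
B = UNION B1, B2;
```

Running EXPLAIN on B and checking the job's input record counters is one way to see whether only the intended partitions are being read.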
