incubator-hcatalog-user mailing list archives

From Rajesh Balamohan <rajesh.balamo...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 23:39:03 GMT
Thanks a lot.
I was using Pig version 0.9.

I tried with 0.9.3 and it's able to scan only the relevant data.

Thanks again.
On Apr 25, 2012 4:30 AM, "Thomas Weise" <thw@yahoo-inc.com> wrote:

> There was a defect in Pig, fixed in 0.9.2, that would cause the
> partition filter not to be available to HCat:
>
> https://issues.apache.org/jira/browse/PIG-2339
>
> Which Pig version are you using?
>
> Thomas
>
>
> On 4/23/12 8:45 PM, "Alan Gates" <gates@hortonworks.com> wrote:
>
> > If possible could you share a Pig Latin script that scans only the proper
> > partitions and one that scans everything?  That would help us see what
> > the issue is.
> >
> > Alan.
> >
> > On Apr 23, 2012, at 7:32 PM, Rajesh Balamohan wrote:
> >
> >> Hi Alan,
> >>
> >> Thanks for the quick response.
> >>
> >> I am using HCatalog 0.4.
> >>
> >> With a simple Pig script it works great. HCatalog beautifully scans only
> >> the relevant information. However, a full scan happens only when we have
> >> a couple of additional joins and when we change the INNER JOIN order (we
> >> also use "using skewed").
> >>
> >> Though we have looked into the debug logs, we saw the number of records
> >> scanned from the JobTracker's counters. Without pruning, the m/r job
> >> was pretty much scanning the entire set of rows.
> >>
> >> I am not sure if there is a corner case wherein the "skewed" join is
> >> overriding the filtering.
> >>
> >> ~Rajesh.B
> >>
> >>
> >>
> >> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
> >> What version of HCatalog are you using?  How do you know it is scanning
> >> all the partitions: does it say so in the logs, or are you getting all
> >> the records back?
> >>
> >> And yes, HCat is supposed to do partition pruning so that it only scans
> >> the required partitions.
> >>
> >> Alan.
> >>
> >> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
> >>
> >>> Hi All,
> >>>
> >>> I have an HCatalog table "partitioned by (d string)".
> >>>
> >>> I have a couple of days' worth of data, and when I run "show partitions"
> >>> it provides the correct data.
> >>>
> >>> d=20111215
> >>> d=20111216
> >>> d=20111217
> >>> d=20111218
> >>> d=20111219
> >>> d=20111220
> >>> d=20111221
> >>> d=20111222
> >>> d=20111223
> >>> d=20111224
> >>> d=20111225
> >>> d=20120415
> >>>
> >>> However, when I run Pig with "filter a by d == '20120415'", it ends up
> >>> scanning all data.
> >>>
> >>> Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it scan
> >>> only the d=20120415 directory?
> >>>
> >>> Any pointers would be of great help.
> >>>
> >>>
> >>> --
> >>> ~Rajesh.B
> >>
> >>
> >>
> >>
> >> --
> >> ~Rajesh.B
> >
>
>
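
For reference, the pattern discussed in this thread can be sketched in Pig
Latin as below. This is a minimal sketch, not the poster's actual script: the
table name web_logs is hypothetical, and the loader class name assumes the
org.apache.hcatalog.pig package used by HCatalog 0.4. The partition column d
and the value '20120415' are taken from the thread.

```pig
-- Hypothetical table name; the thread does not give the real one.
a = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();

-- Filter on the partition column d directly after the load, so that Pig
-- (0.9.2 or later, per PIG-2339) can hand the partition filter to HCatLoader.
b = FILTER a BY d == '20120415';

DUMP b;
```

Comparing the JobTracker record counters for this script against an unfiltered
load is one way to check whether only the d=20120415 partition is read, which
is how the full scan was noticed in the thread.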
