incubator-hcatalog-user mailing list archives

From Rajesh Balamohan <rajesh.balamo...@gmail.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 02:32:05 GMT
Hi Alan,

Thanks for the quick response.

I am using HCatalog 0.4.

With a simple Pig script it works great: HCatalog scans only the relevant
partitions. However, a full scan happens when we add a couple of additional
joins and change the INNER JOIN order (we also use "using skewed").
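
A minimal sketch of the two cases, for illustration only. The table names
(events, users), the join key (user_id), and the relation names are
hypothetical; the actual script is not shown in this thread. The loader class
name assumes the HCatalog 0.4 package layout.

```pig
-- Case 1: simple script. The partition filter directly follows the load,
-- and only the d=20120415 partition is expected to be read.
a = LOAD 'events' USING org.apache.hcatalog.pig.HCatLoader();
b = FILTER a BY d == '20120415';

-- Case 2: the same filtered relation feeding a skewed inner join.
-- 'users' and 'user_id' are hypothetical names.
u = LOAD 'users' USING org.apache.hcatalog.pig.HCatLoader();
j = JOIN b BY user_id, u BY user_id USING 'skewed';
```

Comparing the input record counters of the two jobs would show whether the
filter on d is still applied at load time in the second case.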

Although we have looked into the debug logs, we actually saw the number of
records scanned in the JobTracker's counters. Without pruning, the M/R job
was scanning pretty much the entire set of rows.

I am not sure whether there is a corner case in which the "skewed" join
overrides the filtering.

~Rajesh.B



On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:

> What version of HCatalog are you using?  How do you know it is scanning
> all the partitions, does it say so in the logs, or are you getting all the
> records back?
>
> And yes, HCat is supposed to do partition pruning so that it only scans
> the required partitions.
>
> Alan.
>
> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>
> > Hi All,
> >
> > I have a hcatalog table "partitioned by (d string)".
> >
> > I have a couple of days' worth of data, and when I run "show partitions" it
> > provides the correct data.
> >
> > d=20111215
> > d=20111216
> > d=20111217
> > d=20111218
> > d=20111219
> > d=20111220
> > d=20111221
> > d=20111222
> > d=20111223
> > d=20111224
> > d=20111225
> > d=20120415
> >
> > However, when I run Pig with "filter a by d == '20120415'", it ends up
> > scanning all the data.
> >
> > Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it scan
> > only the d=20120415 directory?
> >
> > Any pointers would be of great help.
> >
> >
> > --
> > ~Rajesh.B
>
>


-- 
~Rajesh.B
