incubator-hcatalog-user mailing list archives

From Alan Gates <ga...@hortonworks.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 03:45:21 GMT
If possible, could you share a Pig Latin script that scans only the proper partitions and one
that scans everything?  That would help us see what the issue is.

Alan.

On Apr 23, 2012, at 7:32 PM, Rajesh Balamohan wrote:

> Hi Alan,
> 
> Thanks for the quick response.
> 
> I am using HCatalog 0.4.
> 
> With a simple Pig script it works great. HCatalog scans only the relevant data. However, a full scan happens only when we have a couple of additional joins and when we change the INNER JOIN order (we also use "using skewed").
> 
> Although we have looked into the debug logs, we determined the number of records scanned from the JobTracker's counters. Without pruning, the M/R job was scanning pretty much the entire set of rows.
> 
> I am not sure if there is a corner case in which the "skewed" join overrides the filtering.
> 
> ~Rajesh.B
> 
> 
> 
> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
> What version of HCatalog are you using?  How do you know it is scanning all the partitions? Does it say so in the logs, or are you getting all the records back?
> 
> And yes, HCat is supposed to do partition pruning so that it only scans the required partitions.
> 
> Alan.
> 
> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
> 
> > Hi All,
> >
> > I have an HCatalog table "partitioned by (d string)".
> >
> > I have a couple of days' worth of data, and when I run "show partitions" it provides the correct data.
> >
> > d=20111215
> > d=20111216
> > d=20111217
> > d=20111218
> > d=20111219
> > d=20111220
> > d=20111221
> > d=20111222
> > d=20111223
> > d=20111224
> > d=20111225
> > d=20120415
> >
> > However, when I run Pig with "filter a by d == '20120415'", it ends up scanning all the data.
> >
> > Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it scan only the d=20120415 directory?
> >
> > Any pointers would be of great help.
> >
> >
> > --
> > ~Rajesh.B
> 
> 
> 
> 
> -- 
> ~Rajesh.B
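
The expected behavior discussed in this thread can be illustrated with a small sketch. It is not HCatalog's or Pig's implementation, only a toy model of partition pruning under the assumption that each partition is a directory named `d=<value>`. The function name `partition_dirs_to_scan` and the `PARTITIONS` list are hypothetical and introduced here for illustration. The partition values are the ones listed in the original question.

```python
# Toy model of partition pruning (NOT HCatalog's actual code).
# Assumption: each partition is identified by a "key=value" directory name.

def partition_dirs_to_scan(partitions, key, wanted):
    """Return only the partition specs whose key equals `wanted`."""
    selected = []
    for spec in partitions:
        # Split "d=20120415" into ("d", "=", "20120415").
        name, _, value = spec.partition("=")
        if name == key and value == wanted:
            selected.append(spec)
    return selected


# Partitions reported by "show partitions" in the original question.
PARTITIONS = [
    "d=20111215", "d=20111216", "d=20111217", "d=20111218",
    "d=20111219", "d=20111220", "d=20111221", "d=20111222",
    "d=20111223", "d=20111224", "d=20111225", "d=20120415",
]

# Equivalent in spirit to: filter a by d == '20120415'
pruned = partition_dirs_to_scan(PARTITIONS, "d", "20120415")
print(pruned)
```

With pruning, only the single matching directory would be read. The behavior reported in the thread corresponds to all twelve partitions being read instead.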

