incubator-hcatalog-user mailing list archives

From Thomas Weise <...@yahoo-inc.com>
Subject Re: HCatalog scans all partition even after mentioning date filter
Date Tue, 24 Apr 2012 22:59:46 GMT
A defect in Pig, fixed in 0.9.2, could prevent the partition filter from
being made available to HCat:

https://issues.apache.org/jira/browse/PIG-2339

Which Pig version are you using?

Thomas


On 4/23/12 8:45 PM, "Alan Gates" <gates@hortonworks.com> wrote:

> If possible could you share a pig latin script that scans only the proper
> partitions and one that scans everything?  That would help us see what the
> issue is.
> 
> Alan.
> 
> On Apr 23, 2012, at 7:32 PM, Rajesh Balamohan wrote:
> 
>> Hi Alan,
>> 
>> Thanks for the quick response.
>> 
>> I am using HCatalog 0.4.
>> 
>> With a simple Pig script it works great: HCatalog scans only the
>> relevant data. However, a full scan happens only when we have a couple of
>> additional joins and when we change the INNER JOIN order (we also use "using
>> skewed").
>> 
>> We did look into the debug logs, but we saw the number of records scanned
>> in the JobTracker's counters. Without pruning, the M/R job was scanning
>> pretty much the entire set of rows.
>> 
>> I am not sure if there is a corner case in which the "skewed" join
>> overrides the filtering.
>> 
>> ~Rajesh.B
>> 
>> 
>> 
>> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <gates@hortonworks.com> wrote:
>> What version of HCatalog are you using?  How do you know it is scanning all
>> the partitions?  Does it say so in the logs, or are you getting all the
>> records back?
>> 
>> And yes, HCat is supposed to do partition pruning so that it only scans the
>> required partitions.
>> 
>> Alan.
>> 
>> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
>> 
>>> Hi All,
>>> 
>>> I have an HCatalog table "partitioned by (d string)".
>>> 
>>> I have a couple of days' worth of data, and when I run "show partitions" it
>>> provides the correct data.
>>> 
>>> d=20111215
>>> d=20111216
>>> d=20111217
>>> d=20111218
>>> d=20111219
>>> d=20111220
>>> d=20111221
>>> d=20111222
>>> d=20111223
>>> d=20111224
>>> d=20111225
>>> d=20120415
>>> 
>>> However, when I run PIG with "filter a by d == '20120415'", it ends up
>>> scanning all data.
>>> 
>>> Is this a known bug/enhancement in HCatalog? Ideally, shouldn't it scan
>>> only the d=20120415 directory?
>>> 
>>> Any pointers would be of great help.
>>> 
>>> 
>>> --
>>> ~Rajesh.B
>> 
>> 
>> 
>> 
>> -- 
>> ~Rajesh.B
> 
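
For readers unfamiliar with the term, the partition pruning discussed in this
thread can be illustrated with a small conceptual sketch. It is plain Python,
not HCatalog or Pig code: the function name prune_partitions and the in-memory
list are assumptions made for illustration, and only the partition values come
from the "show partitions" output quoted above.

```python
# Conceptual sketch only: this is NOT HCatalog or Pig code.
# prune_partitions is a hypothetical helper used for illustration.

def prune_partitions(partitions, key, value):
    """Return only the partition specs whose `key` equals `value`."""
    wanted = f"{key}={value}"
    return [p for p in partitions if p == wanted]


# Partition list as reported by "show partitions" in the thread:
# d=20111215 through d=20111225, plus d=20120415.
partitions = [f"d=201112{day}" for day in range(15, 26)] + ["d=20120415"]

# With pruning, a filter such as d == '20120415' should leave a single
# partition to read instead of all twelve.
selected = prune_partitions(partitions, "d", "20120415")
print(selected)
```

When the filter reaches the loader, only the selected partition needs to be
read; when it does not (as with the Pig defect mentioned at the top of this
message), every partition is scanned and the filter is applied afterwards.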

