If I remember correctly, upgrading to Pig 0.9.3 fixes this. Or it's fixed in HCatalog 0.4.1; I can't remember which. Try the Pig upgrade first, since HCatalog 0.4.1 isn't out yet.

On May 15, 2012 10:53 PM, "Rajesh Balamohan" <rajesh.balamohan@gmail.com> wrote:
Hi All,

I am currently using the versions listed below. In certain scenarios the filter condition is not applied, and the job ends up scanning the entire data set. A sample script is given below.


Pig 0.9.0
HCatalog 0.4.0
Hadoop 0.20.20x

dim_referrer = LOAD 'tableA' USING org.apache.hcatalog.pig.HCatLoader();
source_data = LOAD 'tableB' USING org.apache.hcatalog.pig.HCatLoader();
source_data_new = FILTER source_data BY d == '20120415';
joined_data_referrer = JOIN source_data_new BY referrer LEFT OUTER, dim_referrer BY referrer_url USING 'skewed';
DUMP joined_data_referrer;

In this case, all records are scanned and the filtering is not applied by HCatalog.

Shouldn't it apply the filter first and then run the sampling M/R job required for the "skewed" join?

Is this a known issue? Any pointers would be of great help.



--
~Rajesh.B