incubator-hcatalog-user mailing list archives

From Alan Gates <ga...@hortonworks.com>
Subject Re: Weird behavior with partition filter push-down not working in Pig 0.10 when producing more than one relation with different filters
Date Sat, 26 Jan 2013 19:56:05 GMT
Currently Pig only looks for a partition filter in the statement immediately following
the load.  So rearranging your script to be:

all_rows = load 'some_hive_table' using org.apache.hcatalog.pig.HCatLoader();
filtered_rows = filter all_rows by datetime_partition >= '$ROWS_30_DAYS_AGO';
next_op = foreach filtered_rows generate ...;

should do what you want.  I suspect Pig is just getting confused by the fact that the filter
is embedded in the foreach.

Alan.
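
[Editor's sketch, not part of Alan's reply.] The same rearrangement could be applied to the second relation from the original question. This is untested and rests on an assumption: the thread does not establish whether push-down still happens when one load feeds two filters, so each filter here gets its own load and sits in the statement directly after it. The alias names all_rows_30d, all_rows_1d, rows_30_days, rows_1_day, next_op_30d and next_op_1d are hypothetical, and the "generate ...;" bodies are elided as in the original messages. Whether a second load avoids the "Could not resolve" error reported below is also not established here.

```pig
-- First relation: load, then filter in the very next statement.
all_rows_30d = load 'some_hive_table' using org.apache.hcatalog.pig.HCatLoader();
rows_30_days = filter all_rows_30d by datetime_partition >= '$ROWS_30_DAYS_AGO';
next_op_30d = foreach rows_30_days generate ...;

-- Second relation: a separate load (assumption), again followed directly by its filter.
all_rows_1d = load 'some_hive_table' using org.apache.hcatalog.pig.HCatLoader();
rows_1_day = filter all_rows_1d by datetime_partition >= '$ROWS_1_DAYS_AGO';
next_op_1d = foreach rows_1_day generate ...;
```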

On Jan 26, 2013, at 11:48 AM, Timothy Potter wrote:

> I have a Pig script that loads data from HCatalog. I filter immediately
> after the load and my filter includes criteria on my partitions. The
> push down works as expected in this scenario. Here's an example of the
> Pig code:
> 
> all_rows = load 'some_hive_table' using org.apache.hcatalog.pig.HCatLoader();
> filtered_rows = foreach (filter all_rows by (datetime_partition >=
> '$ROWS_30_DAYS_AGO'))
>   generate ...;
> 
> In this case, my partition field is datetime_partition.
> 
> 
> However, I also need another set of rows from my "some_hive_table"
> (actual name obfuscated) later in the script such as:
> 
> filtered_rows = foreach (filter all_rows by (datetime_partition >=
> '$ROWS_1_DAYS_AGO'))
>   generate ...;
> 
> What I'm finding is that Pig ends up doing a full-table scan across
> all partitions, i.e. the push-down doesn't occur.
> 
> I tried changing the second filter to re-load the table but that gave
> some weird error "Could not resolve org.apache.hcatalog.pig.HCatLoader
> using imports: [com.dachisgroup.analytics.pig.storage., ,
> org.apache.pig.builtin., org.apache.pig.impl.builtin.]" ... Here's the
> code that produced the error:
> 
> all_rows_2ndpass = load 'some_hive_table' using
> org.apache.hcatalog.pig.HCatLoader();
> filtered_rows = foreach (filter all_rows_2ndpass by
> (datetime_partition >= '$ROWS_1_DAYS_AGO'))
>   generate ...;
> 
> Is this expected behavior with Pig 0.10? I suppose I could split up
> the script into two parts, but that's not ideal.
> 
> Cheers,
> Tim

