incubator-hcatalog-user mailing list archives

From Timothy Potter <thelabd...@gmail.com>
Subject Re: partition filter on condition determined by UDF
Date Wed, 21 Nov 2012 00:01:37 GMT
Ok, update on this - I have a question out on the Pig mailing list, but
after looking at the PColFilterExtractor code, I can see that this approach
is problematic, so I'll resort to calculating the values in an external
script before I call the Pig code.
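
A minimal sketch of that external-script workaround, for illustration only.
It assumes the partition values are UTC dates formatted as yyyy-MM-dd (the
real format produced by TimebucketToDatePartition is not shown in this
thread), and the parameter names start_partition and end_partition are
hypothetical. The Pig script would then compare datetime_partition against
the literals '$start_partition' and '$end_partition', which leaves only
constants in the filter.

```python
from datetime import datetime, timezone

MILLIS_PER_DAY = 86400000


def timebucket_to_date_partition(timebucket_ms, fmt="%Y-%m-%d"):
    """Convert an epoch-millisecond timebucket to a partition string.

    The yyyy-MM-dd format is an assumption; substitute whatever the
    table's datetime_partition values actually look like.
    """
    moment = datetime.fromtimestamp(timebucket_ms / 1000, tz=timezone.utc)
    return moment.strftime(fmt)


def pig_command(script, end_timebucket_ms, num_days):
    """Build a pig invocation that passes precomputed partition bounds.

    start_partition and end_partition are hypothetical parameter names
    that the Pig script would reference as $start_partition and
    $end_partition.
    """
    start = timebucket_to_date_partition(
        end_timebucket_ms - MILLIS_PER_DAY * num_days)
    end = timebucket_to_date_partition(end_timebucket_ms)
    return [
        "pig",
        "-param", f"start_partition={start}",
        "-param", f"end_partition={end}",
        script,
    ]


if __name__ == "__main__":
    # recent_signals.pig is a placeholder script name.
    print(" ".join(pig_command("recent_signals.pig", 1351785600000, 2)))
```

The returned list can be handed to subprocess.run as-is; the values are
computed once, outside Pig, so no UDF call remains in the filter.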

Cheers,
Tim

On Tue, Nov 20, 2012 at 1:27 PM, Timothy Potter <thelabdude@gmail.com> wrote:

> It doesn't seem like I'm able to call a UDF to determine the value of my
> partition filter condition. For example, I'd like to do this within a Pig
> MACRO:
>
> DEFINE load_recent_signals(num_days, end_timebucket) returns RECENT_SIGNALS {
>   signals = load 'signals' using org.apache.hcatalog.pig.HCatLoader();
>   $RECENT_SIGNALS = foreach (
>     filter signals by (
>       datetime_partition >= TimebucketToDatePartition($end_timebucket - (86400000L*$num_days)) AND
>       datetime_partition <= TimebucketToDatePartition($end_timebucket) AND
>       relationship_id IS NOT NULL
>     )) {
>       generate ...;
>   };
> };
>
> The TimebucketToDatePartition is a UDF that determines the partition value
> (a STRING) based on a timestamp (LONG).
>
> When I run this, I get the error that the filter couldn't be "pushed" into
> the load, which makes partitioning worthless. I have big data so
> partitioning is VERY important.
>
> I also tried evaluating the UDFs in the call to the MACRO, but the Pig
> grammar is limited enough that it doesn't recognize UDF calls as
> parameter values, i.e.
>
> signals_in = load_recent_signals(TimebucketToDatePartition(1351612800000L),
>     TimebucketToDatePartition(1351785600000L));
>
> This results in error: ERROR 1200: <line 5, column 58>  mismatched input
> '(' expecting RIGHT_PAREN
>
> So I'm at a loss as to what I can do here. Evaluating a UDF for a
> partition filter seems like a sensible thing to do with HCatalog and Pig.
>
> I'm willing to crack open the code and fix this if someone can provide
> some advice on how to go about this issue, i.e. should I try to fix the Pig
> grammar to allow UDFs to be called when evaluating MACRO parameters or try
> to fix the HCatalog side to allow me to call a UDF to determine filter
> conditions.
>
> <rant>So far, I've had nothing but trouble with HCatalog and filtering by
> partition keys in Pig. Isn't this one of the primary use cases of
> HCatalog?</rant>
>
