hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: UDFs and types
Date Wed, 02 Jul 2008 20:55:42 GMT
Sounds good to me.

Olga 

> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Wednesday, July 02, 2008 1:44 PM
> To: pig-dev@incubator.apache.org
> Subject: UDFs and types
> 
> With the introduction of types (see
> http://issues.apache.org/jira/browse/PIG-157) we need to 
> decide how EvalFunc will interact with the types.  The 
> original proposal was that the DEFINE keyword would be 
> modified to allow specification of types for the UDF.  This 
> has a couple of problems.  One, DEFINE is already used to 
> specify constructor arguments.  Using it to also specify 
> types will be confusing.  Two, it has been pointed out that 
> this type information is a property of the UDF and should 
> therefore be declared by the UDF, not in the script.
> 
> Separately, as a way to allow simple function overloading, a 
> change had been proposed to the EvalFunc interface to allow 
> an EvalFunc to specify that for a given type, a different 
> instance of EvalFunc should be used (see 
> https://issues.apache.org/jira/browse/PIG-276).
> 
> I would like to propose that we expand the changes in PIG-276 
> to be more general.  Rather than adding classForType() as 
> proposed in PIG-276, EvalFunc will instead add a function:
> 
> public Map<Schema, FuncSpec> getArgToFuncMapping() {
>     return null;
> }
> 
> Where FuncSpec is a new class that contains the name of the 
> class that implements the UDF along with any necessary 
> arguments for the constructor.
> 
> The type checker will then, as part of type checking 
> LOUserFunc, make a call to this function.  If it receives a 
> null, it will simply leave the UDF as is, and make the 
> assumption that the UDF can handle whatever datatype is being 
> provided to it.  This will cover most existing UDFs, which 
> will not override the default implementation.
> 
> If a UDF wants to override the default, it should return a 
> map that gives a FuncSpec for each type of schema that it can 
> support.  For example, for the UDF concat, the map would have 
> two entries:
> key: schema(chararray, chararray) value: StringConcat
> key: schema(bytearray, bytearray) value: ByteConcat
> 
> The type checker will then take the schema of what is being 
> passed to it and perform a lookup in the map.  If it finds an 
> entry, it will use the associated FuncSpec.  If it does not, 
> it will throw an exception saying that the EvalFunc cannot 
> be used with those types.
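
The lookup described above could be sketched roughly as follows. This is only an illustration of the proposal, not Pig's actual code: the FuncSpec and EvalFunc classes here are simplified stand-ins, a schema is modeled as a plain list of type names instead of a real Schema object, and resolve() is a hypothetical helper playing the role of the type checker.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are Pig's real classes.
class UdfMappingSketch {

    /** Stand-in for FuncSpec: implementing class name plus constructor arguments. */
    static final class FuncSpec {
        final String className;
        final List<String> ctorArgs;

        FuncSpec(String className, String... ctorArgs) {
            this.className = className;
            this.ctorArgs = Arrays.asList(ctorArgs);
        }
    }

    /** Stand-in for EvalFunc; the default mapping is null, as in the proposal. */
    static class EvalFunc {
        public Map<List<String>, FuncSpec> getArgToFuncMapping() {
            return null;
        }
    }

    /** The concat example: one entry per supported argument schema. */
    static class Concat extends EvalFunc {
        @Override
        public Map<List<String>, FuncSpec> getArgToFuncMapping() {
            Map<List<String>, FuncSpec> mapping = new LinkedHashMap<>();
            mapping.put(Arrays.asList("chararray", "chararray"), new FuncSpec("StringConcat"));
            mapping.put(Arrays.asList("bytearray", "bytearray"), new FuncSpec("ByteConcat"));
            return mapping;
        }
    }

    /**
     * Type-checker step: returns null when the UDF declares no mapping
     * (leave the UDF as is), the matching FuncSpec on an exact match,
     * and throws when the UDF declares a mapping that lacks this schema.
     */
    static FuncSpec resolve(EvalFunc udf, List<String> argSchema) {
        Map<List<String>, FuncSpec> mapping = udf.getArgToFuncMapping();
        if (mapping == null) {
            return null;
        }
        FuncSpec spec = mapping.get(argSchema);
        if (spec == null) {
            throw new IllegalArgumentException(
                udf.getClass().getSimpleName() + " cannot be used with schema " + argSchema);
        }
        return spec;
    }
}
```

With these stand-ins, resolving Concat against (chararray, chararray) yields the StringConcat spec, while (int, int) raises the exception, since no best-fit search is attempted.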
> 
> At this point, the type checker will make no effort to find a 
> best fit function.  Either the fit is perfect, or it will not 
> be done.  In the future we would like to modify the type 
> checker to select a best fit.  For example, if a UDF says it 
> can handle schema(long) and the type checker finds it has 
> schema(int), it can insert a cast to deal with that.  But in 
> the first pass we will ignore this and depend on the user to 
> insert the casts.
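
For the future best-fit step, one possible shape is sketched below. Everything in it is an assumption made for illustration: the message only gives the int-to-long example, so the widening order (int, long, float, double), the list-of-type-names schema model, and the bestFit() helper are all hypothetical, and the sketch simply returns the first declared schema that fits without ranking candidates.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of a best-fit search; not part of the proposal's first pass.
class BestFitSketch {

    // Assumed widening order: a type may be cast to any type later in the list.
    private static final List<String> NUMERIC_ORDER =
        Arrays.asList("int", "long", "float", "double");

    /** True if 'from' equals 'to' or can be widened to it under the assumed order. */
    static boolean castable(String from, String to) {
        if (from.equals(to)) {
            return true;
        }
        int f = NUMERIC_ORDER.indexOf(from);
        int t = NUMERIC_ORDER.indexOf(to);
        return f >= 0 && t > f;
    }

    /**
     * Returns the actual schema on an exact match, otherwise the first declared
     * schema that every argument can be cast to, or null if none fits.
     */
    static List<String> bestFit(List<String> actual, Collection<List<String>> declared) {
        if (declared.contains(actual)) {
            return actual;
        }
        for (List<String> candidate : declared) {
            if (candidate.size() != actual.size()) {
                continue;
            }
            boolean fits = true;
            for (int i = 0; i < actual.size(); i++) {
                if (!castable(actual.get(i), candidate.get(i))) {
                    fits = false;
                    break;
                }
            }
            if (fits) {
                return candidate;
            }
        }
        return null;
    }
}
```

Under these assumptions, an input of schema(int) against a UDF declaring schema(long) resolves to schema(long), which is where the type checker would insert the cast; schema(chararray) against the same UDF finds no fit.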
> 
> Thoughts?
> 
> Alan.
> 
