hadoop-pig-dev mailing list archives

From: Alan Gates <ga...@yahoo-inc.com>
Subject: UDFs and types
Date: Wed, 02 Jul 2008 20:43:32 GMT
With the introduction of types (see 
http://issues.apache.org/jira/browse/PIG-157), we need to decide how 
EvalFunc will interact with types.  The original proposal was that 
the DEFINE keyword would be modified to allow specification of types for 
the UDF.  This has a couple of problems.  One, DEFINE is already used to 
specify constructor arguments.  Using it to also specify types will be 
confusing.  Two, it has been pointed out that this type information is a 
property of the UDF and should therefore be declared by the UDF, not in 
the script.

Separately, as a way to support simple function overloading, a change to 
the EvalFunc interface had been proposed that would let an EvalFunc 
specify that, for a given type, a different EvalFunc implementation 
should be used (see https://issues.apache.org/jira/browse/PIG-276).

I would like to propose that we generalize the change in PIG-276.  
Rather than adding classForType() as proposed there, EvalFunc will 
instead gain the following method:

// Default implementation: returning null tells the type checker that
// this UDF accepts whatever input types it is given.
public Map<Schema, FuncSpec> getArgToFuncMapping() {
    return null;
}

Here FuncSpec is a new class that holds the name of the class 
implementing the UDF, along with any arguments needed by its constructor.
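
For concreteness, here is a rough sketch of what FuncSpec might look 
like.  The class does not exist yet, so the fields and accessors below 
are just the obvious shape, not a settled design:

// Sketch only; FuncSpec does not exist yet, so these names are guesses.
public class FuncSpec {
    private final String className;   // class that implements the UDF
    private final String[] ctorArgs;  // constructor arguments, possibly empty

    public FuncSpec(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }

    public String getClassName() { return className; }
    public String[] getCtorArgs() { return ctorArgs; }
}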

As part of type checking an LOUserFunc, the type checker will call this 
method.  If it receives null, it will leave the UDF as is and assume 
that the UDF can handle whatever datatypes it is given.  This covers 
most existing UDFs, which will not override the default implementation.

If a UDF wants to override the default, it should return a map that 
gives a FuncSpec for each input schema it can support.  For example, 
for a concat UDF the map would have two entries:
key: schema(chararray, chararray)  value: StringConcat
key: schema(bytearray, bytearray)  value: ByteConcat
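
In code, that might look something like the sketch below.  The Schema 
construction, the FuncSpec constructor, and the twoFieldSchema helper 
are assumptions on my part, since neither API is settled:

public class Concat extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        // Fallback; the type checker should have replaced this UDF with
        // StringConcat or ByteConcat by the time exec is called.
        throw new IOException("Concat dispatched without a type match");
    }

    public Map<Schema, FuncSpec> getArgToFuncMapping() {
        Map<Schema, FuncSpec> m = new HashMap<Schema, FuncSpec>();
        // (chararray, chararray) -> the string implementation
        m.put(twoFieldSchema(DataType.CHARARRAY),
              new FuncSpec(StringConcat.class.getName()));
        // (bytearray, bytearray) -> the byte implementation
        m.put(twoFieldSchema(DataType.BYTEARRAY),
              new FuncSpec(ByteConcat.class.getName()));
        return m;
    }

    // Hypothetical helper: builds a schema of two fields of the given type.
    private static Schema twoFieldSchema(byte type) {
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, type));
        s.add(new Schema.FieldSchema(null, type));
        return s;
    }
}

Note that this relies on Schema implementing equals() and hashCode() so 
that two schemas with the same field types compare equal; otherwise the 
map lookup by the actual argument schema will never find an entry.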

The type checker will then take the schema of the arguments actually 
being passed to the UDF and look it up in the map.  If it finds an 
entry, it will use the associated FuncSpec.  If it does not, it will 
throw an exception saying that the EvalFunc cannot be used with those 
types.
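
Putting the null case and the lookup together, the type checker's logic 
would be roughly the following.  All of the helper names here are 
placeholders, not real methods:

// Sketch of the type-checker step for LOUserFunc; helper names are
// placeholders for whatever the type checker actually ends up using.
void checkUserFunc(LOUserFunc userFunc) throws TypeCheckException {
    EvalFunc<?> func = instantiate(userFunc);
    Map<Schema, FuncSpec> mapping = func.getArgToFuncMapping();
    if (mapping == null) {
        // Default: leave the UDF as is and assume it handles any input.
        return;
    }
    Schema actual = argumentSchema(userFunc);  // schema of the actual args
    FuncSpec match = mapping.get(actual);      // exact match only, no best fit
    if (match == null) {
        throw new TypeCheckException(func.getClass().getName()
            + " cannot be used with schema " + actual);
    }
    replaceFunc(userFunc, match);              // swap in the chosen implementation
}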

At this point the type checker will make no effort to find a best-fit 
function: either the match is exact, or the lookup fails.  In the 
future we would like to modify the type checker to select a best fit.  
For example, if a UDF says it can handle schema(long) and the type 
checker finds schema(int), it could insert a cast to bridge the gap.  
But in the first pass we will ignore this and depend on the user to 
insert the casts.

Thoughts?

Alan.
