pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "EvalFunction" by OlgaN
Date Wed, 07 Nov 2007 18:38:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:

New page:
== Eval Functions ==
To create an eval function, the following abstract class must be extended. The parameter T
is the return type of the eval function. 

public abstract class EvalFunc<T extends Datum>  {
    abstract public void exec(Tuple input, T output) throws IOException;
}

=== Input to the Functions ===
The arguments to the function get wrapped in a tuple and are passed as the parameter `input`
above. Thus, the first field of `input` is the first argument and so on. 

For example, suppose I have a data set A = 
<a, b, c>
<1, 2, 3>

Suppose I have written an eval function !MyFunc and my !PigLatin is as follows:

B = foreach A generate MyFunc($0,$2);

Then !MyFunc will be called first with the tuple <a, c> and then with the tuple <1, 3>.
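
The way arguments are wrapped into `input` can be pictured with a small self-contained sketch. It does not use Pig's classes: a `List<String>` stands in for Tuple, and the names `project` and `myFunc` are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArgWrappingSketch {

    // Hypothetical helper: builds the `input` tuple from the selected columns
    // of a row, the way "generate MyFunc($0,$2)" selects fields 0 and 2.
    static List<String> project(List<String> row, int... columns) {
        List<String> input = new ArrayList<>();
        for (int c : columns) {
            input.add(row.get(c));
        }
        return input;
    }

    // Hypothetical stand-in for MyFunc: joins its two arguments with a dash.
    static String myFunc(List<String> input) {
        return input.get(0) + "-" + input.get(1);
    }

    public static void main(String[] args) {
        List<List<String>> a = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("1", "2", "3"));
        for (List<String> row : a) {
            List<String> input = project(row, 0, 2);
            // Prints "[a, c] -> a-c" and then "[1, 3] -> 1-3".
            System.out.println(input + " -> " + myFunc(input));
        }
    }
}
```

The function is invoked once per row, and field 0 of `input` is always the first argument listed in the generate clause.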

=== Output of the functions ===

When extending the abstract class, the type parameter T must be bound to a subclass of Datum.
(The compiler will allow you to subclass !EvalFunc<Datum> but you will get an error
on using that function). When T is bound to a particular type of Datum ( !DataAtom, or Tuple,
or !DataBag, or !DataMap), the eval function gets handed, through the parameter `output`,
a Datum of type T to produce its output in. 

Note that when T is a data bag, the !DataBag you are handed as the parameter `output` is an
append-only data bag. Its contents always remain empty. This is a performance optimization
(we use it for pipelining) based on the assumption that you wouldn't want to examine your own
output.
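
One way to picture an append-only bag is the sketch below. It is not Pig's implementation: `AppendOnlyBag`, its `downstream` consumer, and this `exec` are hypothetical stand-ins that only illustrate why a pipelined bag can accept tuples while still appearing empty to the function that fills it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class AppendOnlyBagSketch {

    // Hypothetical stand-in for an append-only bag: each added item is handed
    // straight to a downstream consumer and is never stored in the bag.
    static class AppendOnlyBag<E> {
        private final Consumer<E> downstream;

        AppendOnlyBag(Consumer<E> downstream) {
            this.downstream = downstream;
        }

        void add(E item) {
            downstream.accept(item);
        }

        // Nothing is retained, so the bag always looks empty to the function.
        int size() {
            return 0;
        }
    }

    // Hypothetical eval-style function: writes results into `output`
    // instead of returning them.
    static void exec(String input, AppendOnlyBag<String> output) {
        for (String word : input.split(" ")) {
            output.add(word);
        }
    }

    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        AppendOnlyBag<String> output = new AppendOnlyBag<>(received::add);
        exec("hello pig world", output);
        System.out.println(received);      // what the next stage saw: [hello, pig, world]
        System.out.println(output.size()); // the bag itself holds nothing: 0
    }
}
```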

=== Example ===

As an example, here is the code for the builtin function TOKENIZE, which expects one argument
of type data atom and tokenizes that string into a data bag of tuples, one for each word in
the input string.

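public class TOKENIZE extends EvalFunc<DataBag> {

    public void exec(Tuple input, DataBag output) throws IOException {
        // The single argument is field 0 of the input tuple.
        String str = input.getAtomField(0).strval();
        // Split on space, double quote, comma, parentheses and asterisk.
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            // Each word becomes a one-field tuple appended to the output bag.
            output.add(new Tuple(tok.nextToken()));
        }
    }
}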
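
To try the tokenization logic without a Pig installation, the following sketch reuses the same `StringTokenizer` delimiter set. It is only an approximation of TOKENIZE: a `List<String>` stands in for the !DataBag of tuples, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeSketch {

    // Splits on the same delimiters as the TOKENIZE code: space, double quote,
    // comma, parentheses and asterisk. Delimiters are not returned as tokens.
    static List<String> tokenize(String str) {
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [hello, pig, world]
        System.out.println(tokenize("hello, \"pig\" (world)*"));
    }
}
```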

=== Advanced Features ===
   * '''Schemas''': Eval functions can declare their output schema by overriding the following
method in !EvalFunc. See: PigLatinSchemas.

    /**
     * @param input Schema of the input
     * @return Schema of the output
     */
    public Schema outputSchema(Schema input) {
          return input.copy();
    }

   * '''Algebraic Eval Functions''': If the input to your function might be large (i.e. the
input tuple may contain a large bag of tuples nested inside of it) and you are concerned about
performance, you may want to consider writing your function in such a way that it can receive
its input in small "chunks," one at a time, and then merge the per-chunk outputs to obtain
the final output. (In the map/reduce model, the "combiner" feature does this.) To enable this
feature, your eval function must implement the interface Algebraic. See AlgebraicEvalFunc
for details.

   * '''Final cleanup action''': If your function needs to do some final action after being
called the last time for a particular input set, it can override the finish method of the
class !EvalFunc.
    /**
     * Placeholder for cleanup to be performed at the end. User defined functions can override.
     */
    public void finish(){}
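
The chunk-and-merge idea behind algebraic eval functions can be sketched with a simple count. This is not the Algebraic interface itself (see AlgebraicEvalFunc for that); `countChunk` and `merge` are hypothetical names, and plain lists stand in for bags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AlgebraicCountSketch {

    // Per-chunk step: produce a partial result for one small chunk of the input.
    static long countChunk(List<String> chunk) {
        return chunk.size();
    }

    // Merge step: combine the per-chunk outputs into the final output.
    static long merge(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"),
                Arrays.asList("f"));
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(countChunk(chunk));
        }
        System.out.println(merge(partials)); // prints 6
    }
}
```

Counting works here because the result for the whole input can be computed from the per-chunk results alone; a function without that property cannot be split this way.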
