pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigFunctions" by OlgaN
Date Wed, 07 Nov 2007 18:18:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigFunctions

New page:
[[Anchor(Writing_your_own_Pig_functions)]]
== Writing your own Pig functions ==

Pig has a number of built-in functions for loading, filtering, aggregating, etc. (A complete
list is available at PigBuiltins.) However, if you want to do something specialized you may
need to write your own function. This page will walk you through how to do this.

[[Anchor(Types_of_functions)]]
=== Types of functions ===
The most important and most commonly used type of function is the EvalFunction. Eval functions
consume a tuple, do some computation, and produce some data.

Eval functions are very flexible, e.g. they can mimic "map" and "reduce" style functions:
      * ''"Map" behavior:'' The output type of an Eval Function is one of: a single value,
a tuple, or a bag of tuples (a Map/Reduce "map" function produces a bag of tuples).
      * ''"Reduce" behavior:'' Recall that in the Pig data model, a tuple may contain fields
of type ''bag''. Hence an Eval Function may perform aggregation or "reducing" by iterating
over a bag of tuples nested within the input tuple. This is how the built-in aggregation function
SUM(...) works, for example.   
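
The two behaviors above can be sketched in plain Java, without the Pig API. This is only an illustration: the class `EvalStyles` and its methods are hypothetical names, and a `List` stands in for a Pig bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvalStyles {
    // "Map" behavior: one input value in, a bag (here a List) of values out.
    public static List<String> splitWords(String line) {
        List<String> bag = new ArrayList<String>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                bag.add(word);
            }
        }
        return bag;
    }

    // "Reduce" behavior: iterate over a nested bag and produce a single value.
    public static double sum(List<Double> bag) {
        double total = 0.0;
        for (double value : bag) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(splitWords("joe smith"));            // [joe, smith]
        System.out.println(sum(Arrays.asList(1.0, 2.5, 4.0)));  // 7.5
    }
}
```

A real Eval function would read its input from a Pig tuple and write to Pig data types instead of Java collections.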
   
The other types of functions are:
   * '''Filter Function:''' evaluates to True or False when given a tuple; used to eliminate
unwanted tuples from a relation or bag
   * '''Group Function:''' assigns tuples to group(s) 
   * '''Load Function:''' controls reading of tuples from files
   * '''Store Function:''' controls storing of tuples to files
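
As a rough illustration of the filter and group roles, here is a plain Java sketch that does not use the Pig API. The class `FilterGroupSketch`, its method names, and the sample data are all hypothetical; a `List<String>` stands in for a tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterGroupSketch {
    // Filter-style function: returns true to keep the tuple, false to drop it.
    public static boolean keep(List<String> tuple) {
        return tuple.size() > 1 && tuple.get(1).contains("good");
    }

    // Group-style function: returns the key of the group the tuple belongs to.
    public static String groupKey(List<String> tuple) {
        return tuple.get(0);
    }

    // Applies the filter first, then collects surviving tuples by group key.
    public static Map<String, List<List<String>>> filterThenGroup(List<List<String>> relation) {
        Map<String, List<List<String>>> groups = new LinkedHashMap<String, List<List<String>>>();
        for (List<String> tuple : relation) {
            if (!keep(tuple)) {
                continue;
            }
            String key = groupKey(tuple);
            List<List<String>> bag = groups.get(key);
            if (bag == null) {
                bag = new ArrayList<List<String>>();
                groups.put(key, bag);
            }
            bag.add(tuple);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<String>> relation = Arrays.asList(
            Arrays.asList("a.example", "good lamp"),
            Arrays.asList("b.example", "broken lamp"),
            Arrays.asList("a.example", "good chair"));
        System.out.println(filterThenGroup(relation));
    }
}
```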

[[Anchor(Example)]]
==== Example ====
The following example uses each of the five types of functions. It computes the set of unique
IP addresses associated with "good" products drawn from a list of products found on the web.

{{{
register myFunctions.jar
products = LOAD '/productlist.txt' USING MyListStorage() AS (name, price, description, url);
goodProducts = FILTER products BY (price <= '19.99' AND MyFilter(description));
hostnames = FOREACH goodProducts GENERATE MyHostExtractor(url) AS hostname;
uniqueIPs = FOREACH (GROUP hostnames BY MyIPLookup(hostname)) GENERATE group AS ipAddress;
STORE uniqueIPs INTO '/iplist.txt' USING MyListStorage();
}}}

In the above example, !MyListStorage() serves as a load function as well as a store function;
!MyFilter() is a filter function; !MyHostExtractor() is an eval function; !MyIPLookup() is
a group function. `myFunctions.jar` is a jar file that contains the classes for the user-defined
functions.

[[Anchor(How_to_write_functions)]]
=== How to write functions ===

Ready to write your own handy-dandy pig function? Before you start, you will need to know
about the APIs for interacting with the data types (atom, tuple, bag). Click here: PigDataTypeAPIs.

Note: for Pig users with little or no experience in Java, here's a quick link to help you
along: PigJavaForDummies.

Click below to learn how to build your own:
   * EvalFunction
   * FilterFunction
   * GroupFunction
   * StorageFunction (these are the most difficult to write, and the built-in ones are
usually enough)

[[Anchor(Ok,_I_have_written_my_function,_how_to_use_it?)]]
=== Ok, I have written my function, how to use it? ===

You can use your functions following the steps below:

   * Put all the compiled files used by your function together into a jar file
   * Tell Pig about that jar with the `register <udfJar>` command before using
the function. (If you are using PigLatin in embedded mode, call `PigServer.registerJar()`.)
   * Then use your function as you would use a built-in. It's that simple.

Example:

1. Create your function in `/src/myfunc/MyEvalFunc.java`

{{{
package myfunc;

import java.io.IOException;
import java.util.StringTokenizer;
import com.yahoo.pig.EvalFunc;
import com.yahoo.pig.data.DataBag;
import com.yahoo.pig.data.Tuple;

public class MyEvalFunc extends EvalFunc<DataBag>
{
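        //@Override
        public void exec(Tuple input, DataBag output) throws IOException
        {
                // Read the first field of the input tuple as a string.
                String str = input.getAtomField(0).strval();
                // Split on spaces, double quotes, commas, parentheses, and asterisks;
                // the delimiters themselves are not returned as tokens.
                StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
                while (tok.hasMoreTokens())
                {
                        // Add each token to the output bag as a single-field tuple.
                        output.add(new Tuple(tok.nextToken()));
                }
        }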
}
}}}

2. Compile your function. Make sure to point the Java compiler to the Pig jar file.

{{{
/src/myfunc $ javac -classpath /src/pig.jar MyEvalFunc.java
}}}

3. Create the jar file.

{{{
/src/myfunc $ cd ..
/src $  jar cf myfunc.jar myfunc
}}}

4. Use the function through grunt (usage from a script is similar). Note that there are no quotes
around the path in the `register` call.

{{{
/src $ java -jar pig.jar -
grunt> register /src/myfunc.jar
grunt> A = load 'students' using PigStorage('\t');
grunt> B = foreach A generate myfunc.MyEvalFunc($0);
grunt> dump B;
({(joe smith)})
({(john adams)})
({(anne white)})
....
}}}
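
The tokenizing step inside `MyEvalFunc` can also be tried on its own, outside Pig. The sketch below uses only `java.util.StringTokenizer` with the same delimiter string; the class name `TokenizeDemo` and the sample inputs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Same delimiter set as MyEvalFunc: space, double quote, comma,
    // parentheses, and asterisk. Delimiters are not returned as tokens.
    public static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer tok = new StringTokenizer(str, " \",()*", false);
        while (tok.hasMoreTokens()) {
            tokens.add(tok.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("joe smith"));           // [joe, smith]
        System.out.println(tokenize("(anne),\"white\"*"));   // [anne, white]
    }
}
```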

See PigTutorial for an example of embedding Pig and your functions in Java. Use the same procedure
outlined above to create your function jar file.

[[Anchor(Advanced_Features:)]]
==== Advanced Features: ====
   * If you would like your function class to be instantiated with a non-default constructor,
you can use the `define <alias> <funcSpec>` command. (If you are using PigLatin
in embedded mode, call `PigServer.registerFunction()`). 
   * E.g., if I want my class `MyFunc` to be instantiated with the string `'foo'`, I can write
`define myFuncAlias MyFunc('foo')`. I can then use `myFuncAlias` as a normal user-defined
function.
   * Note that only string arguments to constructors are supported.
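
To make the string-only restriction concrete, here is a plain Java sketch of a function class with a `String` constructor, instantiated through reflection with string arguments. It is not Pig's actual mechanism: the classes `StringCtorSketch` and `MyFunc` and the helper `instantiate` are hypothetical and only illustrate why a constructor taking strings is easy to call from a textual spec such as `MyFunc('foo')`.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

public class StringCtorSketch {
    // Hypothetical user-defined function class with a String constructor.
    public static class MyFunc {
        private final String prefix;

        public MyFunc() {
            this("");
        }

        public MyFunc(String prefix) {
            this.prefix = prefix;
        }

        public String apply(String value) {
            return prefix + value;
        }
    }

    // Looks up a public constructor whose parameters are all Strings
    // and calls it with the given arguments.
    public static Object instantiate(Class<?> cls, String... args) {
        try {
            Class<?>[] types = new Class<?>[args.length];
            Arrays.fill(types, String.class);
            Constructor<?> ctor = cls.getConstructor(types);
            return ctor.newInstance((Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No usable String constructor", e);
        }
    }

    public static void main(String[] args) {
        MyFunc f = (MyFunc) instantiate(MyFunc.class, "foo");
        System.out.println(f.apply("bar")); // foobar
    }
}
```

A constructor that needs a number would, under this restriction, accept a string and parse it itself.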
