pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "UDFManual" by AlanGates
Date Wed, 06 Oct 2010 23:42:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "UDFManual" page has been changed by AlanGates.
http://wiki.apache.org/pig/UDFManual?action=diff&rev1=16&rev2=17

--------------------------------------------------

  Now that we have the function implemented, it needs to be compiled and included in a jar.
You will need to build `pig.jar` to compile your UDF. You can use the following set of commands
to check out the code from the SVN repository and create `pig.jar`:
  
  {{{
- svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk
+ svn co http://svn.apache.org/repos/asf/pig/trunk
  cd trunk
  ant
  }}}
@@ -105, +105 @@

  
  An aggregate function is an eval function that takes a bag and returns a scalar value. One
interesting and useful property of many aggregate functions is that they can be computed incrementally
in a distributed fashion. We call these functions `algebraic`. `COUNT` is an example of an
algebraic function because we can count the number of elements in a subset of the data and
then sum the counts to produce a final output. In the Hadoop world, this means that the partial
computations can be done by the map and combiner, and the final result can be computed by
the reducer.
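 To make the idea concrete before looking at the real code, here is a small standalone sketch (plain Java, not the Pig `Algebraic` interface) of counting done incrementally: each partition of the data is counted on its own, and the final answer is just the sum of the partial counts.
 
 {{{#!java
import java.util.Arrays;
import java.util.List;

public class AlgebraicCountSketch {
    // "Initial/Intermediate" step: count one partition of the data
    // (what the map and combiner would do).
    static long partialCount(List<String> partition) {
        return partition.size();
    }

    // "Final" step: sum the partial counts (what the reducer would do).
    static long finalCount(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> partition1 = Arrays.asList("a", "b", "c");
        List<String> partition2 = Arrays.asList("d", "e");
        long total = finalCount(Arrays.asList(partialCount(partition1),
                                              partialCount(partition2)));
        System.out.println(total);   // prints 5
    }
}
 }}}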
  
- It is very important for performance to make sure that aggregate functions that are algebraic
are implemented as such. Let's look at the implementation of the COUNT function to see what
this means. (Error handling and some other code is omitted to save space. The full code can
be accessed [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup|here]].)
+ It is very important for performance to make sure that aggregate functions that are algebraic
are implemented as such. Let's look at the implementation of the COUNT function to see what
this means. (Error handling and some other code is omitted to save space. The full code can
be accessed [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup|here]].)
  
  {{{#!java
  public class COUNT extends EvalFunc<Long> implements Algebraic{
@@ -229, +229 @@

  || bag || !DataBag ||
  || map || Map<Object, Object> ||
  
- All Pig-specific classes are available [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/|here]]
+ All Pig-specific classes are available [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/|here]]
  
  `Tuple` and `DataBag` are different in that they are not concrete classes but rather interfaces.
This enables users to extend Pig with their own versions of tuples and bags. As a result,
UDFs cannot directly instantiate bags or tuples; they need to go through factory classes:
`TupleFactory` and `BagFactory`.
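 For example, a UDF that needs to hand back a bag containing a single one-field tuple would go through the factory singletons rather than constructors. The sketch below shows just that helper; the surrounding `EvalFunc` class and error handling are omitted.
 
 {{{#!java
import java.io.IOException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BagBuilderSketch {
    // Factories are obtained through their singleton accessors, never via "new".
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    public static DataBag singletonBag(Object value) throws IOException {
        Tuple t = tupleFactory.newTuple(1); // tuple with one pre-allocated field
        t.set(0, value);                    // fill in the field
        DataBag bag = bagFactory.newDefaultBag();
        bag.add(t);
        return bag;
    }
}
 }}}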
  
@@ -615, +615 @@

  
  <<Anchor(Load_Functions)>>
  === Load Functions ===
- Every load function needs to implement the `LoadFunc` interface. An abbreviated version
is shown below. The full definition can be seen [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup|here]].
+ Every load function needs to implement the `LoadFunc` interface. An abbreviated version
is shown below. The full definition can be seen [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup|here]].
  
  {{{#!java
  public interface LoadFunc {
@@ -649, +649 @@

  
In this query, only `age` needs to be converted to its actual type (`int`) right away. `name`
only needs to be converted in the next step of processing where the data is likely to be much
smaller. `gpa` is not used at all and will never need to be converted.
  
- This is the main reason for Pig to separate the reading of the data (which can happen immediately)
from the converting of the data (to the right type, which can happen later). For ASCII data,
Pig provides `Utf8StorageConverter` that your loader class can extend and will take care of
all the conversion routines. The code for it can be found [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup|here]].
+ This is the main reason for Pig to separate the reading of the data (which can happen immediately)
from the converting of the data (to the right type, which can happen later). For ASCII data,
Pig provides `Utf8StorageConverter` that your loader class can extend and will take care of
all the conversion routines. The code for it can be found [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup|here]].
  
Note that conversion routines should return null values for data that can't be converted
to the specified type.
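 As an illustration, a conversion routine in the spirit of `Utf8StorageConverter` (a simplified sketch, not the actual builtin code) turns a parse failure into a null value instead of an exception:
 
 {{{#!java
import java.io.IOException;

public class ConversionSketch {
    // Simplified conversion routine: return null instead of failing
    // when the bytes do not hold a valid integer.
    public static Integer bytesToInteger(byte[] b) throws IOException {
        if (b == null) {
            return null;
        }
        String s = new String(b, "UTF-8");
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null; // unconvertible data becomes a null value
        }
    }
}
 }}}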
  
@@ -683, +683 @@

  
  Note that this approach assumes that the data has a uniform schema. The function needs to
make sure that the data it produces conforms to the schema returned by `determineSchema`;
otherwise the processing will fail. This means producing the right number of fields in the
tuple (dropping fields or emitting null values if needed) and producing fields of the right
type (again emitting null values as needed).
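 One way to guarantee this (a hypothetical helper, not part of the Pig API) is to normalize every tuple to the declared field count before returning it, padding with nulls or dropping trailing fields as needed:
 
 {{{#!java
import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SchemaConformSketch {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    // Hypothetical helper: force a raw tuple to have exactly the number of
    // fields promised by determineSchema, padding with nulls or dropping
    // trailing fields as needed.
    public static Tuple conformTo(Tuple raw, int declaredFieldCount) throws IOException {
        Tuple result = tupleFactory.newTuple(declaredFieldCount);
        for (int i = 0; i < declaredFieldCount; i++) {
            result.set(i, i < raw.size() ? raw.get(i) : null);
        }
        return result;
    }
}
 }}}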
  
- For complete examples, see [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/BinStorage.java?view=markup|BinStroage]]
and [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?view=markup|PigStorage]].
+ For complete examples, see [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/BinStorage.java?view=markup|BinStorage]]
and [[http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?view=markup|PigStorage]].
  
  <<Anchor(Store_Functions)>>
  === Store Functions ===
