pig-commits mailing list archives

From o...@apache.org
Subject svn commit: r1050082 [6/6] - in /pig/trunk: ./ src/docs/src/documentation/content/xdocs/
Date Thu, 16 Dec 2010 18:10:59 GMT
Added: pig/trunk/src/docs/src/documentation/content/xdocs/test.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/test.xml?rev=1050082&view=auto
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/test.xml (added)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/test.xml Thu Dec 16 18:10:59 2010
@@ -0,0 +1,949 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
+<document>
+  <header>
+    <title>Testing and Diagnostics</title>
+  </header>
+  <body>
+
+<!-- =========================================================================== -->
+<!-- DIAGNOSTIC OPERATORS -->    
+<section>
+	<title>Diagnostic Operators</title>
+	
+ <!-- +++++++++++++++++++++++++++++++++++++++ --> 
+   <section>
+   <title>DESCRIBE</title>
+   <p>Returns the schema of a relation.</p>
+   
+   <section>
+   <title>Syntax</title>
+   <table>
+      <tr> 
+            <td>
+               <p>DESCRIBE alias;        </p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Terms</title>
+   <table>
+      <tr>
+            <td>
+               <p>alias</p>
+            </td>
+            <td>
+               <p>The name of a relation.</p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Usage</title>
+   <p>Use the DESCRIBE operator to view the schema of a relation. 
+   You can view outer relations as well as relations defined in a nested FOREACH statement.</p>
+   </section>
+   
+   <section>
+   <title>Example</title>
+   <p>In this example a schema is specified using the AS clause. If all data conforms to the schema, Pig will use the assigned types.</p>
+<source>
+A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
+
+B = FILTER A BY name matches 'J.+';
+
+C = GROUP B BY name;
+
+D = FOREACH C GENERATE COUNT(B.age);
+
+DESCRIBE A;
+A: {name: chararray,age: int,gpa: float}
+
+DESCRIBE B;
+B: {name: chararray,age: int,gpa: float}
+
+DESCRIBE C;
+C: {group: chararray,B: {(name: chararray,age: int,gpa: float)}}
+
+DESCRIBE D;
+D: {long}
+</source>
+   
+   <p>In this example no schema is specified. All fields default to type bytearray; the long shown for relation d is the return type of COUNT (see Data Types).</p>
+<source>
+a = LOAD 'student';
+
+b = FILTER a BY $0 matches 'J.+';
+
+c = GROUP b BY $0;
+
+d = FOREACH c GENERATE COUNT(b.$1);
+
+DESCRIBE a;
+Schema for a unknown.
+
+DESCRIBE b;
+2008-12-05 01:17:15,316 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator
+Schema for b unknown.
+
+DESCRIBE c;
+2008-12-05 01:17:23,343 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator
+c: {group: bytearray,b: {null}}
+
+DESCRIBE d;
+2008-12-05 03:04:30,076 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator
+d: {long}
+</source>
+   
+ <p>This example shows how to view the schema of a nested relation using the :: operator.</p>  
+ <source>
+A = LOAD 'studentab10k' AS (name, age, gpa); 
+B = GROUP A BY name; 
+C = FOREACH B { 
+     D = DISTINCT A.age; 
+     GENERATE COUNT(D), group;} 
+
+DESCRIBE C::D; 
+D: {age: bytearray} 
+</source>
+   </section></section>
+   
+ <!-- +++++++++++++++++++++++++++++++++++++++ -->   
+ <section>
+   <title>DUMP</title>
+   <p>Dumps or displays results to screen.</p>
+   
+   <section>
+   <title>Syntax</title>
+   <table>
+      <tr> 
+            <td>
+               <p>DUMP alias;        </p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Terms</title>
+   <table>
+      <tr>
+            <td>
+               <p>alias</p>
+            </td>
+            <td>
+               <p>The name of a relation.</p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Usage</title>
+   <p>Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen. DUMP is meant for interactive mode; statements are executed immediately and the results are not saved (persisted). You can use DUMP as a debugging device to make sure that the results you are expecting are actually generated. </p>
+   
+   <p>
+   Note that production scripts SHOULD NOT use DUMP as it will disable multi-query optimizations and is likely to slow down execution 
+   (see <a href="perf.html#Store+vs.+Dump">Store vs. Dump</a>).
+   </p>
+   </section>
+   
+   <section>
+   <title>Example</title>
+   <p>In this example a dump is performed after each statement.</p>
+<source>
+A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
+
+DUMP A;
+(John,18,4.0F)
+(Mary,19,3.7F)
+(Bill,20,3.9F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+
+B = FILTER A BY name matches 'J.+';
+
+DUMP B;
+(John,18,4.0F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+</source>
+</section></section>      
+   
+ <!-- +++++++++++++++++++++++++++++++++++++++ -->
+   <section>
+   <title>EXPLAIN</title>
+   <p>Displays execution plans.</p>
+   
+   <section>
+   <title>Syntax</title>
+   <table>
+      <tr> 
+            <td>
+               <p>EXPLAIN [-script pigscript] [-out path] [-brief] [-dot] [-param param_name = param_value] [-param_file file_name] alias; </p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Terms</title>
+   <table>
+    
+         <tr>
+            <td>
+               <p>-script</p>
+            </td>
+            <td>
+               <p>Use to specify a pig script.</p>
+            </td>
+         </tr>      
+
+         <tr>
+            <td>
+               <p>-out</p>
+            </td>
+            <td>
+               <p>Use to specify the output path (directory).</p>
+               <p>Will generate a logical_plan[.txt|.dot], physical_plan[.txt|.dot], exec_plan[.txt|.dot] file in the specified path.</p>
+               <p>Default (no path specified): Stdout </p>
+            </td>
+         </tr>
+
+         <tr>
+            <td>
+               <p>-brief</p>
+            </td>
+            <td>
+               <p>Does not expand nested plans (presenting a smaller graph for overview). </p>
+            </td>
+         </tr>
+         
+         <tr>
+            <td>
+               <p>-dot</p>
+            </td>
+            <td>
+
+               <p>Text mode (default): multiple outputs (splits) are broken out in sections.  </p>
+               <p>Dot mode: outputs a format that can be passed to the dot utility for graphical display; 
+               dot will generate a directed-acyclic-graph (DAG) of the plans in any supported format (.gif, .jpg ...).</p>
+            </td>
+         </tr>
+
+         <tr>
+            <td>
+               <p>-param param_name = param_value</p>
+            </td>
+            <td>
+               <p>See <a href="cont.html#Parameter+Substitution">Parameter Substitution</a>.</p>
+            </td>
+         </tr>
+
+         <tr>
+            <td>
+               <p>-param_file file_name</p>
+            </td>
+            <td>
+               <p>See <a href="cont.html#Parameter+Substitution">Parameter Substitution</a>. </p>
+            </td>
+         </tr>
+      
+      <tr>
+            <td>
+               <p>alias</p>
+            </td>
+            <td>
+               <p>The name of a relation.</p>
+            </td>
+         </tr>
+         
+    
+   </table></section>
+   
+   <section>
+   <title>Usage</title>
+   <p>Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relation. </p>
+   <p>If no script is given:</p>
+
+   <ul>	
+      <li>
+         <p>The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.</p>
+      </li>
+      <li>
+         <p>The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.</p>
+      </li>
+      <li>
+         <p>The map reduce plan shows how the physical operators are grouped into map reduce jobs.</p>
+      </li>
+  </ul> 
+  <p></p>
+   <p>If a script without an alias is specified, EXPLAIN will output the entire execution graph (logical, physical, and map reduce). </p>
+   <p>If a script with an alias is specified, EXPLAIN will output the plan for the given alias. </p>
+   </section>
+   
+   <section>
+   <title>Example</title>
+   <p>In this example the EXPLAIN operator produces all three plans. (Note that only a portion of the output is shown in this example.)</p>
+
+ <source>
+A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
+
+B = GROUP A BY name;
+
+C = FOREACH B GENERATE COUNT(A.age);
+
+EXPLAIN C;
+-----------------------------------------------
+Logical Plan:
+-----------------------------------------------
+Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema: {long} Type: Unknown
+|
+|---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema: {long} Type: bag
+ <em>etc ... </em> 
+
+-----------------------------------------------
+Physical Plan:
+-----------------------------------------------
+Store(fakefile:org.apache.pig.builtin.PigStorage) - xxx-Fri Dec 05 19:42:29 UTC 2008-40
+|
+|---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29 UTC 2008-39
+    |   |
+    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - xxx-Fri Dec 05 
+ <em>etc ... </em> 
+
+--------------------------------------------------
+| Map Reduce Plan                               
+-------------------------------------------------
+MapReduce node xxx-Fri Dec 05 19:42:29 UTC 2008-41
+Map Plan
+Local Rearrange[tuple]{chararray}(false) - xxx-Fri Dec 05 19:42:29 UTC 2008-34
+|   |
+|   Project[chararray][0] - xxx-Fri Dec 05 19:42:29 UTC 2008-35
+ <em>etc ... </em> 
+
+</source> 
+ </section></section>
+  
+  
+ <!-- +++++++++++++++++++++++++++++++++++++++ -->
+   <section>
+   <title>ILLUSTRATE</title>
+   <p>Displays a step-by-step execution of a sequence of statements.</p>
+
+   <section>
+   <title>Syntax</title>
+   <table>
+      <tr> 
+            <td>
+               <p>ILLUSTRATE {alias | -script scriptfile}; </p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Terms</title>
+   <table>
+      <tr>
+            <td>
+               <p>alias</p>
+            </td>
+            <td>
+               <p>The name of a relation.</p>
+            </td>
+         </tr> 
+            
+      <tr>
+            <td>
+               <p>-script scriptfile</p>
+            </td>
+            <td>
+               <p>The script keyword followed by the name of a Pig script file (for example, myscript.pig). </p>
+               <p>The script file should not contain an ILLUSTRATE statement.</p>
+            </td>
+         </tr> 
+   </table></section>
+   
+   <section>
+   <title>Usage</title>
+   <p>Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. 
+   You can run ILLUSTRATE with a relation or a Pig script.</p>
+
+
+   <p>ILLUSTRATE uses the ExampleGenerator algorithm, which automatically selects an appropriate and concise set of example data. It does a better job than random sampling would do; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate all the sampled data, giving you empty results that will not help with debugging. </p>
+   
+   <p>With the ILLUSTRATE operator you can test your programs on small datasets and get faster turnaround times. The ExampleGenerator algorithm uses Pig's local mode (rather than Pig's mapreduce mode) which means that illustrative example data is generated in near real-time.</p>
+
+   </section>
+   
+   <section>
+   <title>Example - Relation</title>
+   <p>This example demonstrates how to use ILLUSTRATE with a relation. Note that the LOAD statement must include a schema (the AS clause).</p>
+
+ <source>
+visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, timestamp:chararray);
+
+DUMP visits;
+(Amy,cnn.com,20080218)
+(Fred,harvard.edu,20081204)
+(Amy,bbc.com,20081205)
+(Fred,stanford.edu,20081206)
+
+recent_visits = FILTER visits BY timestamp >= '20081201';
+
+user_visits = GROUP recent_visits BY user;
+
+num_user_visits = FOREACH user_visits GENERATE group, COUNT(recent_visits);
+
+DUMP num_user_visits;
+(Amy,1L)
+(Fred,2L)
+
+ILLUSTRATE num_user_visits;
+------------------------------------------------------------------------
+| visits     | user: bytearray | url: bytearray | timestamp: bytearray |
+------------------------------------------------------------------------
+|            | Amy             | cnn.com        | 20080218             |
+|            | Fred            | harvard.edu    | 20081204             |
+|            | Amy             | bbc.com        | 20081205             |
+|            | Fred            | stanford.edu   | 20081206             |
+------------------------------------------------------------------------
+
+------------------------------------------------------------------------
+| visits     | user: chararray | url: chararray | timestamp: chararray |
+------------------------------------------------------------------------
+|            | Amy             | cnn.com        | 20080218             |
+|            | Fred            | harvard.edu    | 20081204             |
+|            | Amy             | bbc.com        | 20081205             |
+|            | Fred            | stanford.edu   | 20081206             |
+------------------------------------------------------------------------
+
+-------------------------------------------------------------------------------
+| recent_visits     | user: chararray | url: chararray | timestamp: chararray |
+-------------------------------------------------------------------------------
+|                   | Fred            | harvard.edu    | 20081204             |
+|                   | Amy             | bbc.com        | 20081205             |
+|                   | Fred            | stanford.edu   | 20081206             |
+-------------------------------------------------------------------------------
+
+------------------------------------------------------------------------------------------------------------------
+| user_visits     | group: chararray | recent_visits: bag({user: chararray,url: chararray,timestamp: chararray}) |
+------------------------------------------------------------------------------------------------------------------
+|                 | Amy              | {(Amy, bbc.com, 20081205)}                                                |
+|                 | Fred             | {(Fred, harvard.edu, 20081204), (Fred, stanford.edu, 20081206)}           |
+------------------------------------------------------------------------------------------------------------------
+
+-------------------------------
+| num_user_visits     | long  |
+------------------------------
+|                     | 1     |
+|                     | 2     |
+-------------------------------
+</source>
+</section>
+
+   <section>
+   <title>Example - Script</title>
+ <p>This example demonstrates how to use ILLUSTRATE with a script. Note that the script itself should not contain an ILLUSTRATE statement. For example, assuming the statements from the previous example are saved to a script file (named, say, visits.pig), you can illustrate the script as a whole:</p>
+<source>
+ILLUSTRATE -script visits.pig;
+</source>
+</section>
+
+</section>
+</section>
+
+<!-- =========================================================================== -->
+<!-- PIG SCRIPTS AND MAPREDUCE JOB IDS -->    
+<section>
+<title>Pig Scripts and MapReduce Job IDs</title>
+   <p>Complex Pig scripts often generate many MapReduce jobs. To help you debug a script, Pig prints a summary of the execution that shows which relations (aliases) are mapped to each MapReduce job. </p>
+<source>
+JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime 
+    MinReduceTime AvgReduceTime Alias Feature Outputs
+job_201004271216_12712 1 1 3 3 3 12 12 12 B,C GROUP_BY,COMBINER
+job_201004271216_12713 1 1 3 3 3 12 12 12 D SAMPLER
+job_201004271216_12714 1 1 3 3 3 12 12 12 D ORDER_BY,COMBINER 
+    hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,
+</source>
+
+</section>
+
+<!-- ==================================================================== -->
+<!-- PIG STATISTICS-->
+<section>
+<title>Pig Statistics</title>
+
+<p>Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed. These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is done.</p>
+
+<p>The new Pig statistics and the existing Hadoop statistics can also be accessed via the Hadoop job history file (and job xml file). Piggybank has a HadoopJobHistoryLoader which acts as an example of using Pig itself to query these statistics (the loader can be used as a reference implementation but is NOT supported for production use).</p>
+
+<section>
+<title>Java API</title>
+
+<p>Several new public classes make it easier for external tools such as Oozie to integrate with Pig statistics. </p>
+
+<p>The Pig statistics API documentation is available here: <a href="http://pig.apache.org/docs/r0.9.0/api/">http://pig.apache.org/docs/r0.9.0/api/</a></p>
+
+<p>The stats classes are in the package: org.apache.pig.tools.pigstats</p>
+<ul>
+<li>PigStats</li>
+<li>JobStats</li>
+<li>OutputStats</li>
+<li>InputStats</li>
+</ul>
+<p></p>
+
+<p>The PigRunner class mimics the behavior of the Main class but gives users a statistics object back. Optionally, you can call the API with an implementation of a progress listener, which will be invoked by the Pig runtime during execution. </p>
+
+<source>
+package org.apache.pig;
+
+public abstract class PigRunner {
+    public static PigStats run(String[] args, PigProgressNotificationListener listener)
+}
+
+public interface PigProgressNotificationListener extends java.util.EventListener {
+    // just before the launch of MR jobs for the script
+    public void launchStartedNotification(int numJobsToLaunch);
+    // number of jobs submitted in a batch
+    public void jobsSubmittedNotification(int numJobsSubmitted);
+    // a job is started
+    public void jobStartedNotification(String assignedJobId);
+    // a job is completed successfully
+    public void jobFinishedNotification(JobStats jobStats);
+    // a job has failed
+    public void jobFailedNotification(JobStats jobStats);
+    // a user output is completed successfully
+    public void outputCompletedNotification(OutputStats outputStats);
+    // updates the progress as percentage
+    public void progressUpdatedNotification(int progress);
+    // the script execution is done
+    public void launchCompletedNotification(int numJobsSucceeded);
+}
+</source>
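+
+<p>For example, here is a minimal sketch of how an external tool might drive a script through PigRunner and inspect the returned statistics. The script name is hypothetical, the import locations follow the stats package listed above and may differ slightly by Pig version, and the PigStats accessors shown (isSuccessful, getReturnCode) are assumed to be available in your release.</p>
+
+<source>
+import org.apache.pig.PigRunner;
+import org.apache.pig.tools.pigstats.JobStats;
+import org.apache.pig.tools.pigstats.OutputStats;
+import org.apache.pig.tools.pigstats.PigProgressNotificationListener;
+import org.apache.pig.tools.pigstats.PigStats;
+
+public class RunWithStats {
+
+    // A listener that only reports job start and overall progress; the other callbacks are no-ops.
+    static class ProgressPrinter implements PigProgressNotificationListener {
+        public void launchStartedNotification(int numJobsToLaunch) { }
+        public void jobsSubmittedNotification(int numJobsSubmitted) { }
+        public void jobStartedNotification(String assignedJobId) {
+            System.out.println("started job " + assignedJobId);
+        }
+        public void jobFinishedNotification(JobStats jobStats) { }
+        public void jobFailedNotification(JobStats jobStats) { }
+        public void outputCompletedNotification(OutputStats outputStats) { }
+        public void progressUpdatedNotification(int progress) {
+            System.out.println(progress + "% complete");
+        }
+        public void launchCompletedNotification(int numJobsSucceeded) { }
+    }
+
+    public static void main(String[] args) {
+        // Same arguments you would pass on the pig command line; myscript.pig is a placeholder.
+        String[] pigArgs = { "-x", "local", "myscript.pig" };
+        PigStats stats = PigRunner.run(pigArgs, new ProgressPrinter());
+        if (!stats.isSuccessful()) {
+            System.err.println("script failed, return code " + stats.getReturnCode());
+        }
+    }
+}
+</source>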
+
+
+</section>
+
+<section>
+<title>Job XML</title>
+<p>The following entries are included in the job configuration (job conf): </p>
+
+<table>
+<tr>
+<td>
+<p> <strong>Pig Statistic</strong> </p>
+</td>
+<td>
+<p> <strong>Description</strong></p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.script.id</p>
+</td>
+<td>
+<p>The UUID for the script. All jobs spawned by the script have the same script ID.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.script</p>
+</td>
+<td>
+<p>The base64 encoded script text.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.command.line</p>
+</td>
+<td>
+<p>The command line used to invoke the script.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.hadoop.version</p>
+</td>
+<td>
+<p>The Hadoop version installed.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.version</p>
+</td>
+<td>
+<p>The Pig version used.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.input.dirs</p>
+</td>
+<td>
+<p>A comma-separated list of input directories for the job.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.map.output.dirs</p>
+</td>
+<td>
+<p>A comma-separated list of output directories in the map phase of the job.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.reduce.output.dirs</p>
+</td>
+<td>
+<p>A comma-separated list of output directories in the reduce phase of the job.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.parent.jobid</p>
+</td>
+<td>
+<p>A comma-separated list of parent job ids.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.script.features</p>
+</td>
+<td>
+<p>A list of Pig features used in the script.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.job.feature</p>
+</td>
+<td>
+<p>A list of Pig features used in the job.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p>pig.alias</p>
+</td>
+<td>
+<p>The alias associated with the job.</p>
+</td>
+</tr>
+</table>
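+
+<p>As an illustration only (this is not part of the Pig API), an external tool that already has a handle on a job's Hadoop configuration could read these entries directly. The sketch below assumes you have an org.apache.hadoop.conf.Configuration object for the job and that Apache commons-codec is available for decoding the base64 encoded script text.</p>
+
+<source>
+import org.apache.commons.codec.binary.Base64;
+import org.apache.hadoop.conf.Configuration;
+
+public class PigJobConfReader {
+
+    // Prints a few of the Pig-specific entries from a job configuration.
+    // The Configuration instance is assumed to come from the job being inspected.
+    public static void printPigEntries(Configuration conf) {
+        System.out.println("script id:   " + conf.get("pig.script.id"));
+        System.out.println("pig version: " + conf.get("pig.version"));
+        System.out.println("aliases:     " + conf.get("pig.alias"));
+
+        // pig.script holds the base64 encoded script text.
+        String encoded = conf.get("pig.script");
+        if (encoded != null) {
+            String script = new String(Base64.decodeBase64(encoded.getBytes()));
+            System.out.println("script text:");
+            System.out.println(script);
+        }
+    }
+}
+</source>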
+</section>
+
+<section>
+<title>Hadoop Job History Loader</title>
+<p>The HadoopJobHistoryLoader in Piggybank loads Hadoop job history files and job xml files from the file system. For each MapReduce job, the loader produces a tuple with schema (j:map[], m:map[], r:map[]). The first map in the schema contains job-related entries. Here are some of the important key names in the map: </p>
+
+<table>
+<tr>
+<td>
+<p>PIG_SCRIPT_ID</p>
+<p>CLUSTER </p>
+<p>QUEUE_NAME</p>
+<p>JOBID</p>
+<p>JOBNAME</p>
+<p>STATUS</p>
+</td>
+<td>
+<p>USER </p>
+<p>HADOOP_VERSION  </p>
+<p>PIG_VERSION</p>
+<p>PIG_JOB_FEATURE</p>
+<p>PIG_JOB_ALIAS </p>
+<p>PIG_JOB_PARENTS</p>
+</td>
+<td>
+<p>SUBMIT_TIME</p>
+<p>LAUNCH_TIME</p>
+<p>FINISH_TIME</p>
+<p>TOTAL_MAPS</p>
+<p>TOTAL_REDUCES</p>
+</td>
+</tr>
+</table>
+<p></p>
+<p>Examples that use the loader to query Pig statistics are shown below.</p>
+</section>
+
+
+<section>
+<title>Examples</title>
+<p>Find scripts that generate more than three MapReduce jobs:</p>
+<source>
+a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
+b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
+c = foreach b generate group.$1, group.$2, COUNT(a);
+d = filter c by $2 > 3;
+dump d;
+</source>
+
+<p>Find the running time of each script (in seconds): </p>
+<source>
+a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
+b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, 
+         (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
+c = group b by (id, user, script_name);
+d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
+dump d;
+</source>
+
+<p>Find the number of scripts run by user and queue on a cluster: </p>
+<source>
+a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
+b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
+c = group b by (id, user, queue) parallel 10;
+d = foreach c generate group.user, group.queue, COUNT(b);
+dump d;
+</source>
+
+<p>Find scripts that have failed jobs: </p>
+<source>
+a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
+b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
+c = filter b by status != 'SUCCESS';
+dump c;
+</source>
+
+<p>Find scripts that use only the default parallelism: </p>
+<source>
+a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
+b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
+c = group b by (id, user, script_name) parallel 10;
+d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
+e = filter d by max_reduces == 1;
+dump e;
+</source>
+</section>
+</section>   
+
+
+<!-- =========================================================================== -->
+<!-- PIGUNIT -->    
+
+  <section>
+      <title>PigUnit</title>
+      <p>PigUnit is a simple xUnit framework that enables you to easily test your Pig scripts.
+        With PigUnit you can perform unit testing, regression testing, and rapid prototyping. 
+        No cluster setup is required if you run Pig in local mode.
+      </p>
+
+    <section>
+      <title>Build PigUnit</title>
+      <p>To compile PigUnit run the command shown below from the Pig trunk. The compile will create the pigunit.jar file.</p>
+      <source>
+$pig_trunk ant pigunit-jar   
+</source>
+    </section>
+    
+      <section>
+      <title>Run PigUnit</title>
+      <p>You can run PigUnit using Pig's local mode or mapreduce mode.</p>
+    <section>
+      <title>Local Mode</title>
+      <p>
+        PigUnit runs in Pig's local mode by default.
+        Local mode is fast and enables you to use your local file system as the HDFS cluster.
+        Local mode does not require a real cluster; a new local execution environment is created each time. 
+      </p>
+    </section>
+
+    <section>
+      <title>Mapreduce Mode</title>
+      <p>PigUnit also runs in Pig's mapreduce mode. Mapreduce mode requires you to use a Hadoop cluster and HDFS installation.
+        It is enabled when the Java system property pigunit.exectype.cluster is set to any value: e.g. -Dpigunit.exectype.cluster=true or System.getProperties().setProperty("pigunit.exectype.cluster", "true"). The cluster you select must be specified in the CLASSPATH (similar to the HADOOP_CONF_DIR variable). 
+      </p>
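+      <p>For example, in a JUnit test class you might switch PigUnit to mapreduce mode once for the whole class. This is only a sketch; the class name is hypothetical and it assumes the cluster configuration directory is already on the CLASSPATH.</p>
+<source>
+import org.junit.BeforeClass;
+
+public class TopQueriesClusterTest {
+
+    // Run every PigUnit test in this class against the cluster instead of local mode.
+    @BeforeClass
+    public static void useCluster() {
+        System.getProperties().setProperty("pigunit.exectype.cluster", "true");
+    }
+
+    // ... test methods using PigTest go here, as in the PigUnit Example section below.
+}
+</source>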
+    </section>
+
+    </section>
+
+    <section>
+      <title>PigUnit Example</title>
+      
+       <p>
+        Many PigUnit examples are available in the
+        <a href="http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java">PigUnit tests</a>. 
+      </p>
+      
+      <p>The example included here computes the top N most common queries. 
+        The Pig script, top_queries.pig, is similar to the 
+        <a href="start.html#Pig+Script+1%3A+Query+Phrase+Popularity">Query Phrase Popularity</a> 
+        in the Pig tutorial. It expects as input a file of queries and a parameter n (n is 2 in our case in order to compute the top 2). 
+      </p>
+      
+      <p>Setting up a test for this script is easy: the argument and the input data are
+        specified as two text arrays, and so is the expected output of the
+        script, which is compared to the actual result of running the Pig script. 
+      </p>
+
+      <section>
+        <title>Java Test</title>
+        <source>
+  @Test
+  public void testTop2Queries() {
+    String[] args = {
+        "n=2",
+        };
+ 
+    PigTest test = new PigTest("top_queries.pig", args);
+ 
+    String[] input = {
+        "yahoo",
+        "yahoo",
+        "yahoo",
+        "twitter",
+        "facebook",
+        "facebook",
+        "linkedin",
+    };
+ 
+    String[] output = {
+        "(yahoo,3)",
+        "(facebook,2)",
+    };
+ 
+    test.assertOutput("data", input, "queries_limit", output);
+  }
+</source>
+      </section>
+
+      <section>
+        <title>top_queries.pig</title>
+        <source>
+data =
+    LOAD 'input'
+    AS (query:CHARARRAY);
+     
+queries_group =
+    GROUP data
+    BY query; 
+    
+queries_count = 
+    FOREACH queries_group 
+    GENERATE 
+        group AS query, 
+        COUNT(data) AS total;
+        
+queries_ordered =
+    ORDER queries_count
+    BY total DESC, query;
+            
+queries_limit =
+    LIMIT queries_ordered $n;
+
+STORE queries_limit INTO 'output';
+</source>
+      </section>
+
+      <section>
+        <title>Run</title>
+
+        <p>The test can be executed by JUnit (or any other Java testing framework). It requires:
+        </p>
+        <ol>
+          <li>pig.jar</li>
+          <li>pigunit.jar</li>
+        </ol>
+
+        <p>The test takes about 25 seconds to run and should pass. In case of error (for example, if you change the
+          parameter n to n=3), the diff of the output is displayed:
+        </p>
+
+        <source>
+junit.framework.ComparisonFailure: null expected:&lt;...ahoo,3)
+(facebook,2)[]&gt; but was:&lt;...ahoo,3)
+(facebook,2)[
+(linkedin,1)]&gt;
+        at junit.framework.Assert.assertEquals(Assert.java:81)
+        at junit.framework.Assert.assertEquals(Assert.java:87)
+        at org.apache.pig.pigunit.PigTest.assertEquals(PigTest.java:272)
+</source>
+      </section>
+    </section>
+
+
+    <section>
+      <title>Troubleshooting Tips</title>
+      <p>Common problems you may encounter are discussed below.</p>
+      <section>
+        <title>Classpath in Mapreduce Mode</title>
+        <p>When using PigUnit in mapreduce mode, be sure to include the $HADOOP_CONF_DIR of the
+          cluster in your CLASSPATH.</p>
+        <p>
+          The default value is ~/pigtest/conf.
+        </p>
+        <source>
+org.apache.pig.backend.executionengine.ExecException: 
+ERROR 4010: Cannot find hadoop configurations in classpath 
+(neither hadoop-site.xml nor core-site.xml was found in the classpath).
+If you plan to use local mode, please put -x local option in command line
+</source>
+      </section>
+
+      <section>
+        <title>UDF jars Not Found</title>
+        <p>This warning means that some jars are missing from your test environment.</p>
+        <source>
+WARN util.JarManager: Couldn't find the jar for 
+org.apache.pig.piggybank.evaluation.string.LOWER, skip it
+</source>
+      </section>
+
+      <section>
+        <title>Storing Data</title>
+        <p>By default, PigUnit overrides (drops) all STORE and DUMP commands. You can tell PigUnit to keep these
+          commands and execute the script:</p>
+        <source>
+test = new PigTest(PIG_SCRIPT, args);   
+test.unoverride("STORE");
+test.runScript();
+</source>
+      </section>
+
+      <section>
+        <title>Cache Archive</title>
+        <p>For cache archive to work, your test environment needs to have the cache archive options
+          specified by Java properties or in an additional XML configuration in its CLASSPATH.</p>
+        <p>If you use a local cluster, you need to set the required environment variables before
+          starting it:</p>
+        <source>export LD_LIBRARY_PATH=/home/path/to/lib</source>
+      </section>
+    </section>
+
+    <section>
+      <title>Future Enhancements</title>
+      <p>Improvements and other components based on PigUnit could be built later.</p>
+      <p>For example, we could build a PigTestCase and PigTestSuite on top of PigTest to:</p>
+      <ol>
+        <li>Add the notion of workspaces for each test.</li>
+        <li>Remove the boilerplate code that appears when there is more than one test method.</li>
+        <li>Add a standalone utility that reads test configurations and generates a test report.
+        </li>
+      </ol>
+    </section>
+    </section>
+
+</body>
+</document>

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=1050082&r1=1050081&r2=1050082&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Thu Dec 16 18:10:59 2010
@@ -21,19 +21,23 @@
 
 <document>
 <header>
-<title>Pig UDF Manual</title>
+<title>User Defined Functions</title>
 </header>
 <body>
 
+<!-- ================================================================== -->
+<!-- WRITING UDFS -->
 
 <section>
-<title>Overview</title>
-<p>Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. 
-Functions can be a part of almost every operator in Pig. 
-This document describes how to use existing functions as well as how to write your own functions  
-(see also <a href="piglatin_ref1.html#Pig+Properties">Pig Properties</a>.)</p>
-</section>
+<title>Writing UDFs</title>
+
+<p>
+Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. 
+This section provides the information you need to write your own Java or Python UDFs. 
+The next section describes Piggy Bank, a repository that allows you to access and contribute Java UDFs written by you and other Pig users.
 
+</p>
+<!-- =============================================================== -->
 <section>
 <title>Eval Functions</title>
 
@@ -366,7 +370,7 @@ public class TOKENIZE extends EvalFunc&l
         try {
             DataBag output = mBagFactory.newDefaultBag();
             Object o = input.get(0);
-            if (!(o instanceof String)) {
+            if (!(o instanceof String)) {
                 throw new IOException("Expected input to be chararray, but  got " + o.getClass().getName());
             }
             StringTokenizer tok = new StringTokenizer((String)o, " \",()*", false);
@@ -493,7 +497,7 @@ public class TOKENIZE extends EvalFunc&l
         try {
             DataBag output = mBagFactory.newDefaultBag();
             Object o = input.get(0);
-            if (!(o instanceof String)) {
+            if (!(o instanceof String)) {
                 throw new IOException("Expected input to be chararray, but  got " + o.getClass().getName());
             }
             StringTokenizer tok = new StringTokenizer((String)o, " \",()*", false);
@@ -735,14 +739,15 @@ pig -cp sds.jar -Dudf.import.list=com.ya
 
 </section>
 
+<!-- =============================================================== -->
 <!-- BEGIN LOAD/STORE FUNCTIONS -->
 <section>
 <title> Load/Store Functions</title>
 
 <p>The load/store user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output but that does not have to be the case. </p>
 <p>
-With Pig 0.7.0, the Pig load/store API moves closer to using Hadoop's InputFormat and OutputFormat classes.
-This enables Pig users/developers to create new LoadFunc and StoreFunc implementation based on existing Hadoop InputFormat and OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the InputFormat and likewise on the writing end, the complexity of writing will lie in the OutputFormat. This enables Pig to easily read/write data in new storage formats as and when an Hadoop InputFormat and OutputFormat is available for them. </p>
+The Pig load/store API is aligned with Hadoop's InputFormat and OutputFormat classes.
+This enables you to create new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes with minimal code. The complexity of reading the data and creating a record lies in the InputFormat, while the complexity of writing the data lies in the OutputFormat. This enables Pig to easily read/write data in new storage formats as and when a Hadoop InputFormat and OutputFormat are available for them. </p>
 <p>
 <strong>Note:</strong> Both the LoadFunc and StoreFunc implementations should use the Hadoop 20 API based classes (InputFormat/OutputFormat and related classes) under the <strong>new</strong> org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package. 
 </p>
@@ -840,7 +845,7 @@ public class SimpleTextLoader extends Lo
     public Tuple getNext() throws IOException {
         try {
             boolean notDone = in.nextKeyValue();
-            if (!notDone) {
+            if (!notDone) {
                 return null;
             }
             Text value = (Text) in.getCurrentValue();
@@ -903,6 +908,7 @@ public class SimpleTextLoader extends Lo
 </section>
 <!-- END LOAD FUNCTION -->
 
+<!-- =============================================================== -->
 <section>
 <title> Store Functions</title>
 
@@ -923,7 +929,7 @@ This interface has methods to interact w
 
 <p>The following methods have default implementations in StoreFunc and should be overridden only if necessary: </p>
 <ul>
-<li>setStoreFunc!UDFContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Storer. The signature can be used to store into the UDFContext any information which the Storer needs to store between various method invocations in the front end and back end. The default implementation in StoreFunc has an empty body. This method will be called before other methods. 
+<li>setStoreFuncUDFContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Storer. The signature can be used to store into the UDFContext any information which the Storer needs to store between various method invocations in the front end and back end. The default implementation in StoreFunc has an empty body. This method will be called before other methods. 
 </li>
 <li>relToAbsPathForStoreLocation(): Pig runtime will call this method to allow the Storer to convert a relative store location to an absolute location. An implementation is provided in StoreFunc which handles this for FileSystem based locations. </li>
 <li>checkSchema(): A Store function should implement this function to check that a given schema describing the data to be written is acceptable to it. The default implementation in StoreFunc has an empty body. This method will be called before any calls to setStoreLocation(). </li>
@@ -1137,15 +1143,7 @@ public class SimpleTextStorer extends St
 <!-- END LOAD/STORE FUNCTIONS -->
 
 
-<section>
-<title>Builtin Functions and Function Repositories</title>
-
-<p>Pig comes with a set of builtin functions. Two main properties differentiate builtin functions from UDFs. First, they don't need to be registered because Pig knows where they are. Second, they don't need to be qualified when used because Pig knows where to find them. </p>
-
-<p>Pig also hosts a UDF repository called <code>piggybank</code> that allows users to share UDFs that they have written. The details are described in <a href="http://wiki.apache.org/pig/PiggyBank"> PiggyBank</a>. </p>
-
-</section>
-
+<!-- =============================================================== -->
 <section>
 <title>Accumulator Interface</title>
 
@@ -1227,88 +1225,7 @@ public class IntMax extends EvalFunc&lt;
 
 </section>
 
-
-<section>
-<title>Advanced Topics</title>
-
-<section>
-<title>UDF Interfaces</title>
-<p>A UDF can be invoked multiple ways. The simplest UDF can just extend EvalFunc, which requires only the exec function to be implemented (see <a href="#How+to+Write+a+Simple+Eval+Function"> How to Write a Simple Eval Function</a>). Every eval UDF must implement this. Additionally, if a function is algebraic, it can implement <code>Algebraic</code> interface to significantly improve query performance in the cases when combiner can be used (see <a href="#Aggregate+Functions">Aggregate Functions</a>). Finally, a function that can process tuples in an incremental fashion can also implement the Accumulator interface to improve query memory consumption (see <a href="#Accumulator+Interface">Accumulator Interface</a>).
-</p>
-
-<p>The exact method by which UDF is invoked is selected by the optimizer based on the UDF type and the query. Note that only a single interface is used at any given time. The optimizer tries to find the most efficient way to execute the function. If a combiner is used and the function implements the Algebraic interface then this interface will be used to invoke the function. If the combiner is not invoked but the accumulator can be used and the function implements Accumulator interface then that interface is used. If neither of the conditions is satisfied then the exec function is used to invoke the UDF. 
-</p>
- </section>
- 
- 
-<section>
-<title>Function Instantiation</title>
-
-<p>One problem that users run into is when they make assumption about how many times a constructor for their UDF is called. For instance, they might be creating side files in the store function and doing it in the constructor seems like a good idea. The problem with this approach is that in most cases Pig instantiates functions on the client side to, for instance, examine the schema of the data.  </p>
-<p>Users should not make assumptions about how many times a function is instantiated; instead, they should make their code resilient to multiple instantiations. For instance, they could check if the files exist before creating them. </p>
-
-</section>
-
-<section>
-<title>Schemas</title>
-
-<p>One request from users is to have the ability to examine the input schema of the data before processing the data. For example, they would like to know how to convert an input tuple to a map such that the keys in the map are the names of the input columns. The current answer is that there is no way to do this. This is something we would like to support in the future. </p>
-
- </section>
- 
-
-
-<section>
-<title>Passing Configurations to UDFs</title>
-<p>The singleton UDFContext class provides two features to UDF writers. First, on the backend, it allows UDFs to get access to the JobConf object, by calling getJobConf. This is only available on the backend (at run time) as the JobConf has not yet been constructed on the front end (during planning time).</p>
-
-<p>Second, it allows UDFs to pass configuration information between instantiations of the UDF on the front and backends. UDFs can store information in a configuration object when they are constructed on the front end, or during other front end calls such as describeSchema. They can then read that information on the backend when exec (for EvalFunc) or getNext (for LoadFunc) is called. Note that information will not be passed between instantiations of the function on the backend. The communication channel only works from front end to back end.</p>
-
-<p>To store information, the UDF calls getUDFProperties. This returns a Properties object which the UDF can record the information in or read the information from. To avoid name space conflicts UDFs are required to provide a signature when obtaining a Properties object. This can be done in two ways. The UDF can provide its Class object (via this.getClass()). In this case, every instantiation of the UDF will be given the same Properties object. The UDF can also provide its Class plus an array of Strings. The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF.</p>
-</section>
-
-<section>
-<title>Monitoring long-running UDFs</title>
-<p>Sometimes one may discover that a UDF that executes very quickly in the vast majority of cases turns out to run exceedingly slowly on occasion. This can happen, for example, if a UDF uses complex regular expressions to parse free-form strings, or if a UDF uses some external service to communicate with. As of version 0.8, Pig provides a facility for monitoring the length of time a UDF is executing for every invocation, and terminating its execution if it runs too long. This facility can be turned on using a simple Java annotation:</p>
-	
-<source>
-	import org.apache.pig.builtin.MonitoredUDF;
-	
-	@MonitoredUDF
-	public class MyUDF extends EvalFunc&lt;Integer&gt; {
-	  /* implementation goes here */
-	}
-</source>
-
-<p>Simply annotating your UDF in this way will cause Pig to terminate the UDF's exec() method if it runs for more than 10 seconds, and return the default value of null. The duration of the timeout and the default value can be specified in the annotation, if desired:</p>
-
-<source>
-	import org.apache.pig.builtin.MonitoredUDF;
-	
-	@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 100, intDefault = 10)
-	public class MyUDF extends EvalFunc&lt;Integer&gt; {
-	  /* implementation goes here */
-	}
-</source>
-
-<p>intDefault, longDefault, doubleDefault, floatDefault, and stringDefault can be specified in the annotation; the correct default will be chosen based on the return type of the UDF. Custom defaults for tuples and bags are not supported at this time.</p>
-
-<p>If desired, custom logic can also be implemented for error handling by creating a subclass of MonitoredUDFExecutor.ErrorCallback, and overriding its handleError and/or handleTimeout methods. Both of those methods are static, and are passed in the instance of the EvalFunc that produced an exception, as well as an exception, so you may use any state you have in the UDF to process the errors as desired. The default behavior is to increment Hadoop counters every time an error is encountered. Once you have an implementation of the ErrorCallback that performs your custom logic, you can provide it in the annotation:</p>
-
-<source>
-	import org.apache.pig.builtin.MonitoredUDF;
-
-	@MonitoredUDF(errorCallback=MySpecialErrorCallback.class)
-	public class MyUDF extends EvalFunc&lt;Integer&gt; {
-	  /* implementation goes here */
-	}
-</source>
-
-<p>Currently the MonitoredUDF annotation works with regular and Algebraic UDFs, but has no effect on UDFs that run in the Accumulator mode.</p>
-
-</section>
-</section>
-
+<!-- =============================================================== -->
 <section>
 <title>Python UDFs</title>
 <section>
@@ -1382,7 +1299,7 @@ b = foreach a generate myfuncs.helloworl
 <source>
 mySampleLib.py
 ---------------------
-#!/usr/bin/python
+#!/usr/bin/python
 
 ##################
 # Math functions #
@@ -1431,9 +1348,165 @@ def collectBag(bag):
 # tuple in python are immutable, appending to a tuple is not possible.
 </source>
 </section>
+</section>
+
+<!-- =============================================================== -->
+<section>
+<title>Advanced Topics</title>
+
+<section>
+<title>UDF Interfaces</title>
+<p>A UDF can be invoked multiple ways. The simplest UDF can just extend EvalFunc, which requires only the exec function to be implemented (see <a href="#How+to+Write+a+Simple+Eval+Function"> How to Write a Simple Eval Function</a>). Every eval UDF must implement this. Additionally, if a function is algebraic, it can implement <code>Algebraic</code> interface to significantly improve query performance in the cases when combiner can be used (see <a href="#Aggregate+Functions">Aggregate Functions</a>). Finally, a function that can process tuples in an incremental fashion can also implement the Accumulator interface to improve query memory consumption (see <a href="#Accumulator+Interface">Accumulator Interface</a>).
+</p>
+
+<p>The exact method by which UDF is invoked is selected by the optimizer based on the UDF type and the query. Note that only a single interface is used at any given time. The optimizer tries to find the most efficient way to execute the function. If a combiner is used and the function implements the Algebraic interface then this interface will be used to invoke the function. If the combiner is not invoked but the accumulator can be used and the function implements Accumulator interface then that interface is used. If neither of the conditions is satisfied then the exec function is used to invoke the UDF. 
+</p>
+ </section>
+ 
+ 
+<section>
+<title>Function Instantiation</title>
+<p>One problem that users run into is when they make assumption about how many times a constructor for their UDF is called. For instance, they might be creating side files in the store function and doing it in the constructor seems like a good idea. The problem with this approach is that in most cases Pig instantiates functions on the client side to, for instance, examine the schema of the data.  </p>
+<p>Users should not make assumptions about how many times a function is instantiated; instead, they should make their code resilient to multiple instantiations. For instance, they could check if the files exist before creating them. </p>
+</section>
+
+<section>
+<title>Schemas</title>
+<p>One request from users is to have the ability to examine the input schema of the data before processing the data. For example, they would like to know how to convert an input tuple to a map such that the keys in the map are the names of the input columns. The current answer is that there is no way to do this. This is something we would like to support in the future. </p>
+ </section>
+ 
+<section>
+<title>Passing Configurations to UDFs</title>
+<p>The singleton UDFContext class provides two features to UDF writers. First, on the backend, it allows UDFs to get access to the JobConf object, by calling getJobConf. This is only available on the backend (at run time) as the JobConf has not yet been constructed on the front end (during planning time).</p>
+
+<p>Second, it allows UDFs to pass configuration information between instantiations of the UDF on the front and backends. UDFs can store information in a configuration object when they are constructed on the front end, or during other front end calls such as describeSchema. They can then read that information on the backend when exec (for EvalFunc) or getNext (for LoadFunc) is called. Note that information will not be passed between instantiations of the function on the backend. The communication channel only works from front end to back end.</p>
+
+<p>To store information, the UDF calls getUDFProperties. This returns a Properties object which the UDF can record the information in or read the information from. To avoid name space conflicts UDFs are required to provide a signature when obtaining a Properties object. This can be done in two ways. The UDF can provide its Class object (via this.getClass()). In this case, every instantiation of the UDF will be given the same Properties object. The UDF can also provide its Class plus an array of Strings. The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF.</p>
+</section>
+
+<section>
+<title>Monitoring long-running UDFs</title>
+<p>Sometimes one may discover that a UDF that executes very quickly in the vast majority of cases turns out to run exceedingly slowly on occasion. This can happen, for example, if a UDF uses complex regular expressions to parse free-form strings, or if a UDF uses some external service to communicate with. As of version 0.8, Pig provides a facility for monitoring the length of time a UDF is executing for every invocation, and terminating its execution if it runs too long. This facility can be turned on using a simple Java annotation:</p>
+	
+<source>
+	import org.apache.pig.builtin.MonitoredUDF;
+	
+	@MonitoredUDF
+	public class MyUDF extends EvalFunc&lt;Integer&gt; {
+	  /* implementation goes here */
+	}
+</source>
+
+<p>Simply annotating your UDF in this way will cause Pig to terminate the UDF's exec() method if it runs for more than 10 seconds, and return the default value of null. The duration of the timeout and the default value can be specified in the annotation, if desired:</p>
+
+<source>
+	import org.apache.pig.builtin.MonitoredUDF;
+	
+	@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 100, intDefault = 10)
+	public class MyUDF extends EvalFunc&lt;Integer&gt; {
+	  /* implementation goes here */
+	}
+</source>
+
+<p>intDefault, longDefault, doubleDefault, floatDefault, and stringDefault can be specified in the annotation; the correct default will be chosen based on the return type of the UDF. Custom defaults for tuples and bags are not supported at this time.</p>
+
+<p>If desired, custom logic can also be implemented for error handling by creating a subclass of MonitoredUDFExecutor.ErrorCallback, and overriding its handleError and/or handleTimeout methods. Both of those methods are static, and are passed in the instance of the EvalFunc that produced an exception, as well as an exception, so you may use any state you have in the UDF to process the errors as desired. The default behavior is to increment Hadoop counters every time an error is encountered. Once you have an implementation of the ErrorCallback that performs your custom logic, you can provide it in the annotation:</p>
+
+<source>
+	import org.apache.pig.builtin.MonitoredUDF;
+
+	@MonitoredUDF(errorCallback=MySpecialErrorCallback.class)
+	public class MyUDF extends EvalFunc&lt;Integer&gt; {
+	  /* implementation goes here */
+	}
+</source>
+
+<p>Currently the MonitoredUDF annotation works with regular and Algebraic UDFs, but has no effect on UDFs that run in the Accumulator mode.</p>
+
+</section>
+</section>
+</section>
+
+<!-- ================================================================== -->
+<!-- PIGGYBANK -->
+<section>
+<title>Piggy Bank</title>
+<p>Piggy Bank is a place for Pig users to share the Java UDFs they have written for use with Pig. 
+The functions are contributed "as-is." 
+If you find a bug in a function, take the time to fix it and contribute the fix to Piggy Bank. 
+If you don't find the UDF you need, take the time to write and contribute the function to Piggy Bank.
+</p>
+
+<p><strong>Note:</strong> Piggy Bank currently supports Java UDFs. Support for Python UDFs will be added at a later date.</p>
 
+<section>
+<title>Accessing Functions</title>
+
+<p>The Piggy Bank functions are currently distributed in source form. Users are required to check out the code and build the package themselves. No binary distributions or nightly builds are available at this time. </p>
+
+<p>To build a jar file that contains all available UDFs, follow these steps: </p>
+<ul>
+<li>Checkout UDF code: <code>svn co http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank</code> </li>
+<li>Add pig.jar to your ClassPath: <code>export CLASSPATH=$CLASSPATH:/path/to/pig.jar</code> </li>
+<li>Build the jar file: from directory <code>trunk/contrib/piggybank/java</code> run <code>ant</code>. 
+This will generate <code>piggybank.jar</code> in the same directory. </li>
+</ul>
+<p></p>
+
+<p>Make sure your classpath includes the Hadoop jars as well. For example, the following works with the Cloudera CDH2 / Hadoop AMIs: </p>
+
+<source>
+pig_version=0.4.99.0+10   ; pig_dir=/usr/lib/pig ;
+hadoop_version=0.20.1+152 ; hadoop_dir=/usr/lib/hadoop ;
+export CLASSPATH=$CLASSPATH:${hadoop_dir}/hadoop-${hadoop_version}-core.jar: 
+    ${hadoop_dir}/hadoop-${hadoop_version}-tools.jar:  
+    ${hadoop_dir}/hadoop-${hadoop_version}-ant.jar:
+    ${hadoop_dir}/lib/commons-logging-1.0.4.jar:
+    ${pig_dir}/pig-${pig_version}-core.jar
+</source>
+
+<p>To obtain the <code>javadoc</code> description of the functions, run <code>ant javadoc</code> from directory <code>trunk/contrib/piggybank/java</code>. The documentation is generated in directory <code>trunk/contrib/piggybank/java/build/javadoc</code>.</p>
+
+<p>To use a function, you need to determine which package it belongs to. The top level packages correspond to the function type and currently are: </p>
+<ul>
+<li>org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator </li>
+<li>org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations </li>
+<li>org.apache.pig.piggybank.filtering - for functions used in FILTER operator </li>
+<li>org.apache.pig.piggybank.grouping - for grouping functions</li>
+<li>org.apache.pig.piggybank.storage - for load/store functions </li>
+</ul>
+<p></p>
+
+<p>(The exact package of the function can be seen in the javadocs or by navigating the source tree.) </p>
+
+<p>For example, to use the UPPER function: </p>
+
+<source>
+REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
+TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) 
+    MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
+STORE TweetsInaug INTO 'meta/inaug/tweets_inaug' ;
+</source>
+</section>
+
+<section>
+<title>Contributing Functions</title>
+
+<p>To contribute a Java function that you have written, do the following:</p>
+<ol>
+<li>Check the existing javadoc to make sure that the function does not already exist as described in <a href="#Accessing+Functions">Accessing Functions</a>. </li>
+<li>Checkout the UDF code as described in <a href="#Accessing+Functions">Accessing Functions</a>. </li>
+<li>Place your java code in the directory that makes sense for your function. The directory structure currently has two levels: (1) function type, as described in <a href="#Accessing+Functions">Accessing Functions</a>, and (2) function subtype, for some of the types (like math or string for eval functions). If you think your function requires a new subtype, feel free to add one. </li>
+<li>Make sure that your function is well documented and uses the 
+<a href="http://download.oracle.com/javase/1.4.2/docs/tooldocs/solaris/javadoc.html">javadoc</a> style of documentation. </li>
+<li>Make sure that your code follows Pig coding conventions described in <a href="http://wiki.apache.org/pig/HowToContribute">How to Contribute to Pig</a>.</li>
+<li>Make sure that for each function, you add a corresponding test class in the test part of the tree. </li>
+<li>Submit your patch following the process described in <a href="http://wiki.apache.org/pig/HowToContribute">How to Contribute to Pig</a>. </li>
+</ol>
 </section>
 
+</section> 
+
 </body>
 </document>
 

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/zebra_pig.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/zebra_pig.xml?rev=1050082&r1=1050081&r2=1050082&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/zebra_pig.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/zebra_pig.xml Thu Dec 16 18:10:59 2010
@@ -64,8 +64,8 @@ A: {name: chararray,age: int,gpa: float}
 </source>
    
 <p>You can provide alternative names to the columns with the AS clause. You can also provide alternative types as long as the 
- original type can be converted to the new type. (One exception to this rule are maps since you can't specify schema for a map. Zebra always creates map values as bytearrays which would require casting to real type in the script. Note that this is not different for treating maps in Pig for any other storage.) For more information see <a href="piglatin_ref2.html#Schemas">Schemas</a> and
-<a href="piglatin_ref2.html#Arithmetic+Operators+and+More">Arithmetic Operators and More</a>.
+ original type can be converted to the new type. (One exception to this rule is maps, since you can't specify a schema for a map. Zebra always creates map values as bytearrays, which would require casting to the real type in the script. Note that this is no different from how maps are treated in Pig for any other storage.) For more information see <a href="basic.html#Schemas">Schemas</a> and
+<a href="basic.html#Arithmetic+Operators+and+More">Arithmetic Operators and More</a>.
  </p>
  
 <p>You can provide multiple, comma-separated files to the loader:</p>
@@ -106,7 +106,7 @@ C = FOREACH B GENERATE group, MAX(a.$1);
     <section>
    <title>Sorting Data</title>
    <p>
-   Pig allows you to sort data by ascending (ASC) or descending (DESC) order (for more information, see <a href="piglatin_ref2.html#ORDER">ORDER</a>). Currently, Zebra supports tables that are sorted in ascending order. Zebra does not support tables that are sorted in descending order; if Zebra encounters a table to be stored that is sorted in descending order, Zebra will issue a warning and store the table as an unsorted table.</p>
+   Pig allows you to sort data by ascending (ASC) or descending (DESC) order (for more information, see <a href="basic.html#ORDER+BY">ORDER BY</a>). Currently, Zebra supports tables that are sorted in ascending order. Zebra does not support tables that are sorted in descending order; if Zebra encounters a table to be stored that is sorted in descending order, Zebra will issue a warning and store the table as an unsorted table.</p>
      </section>
      <!--end sorting data-->
      


