pig-commits mailing list archives

From o...@apache.org
Subject svn commit: r1050082 [5/6] - in /pig/trunk: ./ src/docs/src/documentation/content/xdocs/
Date Thu, 16 Dec 2010 18:10:59 GMT
Added: pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml?rev=1050082&view=auto
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml (added)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml Thu Dec 16 18:10:59 2010
@@ -0,0 +1,1008 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
+<document>
+  <header>
+    <title>Performance and Efficiency</title>
+  </header>
+  <body> 
+  
+<!-- ================================================================== -->
+<!-- MEMORY MANAGEMENT -->
+<section>
+<title>Memory Management</title>
+
+<p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
+
+<p>The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 10% of available memory. Note that this memory is shared across all large bags used by the application.</p>
+
+</section> 
+
+
+<!-- ==================================================================== -->
+<!-- MULTI-QUERY EXECUTION-->
+<section>
+<title>Multi-Query Execution</title>
+<p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
+
+<section>
+	<title>Turning it On or Off</title>	
+	<p>Multi-query execution is turned on by default. 
+	To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
+	<p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
+<source>
+$ pig -M myscript.pig
+or
+$ pig -no_multiquery myscript.pig
+</source>
+</section>
+
+<section>
+<title>How it Works</title>
+<p>Multi-query execution introduces some changes:</p>
+
+<ul>
+<li>
+<p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks 
+can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed 
+(see the <a href="test.html#EXPLAIN">EXPLAIN</a> operator and the <a href="cmds.html#run">run</a> and <a href="cmds.html#exec">exec</a> commands). </p>
+
+</li>
+<li>
+<p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>
+</li>
+</ul>
+
+<section>
+	<title>Explicit and Implicit Splits</title>
+<p>There might be cases in which you want different processing on separate parts of the same data stream.</p>
+<p>Example 1:</p>
+<source>
+A = LOAD ...
+...
+SPLIT A' INTO B IF ..., C IF ...
+...
+STORE B' ...
+STORE C' ...
+</source>
+<p>Example 2:</p>
+<source>
+A = LOAD ...
+...
+B = FILTER A' ...
+C = FILTER A' ...
+...
+STORE B' ...
+STORE C' ...
+</source>
+<p>In prior Pig releases, Example 1 dumps A' to disk and then starts jobs for B' and C'. 
+Example 2 executes all the dependencies of B' and stores it, and then executes all the dependencies of C' and stores it. 
+Both produce the same results, but their performance differs. </p>
+<p>Here's what the multi-query execution does to increase the performance: </p>
+	<ul>
+		<li><p>For Example 2, adds an implicit split to transform the query to Example 1. 
+		This eliminates the processing of A' multiple times.</p></li>
+		<li><p>Makes the split non-blocking and allows processing to continue. 
+		This helps reduce the amount of data that has to be stored right at the split.  </p></li>
+		<li><p>Allows multiple outputs from a job. This way some results can be stored as a side-effect of the main job. 
+		This is also necessary to make the previous item work.  </p></li>
+		<li><p>Allows multiple split branches to be carried on to the combiner/reducer. 
+		This reduces the amount of IO again in the case where multiple branches in the split can benefit from a combiner run. </p></li>
+	</ul>
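+<p>As a concrete illustration of Example 2 (a minimal sketch; the file names and the fields x, y, and z are hypothetical), 
+both filters below read the same relation A. With multi-query execution, Pig can insert the implicit split so that 
+A is processed once and both outputs are written as part of the same run:</p>
+<source>
+-- hypothetical input with three integer fields
+A = LOAD 'input' AS (x:int, y:int, z:int);
+-- two branches over the same relation: an implicit split
+B = FILTER A BY x &gt; 5;
+C = FILTER A BY x &lt;= 5;
+STORE B INTO 'output_high';
+STORE C INTO 'output_low';
+</source>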
+</section>
+
+<section>
+	<title>Storing Intermediate Results</title>
+<p>Sometimes it is necessary to store intermediate results. </p>
+
+<source>
+A = LOAD ...
+...
+STORE A'
+...
+STORE A''
+</source>
+
+<p>If the script doesn't re-load A' for the processing of A the steps above A' will be duplicated. 
+This is a special case of Example 2 above, so the same steps are recommended. 
+With multi-query execution, the script will process A and dump A' as a side-effect.</p>
+</section>
+</section>
+
+
+<!-- ++++++++++++++++++++++++++++++++++++++++++ -->
+<section>
+	<title>Store vs. Dump</title>
+	<p>With multi-query execution, you want to use <a href="basic.html#STORE">STORE</a> to save (persist) your results. 
+	You do not want to use <a href="test.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
+	
+	<p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A > B > DUMP while the second job will execute A > B > C > STORE.</p>
+	
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+DUMP B;
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output';
+</source>
+	
+	<p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
+	
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+STORE B INTO 'output1';
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output2';	
+</source>
+
+</section>
+<section>
+	<title>Error Handling</title>
+	<p>With multi-query execution Pig processes an entire script or a batch of statements at once. 
+	By default Pig tries to run all the jobs that result from that, regardless of whether some jobs fail during execution. 
+	To check which jobs have succeeded or failed use one of these options. </p>
+	
+	<p>First, Pig logs all successful and failed store commands. Store commands are identified by output path. 
+	At the end of execution a summary line indicates success, partial failure or failure of all store commands. </p>	
+	
+	<p>Second, Pig returns a different code upon completion for each of these scenarios:</p>
+	<ul>
+		<li><p>Return code 0: All jobs succeeded</p></li>
+		<li><p>Return code 1: <em>Used for retrievable errors</em> </p></li>
+		<li><p>Return code 2: All jobs have failed </p></li>
+		<li><p>Return code 3: Some jobs have failed  </p></li>
+	</ul>
+	<p></p>
+	<p>In some cases it might be desirable to fail the entire script upon detecting the first failed job. 
+	This can be achieved with the "-F" or "-stop_on_failure" command line flag. 
+	If used, Pig will stop execution when the first failed job is detected and discontinue further processing. 
+	This also means that file commands that come after a failed store in the script will not be executed (this can be used to create "done" files). </p>
+	
+	<p>This is how the flag is used: </p>
+<source>
+$ pig -F myscript.pig
+or
+$ pig -stop_on_failure myscript.pig
+</source>
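+<p>The "done" file idea can be sketched as follows (a minimal illustration; the paths are hypothetical, and it assumes 
+the script is run with -F and that the fs command is available in your Pig release). Because the file command comes after the 
+store, it is not executed if the store fails, so the marker file only appears after a successful store:</p>
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x &gt; 5;
+STORE B INTO 'out1';
+-- runs only if the store above succeeded (when Pig is started with -F)
+fs -touchz out1_done
+</source>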
+</section>
+
+<section>
+	<title>Backward Compatibility</title>
+	
+	<p>Most existing Pig scripts will produce the same result with or without the multi-query execution. 
+	There are cases though where this is not true. Path names and schemes are discussed here.</p>
+	
+	<p>Any script is parsed in its entirety before it is sent to execution. Since the current directory can change 
+	throughout the script, any path used in a LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
+		
+	<p>In map-reduce mode, the following script will load from "hdfs://&lt;host&gt;:&lt;port&gt;/data1" and store into "hdfs://&lt;host&gt;:&lt;port&gt;/tmp/out1". </p>
+<source>
+cd /;
+A = LOAD 'data1';
+cd tmp;
+STORE A INTO 'out1';
+</source>
+
+	<p>These expanded paths will be passed to any LoadFunc or Slicer implementation. 
+	In some cases this can cause problems, especially when a LoadFunc/Slicer is not used to read from a dfs file or path 
+	(for example, loading from an SQL database). </p>
+	
+	<p>Solutions are to either: </p>
+	<ul>
+		<li><p>Specify "-M" or "-no_multiquery" to revert to the old names</p></li>
+		<li><p>Specify a custom scheme for the LoadFunc/Slicer </p></li>
+	</ul>	
+	
+	<p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded and will be passed to the LoadFunc/Slicer unchanged.</p>
+	<p>In the SQL case, the SQLLoader function is invoked with 'sql://mytable'. </p>
+
+<source>
+A = LOAD 'sql://mytable' USING SQLLoader();
+</source>
+</section>
+
+<section>
+	<title>Implicit Dependencies</title>
+<p>If a script has dependencies on the execution order outside of what Pig knows about, execution may fail. </p>
+
+
+<section>
+	<title>Example</title>
+<p>In this script, MYUDF might try to read from out1, a file that A was just stored into. 
+However, Pig does not know that MYUDF depends on the out1 file and might submit the jobs 
+producing the out2 and out1 files at the same time.</p>
+<source>
+...
+STORE A INTO 'out1';
+B = LOAD 'data2';
+C = FOREACH B GENERATE MYUDF($0,'out1');
+STORE C INTO 'out2';
+</source>
+
+<p>To make the script work (to ensure that the right execution order is enforced) add the exec statement. 
+The exec statement will trigger the execution of the statements that produce the out1 file. </p>
+
+<source>
+...
+STORE A INTO 'out1';
+EXEC;
+B = LOAD 'data2';
+C = FOREACH B GENERATE MYUDF($0,'out1');
+STORE C INTO 'out2';
+</source>
+</section>
+
+<section>
+	<title>Example</title>
+<p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
+<source>
+A = LOAD '/user/xxx/firstinput' USING PigStorage();
+B = group ....
+C = .... aggregation function
+STORE C INTO '/user/vxj/firstinputtempresult/days1';
+..
+Atab = LOAD '/user/xxx/secondinput' USING  PigStorage();
+Btab = group ....
+Ctab = .... aggregation function
+STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
+..
+E = LOAD '/user/vxj/firstinputtempresult/' USING  PigStorage();
+F = group ....
+G = .... aggregation function
+STORE G INTO '/user/vxj/finalresult1';
+
+Etab =LOAD '/user/vxj/secondinputtempresult/' USING  PigStorage();
+Ftab = group ....
+Gtab = .... aggregation function
+STORE Gtab INTO '/user/vxj/finalresult2';
+</source>
+
+<p>To make the script work, add the exec statement.  </p>
+
+<source>
+A = LOAD '/user/xxx/firstinput' USING PigStorage();
+B = group ....
+C = .... aggregation function
+STORE C INTO '/user/vxj/firstinputtempresult/days1';
+..
+Atab = LOAD '/user/xxx/secondinput' USING  PigStorage();
+Btab = group ....
+Ctab = .... aggregation function
+STORE Ctab INTO '/user/vxj/secondinputtempresult/days1';
+
+EXEC;
+
+E = LOAD '/user/vxj/firstinputtempresult/' USING  PigStorage();
+F = group ....
+G = .... aggregation function
+STORE G INTO '/user/vxj/finalresult1';
+..
+Etab =LOAD '/user/vxj/secondinputtempresult/' USING  PigStorage();
+Ftab = group ....
+Gtab = .... aggregation function
+STORE Gtab INTO '/user/vxj/finalresult2';
+</source>
+</section>
+</section>
+</section>
+
+
+<!-- ==================================================================== -->
+ <!-- OPTIMIZATION RULES -->
+<section>
+<title>Optimization Rules</title>
+<p>Pig supports various optimization rules. By default, optimization and all optimization rules are turned on. 
+To turn off optimization, use:</p>
+
+<source>
+pig -optimizer_off [opt_rule | all ]
+</source>
+
+<p>Note that some rules are mandatory and cannot be turned off.</p>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>ImplicitSplitInserter</title>
+<p>Status: Mandatory</p>
+<p>
+<a href="basic.html#SPLIT">SPLIT</a> is the only operator that models multiple outputs in Pig. 
+To ease the process of building logical plans, all operators are allowed to have multiple outputs. As part of the 
+optimization, all non-split operators that have multiple outputs are altered to have a SPLIT operator as the output 
+and the outputs of the operator are then made outputs of the SPLIT operator. An example will illustrate the point. 
+Here, a split will be inserted after the LOAD and the split outputs will be connected to the FILTER (B) and the COGROUP (C).
+</p>
+<source>
+A = LOAD 'input';
+B = FILTER A BY $1 == 1;
+C = COGROUP A BY $0, B BY $0;
+</source>
+</section>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>LogicalExpressionSimplifier</title>
+<p>This rule contains several types of simplifications.</p>
+
+<source>
+1) Constant pre-calculation 
+
+B = FILTER A BY a0 &gt; 5+7; 
+is simplified to 
+B = FILTER A BY a0 &gt; 12; 
+
+2) Elimination of negations 
+
+B = FILTER A BY NOT (NOT(a0 &gt; 5) OR a &gt; 10); 
+is simplified to 
+B = FILTER A BY a0 &gt; 5 AND a &lt;= 10; 
+
+3) Elimination of logical implied expression in AND 
+
+B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 7); 
+is simplified to 
+B = FILTER A BY a0 &gt; 7; 
+
+4) Elimination of logical implied expression in OR 
+
+B = FILTER A BY ((a0 &gt; 5) OR (a0 &gt; 6 AND a1 &gt; 15)); 
+is simplified to 
+B = FILTER A BY a0 &gt; 5; 
+
+5) Equivalence elimination 
+
+B = FILTER A BY (a0 &gt; 5 AND a0 &gt; 5); 
+is simplified to 
+B = FILTER A BY a0 &gt; 5; 
+
+6) Elimination of complementary expressions in OR 
+
+B = FILTER A BY (a0 &gt; 5 OR a0 &lt;= 5); 
+is simplified to non-filtering 
+
+7) Elimination of naive TRUE expression 
+
+B = FILTER A BY 1==1; 
+is simplified to non-filtering 
+</source>
+</section>
+
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>MergeForEach</title>
+<p>The objective of this rule is to merge together two foreach statements, if these preconditions are met:</p>
+<ul>
+	<li>The foreach statements are consecutive. </li>
+	<li>The first foreach statement does not contain flatten. </li>
+	<li>The second foreach is not nested. </li>
+</ul>
+<source>
+-- Original code: 
+
+A = LOAD 'file.txt' AS (a, b, c); 
+B = FOREACH A GENERATE a+b AS u, c-b AS v; 
+C = FOREACH B GENERATE $0+5, v; 
+
+-- Optimized code: 
+
+A = LOAD 'file.txt' AS (a, b, c); 
+C = FOREACH A GENERATE a+b+5, c-b; 
+</source>
+</section>
+
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>OpLimitOptimizer</title>
+<p>
+The objective of this rule is to push the <a href="basic.html#LIMIT">LIMIT</a> operator up the data flow graph 
+(or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.
+</p>
+<source>
+A = LOAD 'input';
+B = ORDER A BY $0;
+C = LIMIT B 10;
+</source>
+</section>
+
+<section>
+<title>PushUpFilters</title>
+<p>
+The objective of this rule is to push the <a href="basic.html#FILTER">FILTER</a> operators up the data flow graph. 
+As a result, the number of records that flow through the pipeline is reduced. 
+</p>
+<source>
+A = LOAD 'input';
+B = GROUP A BY $0;
+C = FILTER B BY $0 &lt; 10;
+</source>
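+<p>Conceptually, the optimizer treats the script above as if the filter had been written before the grouping 
+(a sketch of the effect, not the optimizer's literal output; the alias A1 is introduced here only for illustration). 
+Because the filter condition refers only to the group key, it can be applied to A directly:</p>
+<source>
+A = LOAD 'input';
+-- filter on the grouping key before the group, so fewer records reach the reducers
+A1 = FILTER A BY $0 &lt; 10;
+B = GROUP A1 BY $0;
+</source>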
+</section>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>PushDownExplodes</title>
+<p>
+The objective of this rule is to reduce the number of records that flow through the pipeline by moving 
+<a href="basic.html#FOREACH">FOREACH</a> operators with a 
+<a href="basic.html#Flatten+Operator">FLATTEN</a> down the data flow graph. 
+In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.
+</p>
+<source>
+A = LOAD 'input' AS (a, b, c);
+B = LOAD 'input2' AS (x, y, z);
+C = FOREACH A GENERATE FLATTEN($0), b, c;
+D = JOIN C BY $1, B BY $1;
+</source>
+</section>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>StreamOptimizer</title>
+<p>
+Optimize when <a href="basic.html#LOAD">LOAD</a> precedes <a href="basic.html#STREAM">STREAM</a> 
+and the loader class is the same as the serializer for the stream. Similarly, optimize when STREAM is followed by 
+<a href="basic.html#STORE">STORE</a> and the deserializer class is same as the storage class. 
+For both of these cases the optimization is to replace the loader/serializer with BinaryStorage which just moves bytes 
+around and to replace the storer/deserializer with BinaryStorage.
+</p>
+</section>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section>
+<title>TypeCastInserter</title>
+<p>Status: Mandatory</p>
+<p>
+If you specify a <a href="basic.html#Schemas">schema</a> with the 
+<a href="basic.html#LOAD">LOAD</a> statement, the optimizer will perform a pre-fix projection of the columns 
+and <a href="basic.html#Cast+Operators">cast</a> the columns to the appropriate types. An example will illustrate the point. 
+The LOAD statement (a) has a schema associated with it. The optimizer will insert a FOREACH operator that will project columns 0, 1 and 2 
+and also cast them to chararray, int and float respectively. 
+</p>
+<source>
+A = LOAD 'input' AS (name: chararray, age: int, gpa: float);
+B = FILTER A BY $1 == 1;
+C = GROUP A BY $0;
+</source>
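+<p>The effect is roughly what you would get by writing the projection and casts by hand (a sketch for illustration only; 
+the alias A0 is hypothetical and the optimizer's internal plan is not expressed as Pig Latin):</p>
+<source>
+A0 = LOAD 'input';
+-- project columns 0, 1 and 2 and cast them to the declared types
+A = FOREACH A0 GENERATE (chararray)$0 AS name, (int)$1 AS age, (float)$2 AS gpa;
+</source>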
+</section>
+</section>
+
+  
+<!-- ==================================================================== -->
+<!-- PERFORMANCE ENHANCERS-->
+<section>
+<title>Performance Enhancers</title>
+
+<section>
+<title>Use Optimization</title>
+<p>Pig supports various <a href="perf.html#Optimization+Rules">optimization rules</a> which are turned on by default. 
+Become familiar with these rules.</p>
+</section>
+
+<section>
+<title>Use Types</title>
+
+<p>If types are not specified in the load statement, Pig assumes the type of double for numeric computations. 
+Often your data fits in a smaller type, such as integer or long. Specifying the real type will help with 
+speed of arithmetic computation. It has an additional advantage of early error detection. </p>
+
+<source>
+--Query 1
+A = load 'myfile' as (t, u, v);
+B = foreach A generate t + u;
+
+--Query 2
+A = load 'myfile' as (t: int, u: int, v);
+B = foreach A generate t + u;
+</source>
+
+<p>The second query will run more efficiently than the first. In some of our queries we see a 2x speedup. </p>
+</section>
+
+<section>
+<title>Project Early and Often </title>
+
+<p>Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like: </p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+C = join A by t, B by x;
+D = group C by u;
+E = foreach D generate group, COUNT($1);
+</source>
+
+<p>There is no need for v, y, or z to participate in this query. There is also no need to carry both t and x past the join; just one will suffice. Changing the query above to the query below will greatly reduce the amount of data being carried through the map and reduce phases by Pig. </p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+A1 = foreach A generate t, u;
+B = load 'myotherfile' as (x, y, z);
+B1 = foreach B generate x;
+C = join A1 by t, B1 by x;
+C1 = foreach C generate t, u;
+D = group C1 by u;
+E = foreach D generate group, COUNT($1);
+</source>
+
+<p>Depending on your data, this can produce significant time savings. In queries similar to the example shown here we have seen total time drop by 50%.</p>
+</section>
+
+<section>
+<title>Filter Early and Often</title>
+
+<p>As with early projection, in most cases it is beneficial to apply filters as early as possible to reduce the amount of data flowing through the pipeline. </p>
+
+<source>
+-- Query 1
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+C = filter A by t == 1;
+D = join C by t, B by x;
+E = group D by u;
+F = foreach E generate group, COUNT($1);
+
+-- Query 2
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+C = join A by t, B by x;
+D = group C by u;
+E = foreach D generate group, COUNT($1);
+F = filter E by C.t == 1;
+</source>
+
+<p>The first query is clearly more efficient than the second one because it reduces the amount of data going into the join. </p>
+
+<p>One case where pushing filters up might not be a good idea is if the cost of applying filter is very high and only a small amount of data is filtered out. </p>
+
+</section>
+
+<section>
+<title>Reduce Your Operator Pipeline</title>
+
+<p>For clarity of your script, you might choose to split your projections into several steps, for instance: </p>
+
+<source>
+A = load 'data' as (in: map[]);
+-- get key out of the map
+B = foreach A generate in#'k1' as k1, in#'k2' as k2;
+-- concatenate the keys
+C = foreach B generate CONCAT(k1, k2);
+.......
+</source>
+<p>While the example above is easier to read, you might want to consider combining the two foreach statements to improve your query performance: </p>
+
+<source>
+A = load 'data' as (in: map[]);
+-- concatenate the keys from the map
+B = foreach A generate CONCAT(in#'k1', in#'k2');
+....
+</source>
+
+<p>The same goes for filters. </p>
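+<p>For example (a sketch with hypothetical field names), two consecutive filters can be written as a single filter 
+with a combined condition:</p>
+<source>
+A = load 'data' as (x: int, y: int);
+-- instead of: B = filter A by x &gt; 0;  C = filter B by y &gt; 0;
+C = filter A by x &gt; 0 and y &gt; 0;
+</source>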
+
+</section>
+
+<section>
+<title>Make Your UDFs Algebraic</title>
+
+<p>Queries that can take advantage of the combiner generally run much faster (sometimes several times faster) than the versions that don't. The latest code significantly improves combiner usage; however, you need to make sure you do your part. If you have a UDF that works on grouped data and is, by nature, algebraic (meaning its computation can be decomposed into multiple steps), make sure you implement it as such. For details on how to write algebraic UDFs, see the Pig UDF Manual and <a href="udf.html#Aggregate+Functions">Aggregate Functions</a>.</p>
+
+<source>
+A = load 'data' as (x, y, z);
+B = group A by x;
+C = foreach B generate group, MyUDF(A);
+....
+</source>
+
+<p>If <code>MyUDF</code> is algebraic, the query will use the combiner and run much faster. You can run the <code>explain</code> command on your query to make sure that the combiner is used. </p>
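+<p>As a quick check (a sketch; the built-in COUNT function stands in here for an algebraic UDF), run explain on the final alias 
+and look at the MapReduce plan in the output to see whether a combine stage is present:</p>
+<source>
+A = load 'data' as (x, y, z);
+B = group A by x;
+-- COUNT is a built-in algebraic function
+C = foreach B generate group, COUNT(A);
+explain C;
+</source>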
+</section>
+
+<section>
+<title>Implement the Accumulator Interface</title>
+<p>
+If your UDF can't be made Algebraic but is able to deal with getting input in chunks rather than all at once, consider implementing the Accumulator interface to reduce the amount of memory used by your script. If your function <em>is</em> Algebraic and can be used in conjunction with Accumulator functions, you will need to implement the Accumulator interface as well as the Algebraic interface. For more information, see the Pig UDF Manual and <a href="udf.html#Accumulator+Interface">Accumulator Interface</a>.
+</p>
+</section>
+
+
+<section>
+<title>Drop Nulls Before a Join</title>
+<p>With the introduction of nulls, join and cogroup semantics were altered to work with nulls. The semantic for cogrouping with nulls is that nulls from a given input are grouped together, but nulls across inputs are not grouped together. This preserves the semantics of grouping (nulls are collected together from a single input to be passed to aggregate functions like COUNT) and the semantics of join (nulls are not joined across inputs). Since flattening an empty bag results in an empty row (and no output), in a standard join the rows with a null key will always be dropped. </p>
+
+<p>This join</p>
+<source>
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+C = join A by t, B by x;
+</source>
+
+<p>is rewritten by Pig to </p>
+<source>
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+C1 = cogroup A by t INNER, B by x INNER;
+C = foreach C1 generate flatten(A), flatten(B);
+</source>
+
+<p>Since the nulls from A and B won't be collected together, when the nulls are flattened we're guaranteed to have an empty bag, which will result in no output. So the null keys will be dropped. But they will not be dropped until the last possible moment. </p> 
+
+<p>If the query is rewritten to </p>
+<source>
+A = load 'myfile' as (t, u, v);
+B = load 'myotherfile' as (x, y, z);
+A1 = filter A by t is not null;
+B1 = filter B by x is not null;
+C = join A1 by t, B1 by x;
+</source>
+
+<p>then the nulls will be dropped before the join.  Since all null keys go to a single reducer, if your key is null even a small percentage of the time the gain can be significant.  In one test where the key was null 7% of the time and the data was spread across 200 reducers, we saw about a 10x speedup in the query by adding the early filters. </p>
+
+</section>
+
+<section>
+<title>Take Advantage of Join Optimizations</title>
+<p><strong>Regular Join Optimizations</strong></p>
+<p>Optimization for regular joins ensures that the last table in the join is not brought into memory but streamed through instead. Optimization reduces the amount of memory used which means you can avoid spilling the data and also should be able to scale your query to larger data volumes. </p>
+<p>To take advantage of this optimization, make sure that the table with the largest number of tuples per key is the last table in your query. 
+In some of our tests we saw 10x performance improvement as the result of this optimization.</p>
+<source>
+small = load 'small_file' as (t, u, v);
+large = load 'large_file' as (x, y, z);
+C = join small by t, large by x;
+</source>
+
+<p><strong>Specialized Join Optimizations</strong></p>
+<p>Optimization can also be achieved using fragment replicate joins, skewed joins, and merge joins. 
+For more information see <a href="perf.html#Specialized+Joins">Specialized Joins</a>.</p>
+
+</section>
+
+
+<section>
+<title>Use the Parallel Features</title>
+
+<p>You can set the number of reduce tasks for the MapReduce jobs generated by Pig using two parallel features. 
+(The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.)</p>
+
+<p><strong>You Set the Number of Reducers</strong></p>
+<p>Use the <a href="cmds.html#set">set default parallel</a> command to set the number of reducers at the script level.</p>
+
+<p>Alternatively, use the PARALLEL clause to set the number of reducers at the operator level. 
+(In a script, the value set via the PARALLEL clause will override any value set via "set default parallel.")
+You can include the PARALLEL clause with any operator that starts a reduce phase:  
+<a href="basic.html#COGROUP">COGROUP</a>, 
+<a href="basic.html#CROSS">CROSS</a>, 
+<a href="basic.html#DISTINCT">DISTINCT</a>, 
+<a href="basic.html#GROUP">GROUP</a>, 
+<a href="basic.html#JOIN+%28inner%29">JOIN (inner)</a>, 
+<a href="basic.html#JOIN+%28outer%29">JOIN (outer)</a>, and
+<a href="basic.html#ORDER+BY">ORDER BY</a>.
+</p>
+
+<p>The number of reducers you need for a particular construct in Pig that forms a MapReduce boundary depends entirely on (1) your data and the number of intermediate keys you are generating in your mappers and (2) the partitioner and distribution of map (combiner) output keys. In the best cases we have seen that a reducer processing about 1 GB of data behaves efficiently.</p>
+
+<p><strong>Let Pig Set the Number of Reducers</strong></p>
+<p>If neither "set default parallel" nor the PARALLEL clause is used, Pig sets the number of reducers using a heuristic based on the size of the input data. You can set the values for these properties:</p>
+<ul>
+	<li>pig.exec.reducers.bytes.per.reducer - Defines the number of input bytes per reducer; default value is 1000*1000*1000 (1GB).</li>
+	<li>pig.exec.reducers.max - Defines the upper bound on the number of reducers; default is 999. </li>
+</ul>
+<p></p>
+
+<p>The formula, shown below, is very simple and will improve over time. The computed value takes all inputs within the script into account and is applied to all the jobs within the Pig script.</p>
+
+<p><code>#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per reducer) </code></p>
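+<p>As a worked example using the default property values listed above (the input sizes are hypothetical):</p>
+<source>
+-- 50 GB of input:  50,000,000,000 / 1,000,000,000 = 50
+#reducers = MIN (999, 50) = 50
+
+-- 2 TB of input:   2,000,000,000,000 / 1,000,000,000 = 2,000
+#reducers = MIN (999, 2000) = 999
+</source>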
+
+<p><strong>Examples</strong></p>
+<p>In this example PARALLEL is used with the GROUP operator. </p>
+<source>
+A = LOAD 'myfile' AS (t, u, v);
+B = GROUP A BY t PARALLEL 18;
+...
+</source>
+
+<p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
+<source>
+SET default_parallel 20;
+A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
+B = GROUP A BY t;
+C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
+D = ORDER C BY mycount;
+STORE D INTO 'mysortedcount' USING PigStorage();
+</source>
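+<p>The two settings can also be combined (a sketch based on the examples above): the PARALLEL clause applies to the 
+operator it is attached to, while the remaining reduce phases use the script-level default.</p>
+<source>
+SET default_parallel 20;
+A = LOAD 'myfile' AS (t, u, v);
+-- PARALLEL overrides the default: this GROUP uses 18 reducers
+B = GROUP A BY t PARALLEL 18;
+C = FOREACH B GENERATE group, COUNT(A) AS mycount;
+-- no PARALLEL clause: this ORDER BY uses the default of 20
+D = ORDER C BY mycount;
+STORE D INTO 'mysortedcount';
+</source>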
+</section>
+
+
+<section>
+<title>Use the LIMIT Operator</title>
+<p>Often you are not interested in the entire output but rather a sample or the top results. In such cases, using LIMIT can yield much better performance because Pig pushes the limit as high as possible to minimize the amount of data travelling through the pipeline. </p>
+<p>Sample: 
+</p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+B = limit A 500;
+</source>
+
+<p>Top results: </p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+B = order A by t;
+C = limit B 500;
+</source>
+
+</section>
+
+<section>
+<title>Prefer DISTINCT over GROUP BY/GENERATE</title>
+
+<p>To extract unique values from a column in a relation you can use DISTINCT or GROUP BY/GENERATE. DISTINCT is the preferred method; it is faster and more efficient.</p>
+
+<p>Example using GROUP BY - GENERATE:</p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+B = foreach A generate u;
+C = group B by u;
+D = foreach C generate group as uniquekey;
+dump D; 
+</source>
+
+<p>Example using DISTINCT:</p>
+
+<source>
+A = load 'myfile' as (t, u, v);
+B = foreach A generate u;
+C = distinct B;
+dump C; 
+</source>
+</section>
+
+<section>
+<title>Compress the Results of Intermediate Jobs</title>
+<p>If your Pig script generates a sequence of MapReduce jobs, you can compress the output of the intermediate jobs using LZO compression. (Use the <a href="test.html#EXPLAIN">EXPLAIN</a> operator to determine if your script produces multiple MapReduce Jobs.)</p>
+
+<p>By doing this, you will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data that is generated, the more benefits in storage and speed that result.</p>
+
+<p>You can set the value for these properties:</p>
+<ul>
+	<li>pig.tmpfilecompression - Determines if the temporary files should be compressed or not (set to false by default).</li>
+	<li>pig.tmpfilecompression.codec - Specifies which compression codec to use. Currently, Pig accepts "gz" and "lzo" as possible values. However, because LZO is under GPL license (and disabled by default) you will need to configure your cluster to use the LZO codec to take advantage of this feature. For details, see http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ.</li>
+</ul>
+<p></p>
+
+<p>On the non-trivial queries (ones that ran longer than a couple of minutes) we saw significant improvements both in terms of query latency and space usage. For some queries we saw up to 96% disk saving and up to 4x query speedup. Of course, the performance characteristics are very much query and data dependent and testing needs to be done to determine gains. We did not see any slowdown in the tests we performed, which means that you are at least saving on space while using compression.</p>
+
+<p>With gzip we saw a better compression (96-99%) but at a cost of 4% slowdown. Thus, we don't recommend using gzip. </p>
+
+<p><strong>Example</strong></p>
+<source>
+# launch a Pig script using LZO compression
+
+java -cp $PIG_HOME/pig.jar \
+  -Djava.library.path=&lt;path to the lzo library&gt; \
+  -Dpig.tmpfilecompression=true \
+  -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main myscript.pig
+</source>
+</section>
+
+<section>
+<title>Combine Small Input Files</title>
+<p>Processing input (either user input or intermediate input) from multiple small files can be inefficient because a separate map has to be created for each file. Pig can now combine small files so that they are processed by a single map.</p>
+
+<p>You can set the values for these properties:</p>
+
+<ul>
+<li>pig.maxCombinedSplitSize - Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached. </li>
+<li>pig.splitCombination - Turns split combination on or off (set to "true" by default).</li>
+</ul>
+<p></p>
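+<p>As an illustrative sketch (not a tuned recommendation), both properties can be passed as Java system properties, in the same way the compression properties are passed elsewhere in this document. The 128 MB limit (134217728 bytes) and the script name myscript.pig are assumptions chosen for the example.</p>
+<source>
+# combine small input files into splits of up to 128 MB (128 * 1024 * 1024 bytes)
+
+java -cp $PIG_HOME/pig.jar \
+  -Dpig.splitCombination=true \
+  -Dpig.maxCombinedSplitSize=134217728 org.apache.pig.Main myscript.pig
+</source>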
+
+<p>This feature works with <a href="func.html#PigStorage">PigStorage</a>. However, if you are using a custom loader, please note the following:</p>
+
+<ul>
+<li>If your loader implementation makes use of the PigSplit object passed through the prepareToRead method, then you may need to rebuild the loader since the definition of PigSplit has been modified. </li>
+<li>The loader must be stateless across invocations of the prepareToRead method. That is, the method should reset any internal state that is not affected by the RecordReader argument.</li>
+<li>If a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits are not combined.</li>
+</ul>
+<p></p>
+</section>
+</section>
+  
+<!-- ==================================================================== -->
+<!-- SPECIALIZED JOINS-->
+  <section>
+   <title>Specialized Joins</title>
+<!-- FRAGMENT REPLICATE JOINS-->
+<section>
+<title>Replicated Joins</title>
+<p>Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. 
+In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side. In this type of join the 
+large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they 
+don't, the process fails and an error is generated.</p>
+ 
+<section>
+<title>Usage</title>
+<p>Perform a replicated join with the USING clause (see <a href="basic.html#JOIN+%28inner%29">inner joins</a> and <a href="basic.html#JOIN+%28outer%29">outer joins</a>).
+In this example, a large relation is joined with two smaller relations. Note that the large relation comes first, followed by the smaller relations, 
+and that all small relations together must fit into main memory; otherwise an error is generated. </p>
+<source>
+big = LOAD 'big_data' AS (b1,b2,b3);
+
+tiny = LOAD 'tiny_data' AS (t1,t2,t3);
+
+mini = LOAD 'mini_data' AS (m1,m2,m3);
+
+C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
+</source>
+</section>
+
+<section>
+<title>Conditions</title>
+<p>Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit 
+into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 MB can be used if the process overall 
+gets 1 GB of memory. Please share your observations and experience with us.</p>
+</section>
+</section>
+<!-- END FRAGMENT REPLICATE JOINS-->
+
+
+<!-- SKEWED JOINS-->
+<section>
+<title>Skewed Joins</title>
+
+<p>
+Parallel joins are vulnerable to the presence of skew in the underlying data. 
+If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains. 
+In order to counteract this problem, skewed join computes a histogram of the key space and uses this 
+data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. 
+It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is 
+sampled to create the histogram.
+</p>
+
+<p>
+Skewed join can be used when the underlying data is sufficiently skewed and you need finer 
+control over the allocation of reducers to counteract the skew. It should also be used when the data 
+associated with a given key is too large to fit in memory.
+</p>
+
+<section>
+<title>Usage</title>
+<p>Perform a skewed join with the USING clause (see <a href="basic.html#JOIN+%28inner%29">inner joins</a> and <a href="basic.html#JOIN+%28outer%29">outer joins</a>). </p>
+<source>
+big = LOAD 'big_data' AS (b1,b2,b3);
+massive = LOAD 'massive_data' AS (m1,m2,m3);
+C = JOIN big BY b1, massive BY m1 USING 'skewed';
+</source>
+</section>
+
+<section>
+<title>Conditions</title>
+<p>
+Skewed join will only work under these conditions: 
+</p>
+<ul>
+<li>Skewed join works only with two-table inner joins. Joins of more than two tables are not currently supported. 
+Specifying a three-way (or larger) join will fail validation. You need to break such joins up into two-way joins.</li>
+<li>The pig.skewedjoin.reduce.memusage Java parameter specifies the fraction of heap available for the 
+reducer to perform the join. A low fraction forces Pig to use more reducers but increases 
+copying cost. We have seen good performance when we set this value 
+in the range 0.1 - 0.4. However, this range is only a rough guide. The best value 
+depends on the amount of heap available for the operation, the number of columns 
+in the input, and the skew. An appropriate value is best found by experiment. 
+The default value is 0.5. </li>
+<li>Skewed join does not address (balance) uneven data distribution across reducers. 
+However, in most cases, skewed join ensures that the join will finish, even if slowly, rather than fail.
+</li>
+</ul>
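+<p>For example, a three-way join can be expressed as two consecutive two-way skewed joins. This is a minimal sketch; the relation names, file names, and field names are hypothetical.</p>
+<source>
+A = LOAD 'a_data' AS (a1,a2);
+B = LOAD 'b_data' AS (b1,b2);
+C = LOAD 'c_data' AS (c1,c2);
+
+-- first two-way skewed join
+AB = JOIN A BY a1, B BY b1 USING 'skewed';
+
+-- second two-way skewed join on the result of the first
+ABC = JOIN AB BY A::a1, C BY c1 USING 'skewed';
+</source>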
+</section>
+</section><!-- END SKEWED JOINS-->
+
+
+<!-- MERGE JOIN-->
+<section>
+<title>Merge Joins</title>
+
+<p>
+Often user data is stored such that both inputs are already sorted on the join key. 
+In this case, it is possible to join the data in the map phase of a MapReduce job. 
+This provides a significant performance improvement compared to passing all of the data through 
+unneeded sort and shuffle phases. 
+</p>
+
+<p>
+Pig has implemented a merge join algorithm, or sort-merge join, although in this case the sort is 
+assumed to have already been done (see the Conditions, below). 
+
+Pig implements the merge join algorithm by selecting the left input of the join to be the input file for the map phase, 
+and the right input of the join to be the side file. It then samples records from the right input to build an
+ index that contains, for each sampled record, the key(s), the filename, and the offset into the file at which the record 
+ begins. This sampling is done in the first MapReduce job. A second MapReduce job is then initiated, 
+ with the left input as its input. Each map uses the index to seek to the appropriate record in the right 
+ input and begin doing the join. 
+</p>
+
+<section>
+<title>Usage</title>
+<p>Perform a merge join with the USING clause (see <a href="basic.html#JOIN+%28inner%29">inner joins</a> and <a href="basic.html#JOIN+%28outer%29">outer joins</a>). </p>
+<source>
+D = JOIN A BY a1, B BY b1, C BY c1 USING 'merge';
+</source>
+</section>
+
+<section>
+<title>Conditions</title>
+<p><strong>Condition A</strong></p>
+<p>Inner merge join (between two tables) will only work under these conditions: </p>
+<ul>
+<li>Between the load of the sorted input and the merge join statement there can only be filter statements and 
+foreach statements, and each foreach statement must meet the following conditions: 
+<ul>
+<li>There should be no UDFs in the foreach statement. </li>
+<li>The foreach statement should not change the position of the join keys. </li>
+<li>There should be no transformation on the join keys which will change the sort order. </li>
+</ul>
+</li>
+<li>Data must be sorted on join keys in ascending (ASC) order on both sides.</li>
+<li>Right-side loader must implement either the OrderedLoadFunc interface or the IndexableLoadFunc interface.</li>
+<li>Type information must be provided for the join key in the schema.</li>
+</ul>
+<p></p>
+<p>The Zebra and PigStorage loaders satisfy all of these conditions.</p>
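+<p>A minimal sketch of an inner merge join that meets these conditions, assuming two PigStorage files (the file and field names are hypothetical) that are already sorted in ascending order on id:</p>
+<source>
+-- the join key is typed in both schemas
+A = LOAD 'sorted_left' USING PigStorage() AS (id:int, name:chararray);
+B = LOAD 'sorted_right' USING PigStorage() AS (id:int, score:int);
+
+-- a filter between the load and the join is allowed
+A1 = FILTER A BY name IS NOT NULL;
+
+C = JOIN A1 BY id, B BY id USING 'merge';
+</source>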
+<p></p>
+
+<p><strong>Condition B</strong></p>
+<p>Outer merge join (between two tables) and inner merge join (between three or more tables) will only work under these conditions: </p>
+<ul>
+<li>No other operations can be done between the load and join statements. </li>
+<li>Data must be sorted on join keys in ascending (ASC) order on both sides. </li>
+<li>Left-most loader must implement the CollectableLoadFunc interface as well as OrderedLoadFunc. </li>
+<li>All other loaders must implement IndexableLoadFunc. </li>
+<li>Type information must be provided for the join key in the schema.</li>
+</ul>
+<p></p>
+<p>The Zebra loader satisfies all of these conditions.</p>
+
+<p>An example of a left outer merge join using the Zebra loader:</p>
+<source>
+A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted'); 
+B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted'); 
+C = join A by id left, B by id using 'merge'; 
+</source>
+
+<p></p>
+<p><strong>Both Conditions</strong></p>
+<p>
+For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 
+1 HDFS block (for example, if the HDFS block size is 128 MB, each part file should be at least 128 MB). 
+If the total input size (including all part files) is greater than the block size, then the part files should be uniform in size 
+(without large skews in sizes). The main idea is to eliminate skew in the amount of input the final map 
+job performing the merge-join will process. 
+</p>
+
+</section>
+</section><!-- END MERGE JOIN -->
+
+<!-- END SPECIALIZED JOINS--> 
+   
+	</section>
+
+</body>
+</document>

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/site.xml?rev=1050082&r1=1050081&r2=1050082&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/site.xml Thu Dec 16 18:10:59 2010
@@ -36,20 +36,20 @@ See http://forrest.apache.org/docs/linki
   always use index.html when you request http://yourHost/
   See FAQ: "How can I use a start-up-page other than index.html?"
 -->
-<site label="Pig" href="" xmlns="http://apache.org/forrest/linkmap/1.0"
-  tab="">
+<site label="Pig" href="" xmlns="http://apache.org/forrest/linkmap/1.0" tab="">
 
   <docs label="Pig"> 
     <index label="Overview" 				href="index.html" />
-    <quickstart label="Setup"	            href="setup.html" />
-    <tutorial label="Tutorial"				 	href="tutorial.html" />
-    <plref1 label="Pig Latin 1"	href="piglatin_ref1.html" />
-    <plref2 label="Pig Latin 2"	href="piglatin_ref2.html" />
-    <cookbook label="Cookbook" 		href="cookbook.html" />
-    <udf label="UDFs" href="udf.html" />
-    <udf label="PigUnit" href="pigunit.html" />
+    <start label="Getting Started"	        href="start.html" />
+    <basics label="Pig Latin Basics"	href="basic.html" />
+    <funct label="Built In Functions"	href="func.html" />
+    <udf label="User Defined Functions"  href="udf.html" />
+    <control label="Control Structures"	href="cont.html" />
+    <cmds label="Shell and Utility Commands" href="cmds.html" />
+    <perform label="Performance and Efficiency" href="perf.html" />
+    <test label="Testing and Diagnostics" href="test.html" />
     </docs>  
-    
+      
     <docs label="Zebra"> 
      <zover label="Zebra Overview "	href="zebra_overview.html" />
      <zusers label="Zebra Users "	href="zebra_users.html" />
@@ -61,7 +61,7 @@ See http://forrest.apache.org/docs/linki
 
      <docs label="Miscellaneous"> 
      <api	label="API Docs" href="api/"/>
-     <jdiff	label="API Changes" href="ext:jdiff/changes"/>
+     <jdiff label="API Changes" href="ext:jdiff/changes"/>
      <wiki  label="Wiki" href="ext:wiki" />
      <faq  label="FAQ" href="ext:faq" />
      <relnotes  label="Release Notes" 	href="ext:relnotes" />
@@ -72,7 +72,7 @@ See http://forrest.apache.org/docs/linki
     <faq       href="http://wiki.apache.org/pig/FAQ" />
     <relnotes  href="http://hadoop.apache.org/pig/releases.html" />
     <jdiff     href="jdiff/">
-      <changes href="changes.html" />
+    <changes href="changes.html" />
     </jdiff>
  </external-refs>
 </site>

Added: pig/trunk/src/docs/src/documentation/content/xdocs/start.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/start.xml?rev=1050082&view=auto
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/start.xml (added)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/start.xml Thu Dec 16 18:10:59 2010
@@ -0,0 +1,967 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
+<document>
+  <header>
+    <title>Getting Started</title>
+  </header>
+  <body>
+  
+<!-- ========================================================== -->  
+
+<!-- SET UP PIG -->
+ <section>
+		<title>Pig Setup</title>
+	
+<!-- ++++++++++++++++++++++++++++++++++ -->
+ <section id="req">
+ <title>Requirements</title>
+      <p>Unix and Windows users need the following:</p>
+		<ul>
+		  <li> <strong>Hadoop 0.20.2</strong> - <a href="http://hadoop.apache.org/common/releases.html">http://hadoop.apache.org/common/releases.html</a></li>
+		  <li> <strong>Java 1.6</strong> - <a href="http://java.sun.com/javase/downloads/index.jsp">http://java.sun.com/javase/downloads/index.jsp</a> (set JAVA_HOME to the root of your Java installation)</li>
+		  <li> <strong>Ant 1.7</strong> - <a href="http://ant.apache.org/">http://ant.apache.org/</a> (optional, for builds) </li>
+		  <li> <strong>JUnit 4.5</strong> - <a href="http://junit.sourceforge.net/">http://junit.sourceforge.net/</a> (optional, for unit tests) </li>
+		</ul>
+		<p></p>
+	<p>Windows users also need to install Cygwin and the Perl package: <a href="http://www.cygwin.com/"> http://www.cygwin.com/</a></p>
+  </section>         
+   
+<!-- ++++++++++++++++++++++++++++++++++ -->        
+ <section>
+ <title>Download Pig</title>
+	<p>To get a Pig distribution, do the following:</p>
+	
+	<ol>
+	<li>Download a recent stable release from one of the Apache Download Mirrors 
+	(see <a href="http://hadoop.apache.org/pig/releases.html"> Pig Releases</a>).</li>
+	
+    <li>Unpack the downloaded Pig distribution, and then note the following:
+	    <ul>
+	    <li>The Pig script file, pig, is located in the bin directory (/pig-n.n.n/bin/pig). 
+	    The Pig environment variables are described in the Pig script file.</li>
+	    <li>The Pig properties file, pig.properties, is located in the /pig-n.n.n/conf directory. 
+	    You can specify an alternate location using the PIG_CONF_DIR environment variable.</li>
+	</ul>	
+	</li>
+	<li>Add /pig-n.n.n/bin to your path. Use export (bash,sh,ksh) or setenv (tcsh,csh). For example: <br></br>
+	<code>$ export PATH=/&lt;my-path-to-pig&gt;/pig-n.n.n/bin:$PATH</code>
+</li>
+<li>
+Test the Pig installation with this simple command: <code>$ pig -help</code>
+</li>
+</ol>
+
+</section>  
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<section>
+<title>Build Pig</title>
+      <p>To build pig, do the following:</p>
+     <ol>
+	  <li> Check out the Pig code from SVN: <code>svn co http://svn.apache.org/repos/asf/pig/trunk</code> </li>
+	  <li> Build the code from the top directory: <code>ant</code> <br></br>
+	  If the build is successful, you should see the pig.jar file created in that directory. </li>	
+	  <li> Validate the pig.jar  by running a unit test: <code>ant test</code></li>
+     </ol>
+ </section>
+
+</section>
+
+  <!-- ==================================================================== -->
+    
+   <!-- RUNNING PIG  -->
+   <section>
+	<title>Running Pig </title> 
+	<p>You can run or execute Pig Latin statements in various ways.</p>
+	<table>
+	<tr>
+	<td><strong>Pig Latin Statements</strong></td>
+    <td><strong>Local Mode</strong></td>
+    <td><strong>Mapreduce Mode</strong></td>
+	</tr>
+	<tr>
+	<td>Grunt Shell (enter statements interactively or run Pig scripts)</td>
+    <td>yes</td>
+    <td>yes</td>
+	</tr>
+	<tr>
+	<td>Pig Scripts (run batch statements from command line or Grunt shell)</td>
+    <td>yes</td>
+    <td>yes</td>
+	</tr>
+	<tr>
+	<td>Embedded Programs (embed statements in a host language)</td>
+    <td>yes</td>
+    <td>yes</td>
+	</tr>
+	</table>
+	
+	<!-- ++++++++++++++++++++++++++++++++++ -->
+	   <section>
+	<title>Run Modes</title> 
+<p>Pig has two run modes or exectypes: </p>
+<ul>
+<li><strong>Local Mode</strong> - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag.
+</li>
+<li><strong>Mapreduce Mode</strong> - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, <em>but don't need to</em>, specify it using the -x flag.
+</li>
+</ul>
+<p></p>
+
+<p>You can run the Grunt shell and Pig scripts in either mode using the "pig" or "java" command. 
+You can compile and run embedded programs in either mode using the conventions of the host language. </p>
+
+
+<section>
+<title>Examples</title>
+
+<p>This example shows how to run Pig in local and mapreduce mode using the pig command.</p>
+<source>
+/* local mode */
+$ pig -x local ...
+
+/* mapreduce mode */
+$ pig ...
+$ pig -x mapreduce ...
+</source>
+
+<p>This example shows how to run Pig in local and mapreduce mode using the java command.</p>
+<source>
+/* local mode */
+$ java -cp pig.jar org.apache.pig.Main -x local ...
+
+/* mapreduce mode */
+$ java -cp pig.jar org.apache.pig.Main ...
+$ java -cp pig.jar org.apache.pig.Main -x mapreduce ...
+</source>
+
+</section>
+</section>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<section>
+<title>Grunt Shell</title>
+<p>Use Pig's Grunt shell to enter Pig Latin statements interactively. 
+You can also run Pig scripts from the Grunt shell 
+(see the <a href="cmds.html#run">run</a> and <a href="cmds.html#exec">exec</a> commands). </p>
+
+<section>
+<title>Example</title>
+<p>These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, invoke the Grunt shell by typing the pig command (in local mode or mapreduce mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator will display the results to your terminal screen.</p>
+<source>
+grunt&gt; A = load 'passwd' using PigStorage(':'); 
+grunt&gt; B = foreach A generate $0 as id; 
+grunt&gt; dump B; 
+</source>
+
+<p><strong>Local Mode</strong></p>
+<source>
+$ pig -x local
+... - Connecting to ...
+grunt&gt;
+</source>
+
+<p><strong>Mapreduce Mode</strong> </p>
+<source>
+$ pig -x mapreduce
+or
+$ pig 
+... - Connecting to ...
+grunt&gt;
+</source>
+</section>
+</section>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<section>
+<title>Pig Scripts</title>
+<p>Use script files to run Pig Latin statements as batch jobs. With Pig scripts you can pass values to parameters using <a href="cont.html#Parameter+Substitution">parameter substitution</a>. </p>
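+<p>As a brief sketch of parameter substitution, the hypothetical script below (daily.pig) refers to three parameters whose values are supplied with the -param option when the script is run. The script name, parameter names, and values are assumptions made for illustration.</p>
+<source>
+/* daily.pig */
+
+A = LOAD '$input' USING PigStorage() AS (name:chararray, age:int, gpa:float);
+B = FILTER A BY age &gt;= $minage;
+STORE B INTO '$output';
+</source>
+<source>
+$ pig -x local -param input=student -param minage=18 -param output=adults.out daily.pig
+</source>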
+
+   <section>
+   <title>Pig Scripts and Comments</title>
+   <p>You can include comments in Pig scripts:</p>
+   <ul>
+      <li>
+         <p>For multi-line comments use /* ... */</p>
+      </li>
+      <li>
+         <p>For single-line comments use --</p>
+      </li>
+   </ul>
+<source>
+/* myscript.pig
+My script is simple.
+It includes three Pig Latin statements.
+*/
+
+A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float); -- loading data
+B = FOREACH A GENERATE name;  -- transforming data
+DUMP B;  -- retrieving results
+</source>   
+</section>
+
+<section>
+<title>Pig Scripts and DFS</title>
+<p>Pig supports running scripts (and Jar files) that are stored in HDFS, Amazon S3, and other distributed file systems. The script's full location URI is required (see <a href="basic.html#REGISTER">REGISTER</a> for information about Jar files). For example, to run a Pig script on HDFS, do the following:</p>
+<source>
+$ pig hdfs://nn.mydomain.com:9020/myscripts/script.pig
+</source> 
+</section>
+
+<section>
+<title>Example</title>
+
+<p>The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).</p>
+<source>
+/* id.pig */
+
+A = load 'passwd' using PigStorage(':');  -- load the passwd file 
+B = foreach A generate $0 as id;  -- extract the user IDs 
+store B into 'id.out';  -- write the results to a file named id.out
+</source>
+
+<p><strong>Local Mode</strong></p>
+<source>
+$ pig -x local id.pig
+</source>
+<p><strong>Mapreduce Mode</strong> </p>
+<source>
+$ pig id.pig
+or
+$ pig -x mapreduce id.pig
+</source>
+</section>
+
+</section>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<section>
+<title>Embedded Programs</title>
+<p>Use the embedded option to embed Pig statements in a host language. Currently Java and Python are supported.</p>
+
+<section>
+<title>Java Example</title>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<p><strong>Local Mode</strong></p>
+<p>From your current working directory, compile the program. (Note that idlocal.class is written to your current working directory. Include “.” in the class path when you run the program.) </p>
+<source>
+$ javac -cp pig.jar idlocal.java
+</source>
+<p> </p>
+<p>From your current working directory, run the program. To view the results, check the output file, id.out.</p>
+<source>
+Unix:   $ java -cp pig.jar:. idlocal
+Cygwin: $ java -cp '.;pig.jar' idlocal
+</source>
+
+<p>idlocal.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
+Copy the /etc/passwd file to your local working directory.</p>
+<source>
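+import java.io.IOException;
+import org.apache.pig.PigServer;
+
+public class idlocal {
+    public static void main(String[] args) {
+        try {
+            // Run Pig in local mode: input and output use the local file system.
+            PigServer pigServer = new PigServer("local");
+            runIdQuery(pigServer, "passwd");
+        } catch (Exception e) {
+            // Report the failure instead of silently ignoring it.
+            e.printStackTrace();
+        }
+    }
+
+    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
+        // Load the colon-delimited file and keep the first field (the user ID).
+        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
+        pigServer.registerQuery("B = foreach A generate $0 as id;");
+        // Write the results to id.out.
+        pigServer.store("B", "id.out");
+    }
+}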
+</source>
+<p> </p>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<p><strong>Mapreduce Mode</strong></p>
+<p>Point $HADOOPDIR to the directory that contains the hadoop-site.xml file. Example: 
+</p>
+<source>
+$ export HADOOPDIR=/yourHADOOPsite/conf 
+</source>
+<p>From your current working directory, compile the program. (Note that idmapreduce.class is written to your current working directory. Include “.” in the class path when you run the program.)
+</p>
+<source>
+$ javac -cp pig.jar idmapreduce.java
+</source>
+<p></p>
+<p>From your current working directory, run the program. To view the results, check the idout directory on your Hadoop system. </p>
+<source>
+Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce
+Cygwin: $ java -cp ".;pig.jar;$HADOOPDIR" idmapreduce
+</source>
+
+<p>idmapreduce.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
+Copy the /etc/passwd file to your local working directory.</p>
+<source>
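+import java.io.IOException;
+import org.apache.pig.PigServer;
+
+public class idmapreduce {
+    public static void main(String[] args) {
+        try {
+            // Run Pig in mapreduce mode: the job runs on the Hadoop cluster.
+            PigServer pigServer = new PigServer("mapreduce");
+            runIdQuery(pigServer, "passwd");
+        } catch (Exception e) {
+            // Report the failure instead of silently ignoring it.
+            e.printStackTrace();
+        }
+    }
+
+    public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
+        // Load the colon-delimited file and keep the first field (the user ID).
+        pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
+        pigServer.registerQuery("B = foreach A generate $0 as id;");
+        // Write the results to the idout directory.
+        pigServer.store("B", "idout");
+    }
+}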
+</source>
+</section>
+
+</section>
+</section>
+
+
+  <!-- ==================================================================== -->
+    
+   <!-- PIG LATIN STATEMENTS -->
+   <section>
+	<title>Pig Latin Statements</title>	
+   <p>Pig Latin statements are the basic constructs you use to process data using Pig. 
+   A Pig Latin statement is an operator that takes a <a href="basic.html#relations">relation</a> as input and produces another relation as output. 
+   (This definition applies to all Pig Latin operators except LOAD and STORE which read data from and write data to the file system.) 
+   Pig Latin statements may include <a href="basic.html#Expressions">expressions</a> and <a href="basic.html#Schemas">schemas</a>. 
+   Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ).  
+   By default, Pig Latin statements are processed using <a href="perf.html#Multi-Query+Execution">multi-query execution</a>.  
+ </p>
+   
+   <p>Pig Latin statements are generally organized as follows:</p>
+   <ul>
+      <li>
+         <p>A LOAD statement to read data from the file system. </p>
+      </li>
+      <li>
+         <p>A series of "transformation" statements to process the data. </p>
+      </li>
+      <li>
+         <p>A DUMP statement to view results or a STORE statement to save the results.</p>
+      </li>
+   </ul>
+<p></p>
+   <p>Note that a DUMP or STORE statement is required to generate output.</p>
+<ul>
+<li>
+<p>In this example Pig will validate, but not execute, the LOAD and FOREACH statements.</p>
+<source>
+A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
+B = FOREACH A GENERATE name;
+</source> 
+</li>
+<li>
+<p>In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.</p>
+<source>
+A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
+B = FOREACH A GENERATE name;
+DUMP B;
+(John)
+(Mary)
+(Bill)
+(Joe)
+</source>
+</li>
+</ul>
+  
+   <!-- ++++++++++++++++++++++++++++++++++ -->   
+   <section>
+   <title>Loading Data</title>
+   <p>Use the  <a href="basic.html#LOAD">LOAD</a> operator and the <a href="func.html#Load%2FStore+Functions">load/store functions</a> to read data into Pig (PigStorage is the default load function).</p>
+   </section>
+  
+   <!-- ++++++++++++++++++++++++++++++++++ -->   
+   <section>
+   <title>Working with Data</title>
+   <p>Pig allows you to transform data in many ways. As a starting point, become familiar with these operators:</p>
+   <ul>
+      <li>
+         <p>Use the <a href="basic.html#FILTER">FILTER</a> operator to work with tuples or rows of data. 
+         Use the <a href="basic.html#FOREACH">FOREACH</a> operator to work with columns of data.</p>
+      </li>
+      <li>
+         <p>Use the <a href="basic.html#GROUP">GROUP</a> operator to group data in a single relation. 
+         Use the <a href="basic.html#COGROUP">COGROUP</a>,
+         <a href="basic.html#JOIN+%28inner%29">inner JOIN</a>, and
+         <a href="basic.html#JOIN+%28outer%29">outer JOIN</a>
+         operators  to group or join data in two or more relations.</p>
+      </li>
+      <li>
+         <p>Use the <a href="basic.html#UNION">UNION</a> operator to merge the contents of two or more relations. 
+         Use the <a href="basic.html#SPLIT">SPLIT</a> operator to partition the contents of a relation into multiple relations.</p>
+      </li>
+   </ul>
+   </section>
+   
+<!-- ++++++++++++++++++++++++++++++++++ --> 
+      <section>
+   <title>Storing Intermediate Data</title>
+
+      <p>Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. 
+   This location must already exist on HDFS prior to use. 
+   This location can be configured using the pig.temp.dir property. The property's default value is "/tmp" which is the same 
+   as the hardcoded location in Pig 0.7.0 and earlier versions. </p>
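+<p>For example, assuming a hypothetical directory /user/me/pigtmp, you could create the location first and then pass the property as a Java system property when launching Pig (a sketch only; adjust the path and script name to your environment):</p>
+<source>
+$ hadoop fs -mkdir /user/me/pigtmp
+$ java -cp pig.jar -Dpig.temp.dir=/user/me/pigtmp org.apache.pig.Main myscript.pig
+</source>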
+      </section>
+   
+    <section>
+   <title>Storing Results</title>
+   <p>Use the  <a href="basic.html#STORE">STORE</a> operator and the <a href="func.html#Load%2FStore+Functions">load/store functions</a> 
+   to write results to the file system (PigStorage is the default store function). </p>
+<p><strong>Note:</strong> During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. 
+However, in a production environment you always want to use the STORE operator to save your results (see <a href="perf.html#Store+vs.+Dump">Store vs. Dump</a>).</p>   
+   </section> 
+
+ <!-- ++++++++++++++++++++++++++++++++++ -->     
+   <section>
+   <title>Debugging Pig Latin</title>
+   <p>Pig Latin provides operators that can help you debug your Pig Latin statements:</p>
+   <ul>
+      <li>
+         <p>Use the  <a href="test.html#DUMP">DUMP</a> operator to display results to your terminal screen. </p>
+      </li>
+      <li>
+         <p>Use the  <a href="test.html#DESCRIBE">DESCRIBE</a> operator to review the schema of a relation.</p>
+      </li>
+      <li>
+         <p>Use the  <a href="test.html#EXPLAIN">EXPLAIN</a> operator to view the logical, physical, or MapReduce execution plans to compute a relation.</p>
+      </li>
+      <li>
+         <p>Use the  <a href="test.html#ILLUSTRATE">ILLUSTRATE</a> operator to view the step-by-step execution of a series of statements.</p>
+      </li>
+   </ul>
+</section> 
+</section>  
+
+
+<!-- ================================================================== -->
+<!-- PIG PROPERTIES -->
+<section>
+<title>Pig Properties</title>
+   <p>
+The Pig "-propertyfile" option enables you to pass a set of Pig or Hadoop properties to a Pig job. If a property is present both in the property file passed from the command line and in the default property file bundled into pig.jar, the value passed from the command line takes precedence. These properties, as well as all other properties defined in Pig, are available to your UDFs via the UDFContext.getClientSystemProps() API call (see the <a href="udf.html">Pig UDF Manual</a>).  </p>
+
+<p>You can retrieve a list of all properties using the <a href="cmds.html#help">help properties</a> command.</p>
+<p>You can set properties using the <a href="cmds.html#set">set</a> command.</p>
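+<p>As a brief illustration, values could be set at the top of a Pig script as sketched below. The keys and values shown (default_parallel, job.name) are assumptions chosen for the example; consult the set command documentation to check which keys your Pig version supports.</p>
+<source>
+-- example values only; adjust for your cluster
+set default_parallel 10;
+set job.name 'tutorial-script1';
+</source>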
+</section>  
+
+
+  <!-- ==================================================================== -->
+  <!-- PIG TUTORIAL -->
+  <section>
+<title>Pig Tutorial </title>
+
+<p>The Pig tutorial shows you how to run two Pig scripts using Pig's local mode and mapreduce mode (see <a href="#Run+Modes">Run Modes</a>).</p>
+
+<p>The Pig tutorial file, tutorial/pigtutorial.tar.gz, is part of the Pig distribution (see <a href="#Download+Pig">Download Pig</a>). The zipped file includes the tutorial JAR file, Pig scripts, and log/data files (see <a href="#Pig+Tutorial+Files">Pig Tutorial Files</a>). These files work with Hadoop 0.20.2 and include everything you need to run the Pig scripts, which are explained line-by-line (see <a href="#Pig+Script+1%3A+Query+Phrase+Popularity">Pig Script 1</a> and 
+<a href="#Pig+Script+2%3A+Temporal+Query+Phrase+Popularity">Pig Script 2</a>).</p>
+
+<p>To get started with the Pig tutorial, do the following preliminary tasks:</p>
+
+<ol>
+<li>Make sure the JAVA_HOME environment variable is set to the root of your Java installation.</li>
+<li>Make sure that bin/pig is in your PATH (this enables you to run the scripts using the "pig" command).
+<source>
+$ export PATH=/&lt;my-path-to-pig&gt;/pig-n.n.n/bin:$PATH 
+</source>
+</li>
+<li>Set the PIG_HOME environment variable:
+<source>
+$ export PIG_HOME=/&lt;my-path-to-pig&gt;/pig-n.n.n 
+</source></li>
+<li>Copy the pigtutorial.tar.gz file from the tutorial directory of your Pig installation to your local directory. </li>
+<li>Unzip the Pig tutorial file; the files are stored in a newly created directory, pigtmp. 
+<source>
+$ tar -xzf pigtutorial.tar.gz
+</source>
+</li>
+</ol>
+
+
+ <!-- ++++++++++++++++++++++++++++++++++ --> 
+<section>
+<title> Running the Pig Scripts in Local Mode</title>
+
+<p>To run the Pig scripts in local mode, do the following: </p>
+<ol>
+<li>Move to the pigtmp directory.</li>
+<li>Execute the following command (using either script1-local.pig or script2-local.pig). 
+<source>
+$ pig -x local script1-local.pig
+</source>
+</li>
+<li>Review the result files, located in the part-r-00000 directory.
+<p>The output may contain a few Hadoop warnings which can be ignored:</p>
+<source>
+2010-04-08 12:55:33,642 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics 
+- Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
+</source>
+</li>
+</ol>
+</section>
+
+ <!-- ++++++++++++++++++++++++++++++++++ --> 
+<section>
+<title> Running the Pig Scripts in Mapreduce Mode</title>
+
+<p>To run the Pig scripts in mapreduce mode, do the following: </p>
+<ol>
+<li>Move to the pigtmp directory.</li>
+<li>Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory.
+<source>
+$ hadoop fs -copyFromLocal excite.log.bz2 .
+</source>
+</li>
+
+<li>Set the PIG_CLASSPATH environment variable to the location of the cluster configuration directory (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files):
+<source>
+$ export PIG_CLASSPATH=/mycluster/conf
+</source></li>
+<li>Set the HADOOP_CONF_DIR environment variable to the location of the cluster configuration directory:
+<source>
+$ export HADOOP_CONF_DIR=/mycluster/conf
+</source></li>
+
+<li>Execute the following command (using either script1-hadoop.pig or script2-hadoop.pig):
+<source>
+$ pig script1-hadoop.pig
+</source>
+</li>
+
+<li>Review the result files, located in the script1-hadoop-results or script2-hadoop-results HDFS directory:
+<source>
+$ hadoop fs -ls script1-hadoop-results
+$ hadoop fs -cat 'script1-hadoop-results/*' | less
+</source>
+</li>
+</ol>
+</section>
+
+ <!-- ++++++++++++++++++++++++++++++++++ -->   
+<section>
+<title> Pig Tutorial Files</title>
+
+<p>The contents of the Pig tutorial file (pigtutorial.tar.gz) are described here. </p>
+
+<table>
+<tr>
+<td>
+<p> <strong>File</strong> </p>
+</td>
+<td>
+<p> <strong>Description</strong></p>
+</td>
+</tr>
+<tr>
+<td>
+<p> pig.jar </p>
+</td>
+<td>
+<p> Pig JAR file </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> tutorial.jar </p>
+</td>
+<td>
+<p> User-defined functions (UDFs) and Java classes </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> script1-local.pig </p>
+</td>
+<td>
+<p> Pig Script 1, Query Phrase Popularity (local mode) </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> script1-hadoop.pig </p>
+</td>
+<td>
+<p> Pig Script 1, Query Phrase Popularity (mapreduce mode) </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> script2-local.pig </p>
+</td>
+<td>
+<p> Pig Script 2, Temporal Query Phrase Popularity (local mode)</p>
+</td>
+</tr>
+<tr>
+<td>
+<p> script2-hadoop.pig </p>
+</td>
+<td>
+<p> Pig Script 2, Temporal Query Phrase Popularity (mapreduce mode) </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> excite-small.log </p>
+</td>
+<td>
+<p> Log file, Excite search engine (local mode) </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> excite.log.bz2 </p>
+</td>
+<td>
+<p> Log file, Excite search engine (mapreduce) </p>
+</td>
+</tr>
+</table>
+
+
+<p>The user-defined functions (UDFs) are described here. </p>
+
+<table>
+<tr>
+<td>
+<p> <strong>UDF</strong> </p>
+</td>
+<td>
+<p> <strong>Description</strong></p>
+</td>
+</tr>
+<tr>
+<td>
+<p> ExtractHour </p>
+</td>
+<td>
+<p> Extracts the hour from the record.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p> NGramGenerator </p>
+</td>
+<td>
+<p> Composes n-grams from the set of words. </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> NonURLDetector </p>
+</td>
+<td>
+<p> Removes the record if the query field is empty or a URL. </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> ScoreGenerator </p>
+</td>
+<td>
+<p> Calculates a "popularity" score for the n-gram.</p>
+</td>
+</tr>
+<tr>
+<td>
+<p> ToLower </p>
+</td>
+<td>
+<p> Changes the query field to lowercase. </p>
+</td>
+</tr>
+<tr>
+<td>
+<p> TutorialUtil </p>
+</td>
+<td>
+<p> Divides the query string into a set of words.</p>
+</td>
+</tr>
+</table>
+
+</section>
+
+ <!-- ++++++++++++++++++++++++++++++++++ -->   
+<section>
+<title> Pig Script 1: Query Phrase Popularity</title>
+
+<p>The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day. </p>
+<p>The script is shown here: </p>
+<ul>
+<li><p> Register the tutorial JAR file so that the included UDFs can be called in the script. </p>
+</li>
+</ul>
+
+<source>
+REGISTER ./tutorial.jar; 
+</source>
+<ul>
+<li><p> Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields <strong>user</strong>, <strong>time</strong>, and <strong>query</strong>.  </p>
+</li>
+</ul>
+
+<source>
+raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
+</source>
+<ul>
+<li><p> Call the NonURLDetector UDF to remove records if the query field is empty or a URL.  </p>
+</li>
+</ul>
+
+<source>
+clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
+</source>
+<ul>
+<li><p> Call the ToLower UDF to change the query field to lowercase.  </p>
+</li>
+</ul>
+
+<source>
+clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
+</source>
+<ul>
+<li><p> Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field. </p>
+</li>
+</ul>
+
+<source>
+houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
+</source>
+<ul>
+<li><p> Call the NGramGenerator UDF to compose the n-grams of the query. </p>
+</li>
+</ul>
+
+<source>
+ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
+</source>
+<ul>
+<li><p> Use the DISTINCT operator to get the unique n-grams for all records.  </p>
+</li>
+</ul>
+
+<source>
+ngramed2 = DISTINCT ngramed1;
+</source>
+<ul>
+<li><p> Use the GROUP operator to group records by n-gram and hour. </p>
+</li>
+</ul>
+
+<source>
+hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
+</source>
+<ul>
+<li><p> Use the COUNT function to get the count (occurrences) of each n-gram.  </p>
+</li>
+</ul>
+
+<source>
+hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
+</source>
+<ul>
+<li><p> Use the GROUP operator to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour. </p>
+</li>
+</ul>
+
+<source>
+uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
+</source>
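+<p>The field name group::ngram used above comes from the earlier flatten($0) call: flattening the group tuple produced by GROUP ... BY (ngram, hour) yields fields prefixed with group::. A quick way to confirm the names, sketched here under the assumption that the preceding statements have been run in the same session, is the DESCRIBE operator:</p>
+<source>
+-- print the schema of the flattened relation to see the group:: prefixes
+DESCRIBE hour_frequency2;
+</source>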
+<ul>
+<li><p> For each group, identify the hour in which this n-gram is used with a particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram. </p>
+</li>
+</ul>
+
+<source>
+uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
+</source>
+<ul>
+<li><p> Use the FOREACH-GENERATE operator to assign names to the fields.  </p>
+</li>
+</ul>
+
+<source>
+uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
+</source>
+<ul>
+<li><p> Use the FILTER operator to remove all records with a score less than or equal to 2.0. </p>
+</li>
+</ul>
+
+<source>
+filtered_uniq_frequency = FILTER uniq_frequency3 BY score &gt; 2.0;
+</source>
+<ul>
+<li><p> Use the ORDER operator to sort the remaining records by hour and score. </p>
+</li>
+</ul>
+
+<source>
+ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
+</source>
+<ul>
+<li><p> Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: <strong>hour</strong>, <strong>ngram</strong>, <strong>score</strong>, <strong>count</strong>, <strong>mean</strong>. </p>
+</li>
+</ul>
+<source>
+STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage(); 
+</source>
+</section>
+
+ <!-- ++++++++++++++++++++++++++++++++++ -->   
+<section>
+<title>Pig Script 2: Temporal Query Phrase Popularity</title>
+
+<p>The Temporal Query Phrase Popularity script (script2-local.pig or script2-hadoop.pig) processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours. </p>
+<p>The script is shown here: </p>
+<ul>
+<li><p> Register the tutorial JAR file so that the user-defined functions (UDFs) can be called in the script. </p>
+</li>
+</ul>
+
+<source>
+REGISTER ./tutorial.jar;
+</source>
+<ul>
+<li><p> Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields <strong>user</strong>, <strong>time</strong>, and <strong>query</strong>. </p>
+</li>
+</ul>
+
+<source>
+raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
+</source>
+<ul>
+<li><p> Call the NonURLDetector UDF to remove records if the query field is empty or a URL. </p>
+</li>
+</ul>
+
+<source>
+clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
+</source>
+<ul>
+<li><p> Call the ToLower UDF to change the query field to lowercase. </p>
+</li>
+</ul>
+
+<source>
+clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
+</source>
+<ul>
+<li><p> Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour from the time field. </p>
+</li>
+</ul>
+
+<source>
+houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
+</source>
+<ul>
+<li><p> Call the NGramGenerator UDF to compose the n-grams of the query. </p>
+</li>
+</ul>
+
+<source>
+ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
+</source>
+<ul>
+<li><p> Use the DISTINCT operator to get the unique n-grams for all records.  </p>
+</li>
+</ul>
+
+<source>
+ngramed2 = DISTINCT ngramed1;
+</source>
+<ul>
+<li><p> Use the GROUP operator to group the records by n-gram and hour.  </p>
+</li>
+</ul>
+
+<source>
+hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
+</source>
+<ul>
+<li><p> Use the COUNT function to get the count (occurrences) of each n-gram.  </p>
+</li>
+</ul>
+
+<source>
+hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
+</source>
+<ul>
+<li><p> Use the FOREACH-GENERATE operator to assign names to the fields. </p>
+</li>
+</ul>
+
+<source>
+hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram, $1 as hour, $2 as count;
+</source>
+<ul>
+<li><p> Use the FILTER operator to get the n-grams for hour ‘00’. </p>
+</li>
+</ul>
+
+<source>
+hour00 = FILTER hour_frequency2 BY hour eq '00';
+</source>
+<ul>
+<li><p> Use the FILTER operator to get the n-grams for hour ‘12’. </p>
+</li>
+</ul>
+
+<source>
+hour12 = FILTER hour_frequency3 BY hour eq '12';
+</source>
+<ul>
+<li><p> Use the JOIN operator to get the n-grams that appear in both hours. </p>
+</li>
+</ul>
+
+<source>
+same = JOIN hour00 BY $0, hour12 BY $0;
+</source>
+<ul>
+<li><p> Use the FOREACH-GENERATE operator to record their frequency. </p>
+</li>
+</ul>
+
+<source>
+same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as ngram, $2 as count00, $5 as count12;
+</source>
+<ul>
+<li><p> Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: <strong>ngram</strong>, <strong>count00</strong>, <strong>count12</strong>. </p>
+</li>
+</ul>
+
+<source>
+STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
+</source>
+</section>
+</section>
+
+
+</body>
+</document>

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml?rev=1050082&r1=1050081&r2=1050082&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml Thu Dec 16 18:10:59 2010
@@ -32,6 +32,6 @@
   -->
   <tab label="Project" href="http://hadoop.apache.org/pig/" type="visible" /> 
   <tab label="Wiki" href="http://wiki.apache.org/pig/" type="visible" /> 
-  <tab label="Pig 0.8.0 Documentation" dir="" type="visible" /> 
+  <tab label="Pig 0.9.0 Documentation" dir="" type="visible" /> 
 
 </tabs>


