pig-commits mailing list archives

From o...@apache.org
Subject svn commit: r1024001 - in /pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs: pigunit.xml udf.xml
Date Mon, 18 Oct 2010 20:45:10 GMT
Author: olga
Date: Mon Oct 18 20:45:09 2010
New Revision: 1024001

URL: http://svn.apache.org/viewvc?rev=1024001&view=rev
PIG-1600: Docs update (chandec via olgan)


Modified: pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/pigunit.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/pigunit.xml?rev=1024001&r1=1024000&r2=1024001&view=diff
--- pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/pigunit.xml (original)
+++ pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/pigunit.xml Mon Oct 18 20:45:09 2010
@@ -97,9 +97,8 @@ STORE queries_limit INTO '$output';
         Many examples are available in the
-          href="http://svn.apache.org/viewvc/hadoop/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java"
-        >PigUnit tests</a>
-        .
+          href="http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/test/pigunit/TestPigTest.java"
+        >PigUnit tests</a>.

Modified: pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/udf.xml?rev=1024001&r1=1024000&r2=1024001&view=diff
--- pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ pig/branches/branch-0.8/src/docs/src/documentation/content/xdocs/udf.xml Mon Oct 18 20:45:09 2010
@@ -144,7 +144,7 @@ DUMP C;
 <p>An aggregate function is an eval function that takes a bag and returns a scalar
value. One interesting and useful property of many aggregate functions is that they can be
computed incrementally in a distributed fashion. We call these functions <code>algebraic</code>.
<code>COUNT</code> is an example of an algebraic function because we can count
the number of elements in a subset of the data and then sum the counts to produce a final
output. In the Hadoop world, this means that the partial computations can be done by the map
and combiner, and the final result can be computed by the reducer. </p>
-<p>It is very important for performance to make sure that aggregate functions that
are algebraic are implemented as such. Let's look at the implementation of the COUNT function
to see what this means. (Error handling and some other code is omitted to save space. The
full code can be accessed <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup">
+<p>It is very important for performance to make sure that aggregate functions that
are algebraic are implemented as such. Let's look at the implementation of the COUNT function
to see what this means. (Error handling and some other code is omitted to save space. The
full code can be accessed <a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup">
 public class COUNT extends EvalFunc&lt;Long&gt; implements Algebraic{
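The map/combiner/reducer decomposition described above can be sketched outside Pig. A minimal Python illustration follows; the initial/intermed/final names are stand-ins for the stages of Pig's Algebraic interface, not its actual API:

```python
# Sketch of an algebraic COUNT: each partition is counted independently,
# partial counts are combined, and the final result is a sum of sums.

def initial(partition):
    # Map stage: count the elements in one subset of the data.
    return sum(1 for _ in partition)

def intermed(partials):
    # Combiner stage: collapse several partial counts into one.
    return sum(partials)

def final(partials):
    # Reduce stage: the total is the sum of the remaining partial counts.
    return sum(partials)

partitions = [[1, 2, 3], [4, 5], [6]]
partials = [initial(p) for p in partitions]   # [3, 2, 1]
total = final([intermed(partials)])           # 6
```

Because addition is associative, the combiner can be applied to any subset of the partial counts without changing the final result, which is exactly the property that makes COUNT algebraic.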
@@ -343,7 +343,7 @@ Java Class
-<p>All Pig-specific classes are available <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/">
here</a>. </p>
+<p>All Pig-specific classes are available <a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/data/">
here</a>. </p>
 <p><code>Tuple</code> and <code>DataBag</code> are different
in that they are not concrete classes but rather interfaces. This enables users to extend
Pig with their own versions of tuples and bags. As a result, UDFs cannot directly instantiate
bags or tuples; they need to go through factory classes: <code>TupleFactory</code>
and <code>BagFactory</code>. </p>
 <p>The builtin <code>TOKENIZE</code> function shows how bags and tuples
are created. A function takes a text string as input and returns a bag of words from the text.
(Note that currently Pig bags always contain tuples.) </p>
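As a rough sketch of the output shape TOKENIZE produces, with plain Python lists and tuples standing in for Pig's DataBag and Tuple, each word becomes a single-field tuple inside the bag:

```python
# Sketch of TOKENIZE's result: a bag of single-field tuples, since
# Pig bags always contain tuples. Plain Python containers stand in
# for DataBag/Tuple here.
def tokenize(text):
    return [(word,) for word in text.split()]

bag = tokenize("hello wonderful world")   # [('hello',), ('wonderful',), ('world',)]
```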
@@ -749,14 +749,14 @@ This enables Pig users/developers to cre
 <title> Load Functions</title>
-<p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a>

+<p><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a>

 abstract class has the main methods for loading data and for most use cases it would suffice
to extend it. There are three other optional interfaces which can be implemented to achieve
extended functionality: </p>
-<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup">LoadMetadata</a>

+<li><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup">LoadMetadata</a>

has methods to deal with metadata - most implementations of loaders don't need to implement
this unless they interact with some metadata system. The getSchema() method in this interface
provides a way for loader implementations to communicate the schema of the data back to Pig.
If a loader implementation returns data comprised of fields of real types (rather than DataByteArray
fields), it should provide the schema describing the data returned through the getSchema()
method. The other methods are concerned with other types of metadata like partition keys and
statistics. Implementations can return null return values for these methods if they are not
applicable for that implementation.</li>
-<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup">LoadPushDown</a>

+<li><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup">LoadPushDown</a>

 has methods to push operations from Pig runtime into loader implementations. Currently only
the pushProjection() method is called by Pig to communicate to the loader the exact fields
that are required in the Pig script. The loader implementation can choose to honor the request
(return only those fields required by Pig script) or not honor the request (return all fields
in the data). If the loader implementation can efficiently honor the request, it should implement
LoadPushDown to improve query performance. (Irrespective of whether the implementation can
or cannot honor the request, if the implementation also implements getSchema(), the schema
returned in getSchema() should describe the entire tuple of data.)
 	<li>pushProjection(): This method tells LoadFunc which fields are required in the
Pig script, thus enabling LoadFunc to optimize performance by loading only those fields that
are needed. pushProjection() takes a RequiredFieldList. RequiredFieldList includes a list
of RequiredField: each RequiredField indicates a field required by the Pig script; each RequiredField
includes index, alias, type (which is reserved for future use), and subFields. Pig will use
the column index RequiredField.index to communicate with the LoadFunc about the fields required
by the Pig script. If the required field is a map, Pig will optionally pass RequiredField.subFields
which contains a list of keys that the Pig script needs for the map. For example, if the Pig
script needs two keys for the map, "key1" and "key2", the subFields for that map will contain
two RequiredField; the alias field for the first RequiredField will be "key1" and the alias
for the second RequiredField will be "key2". LoadFunc 
 will use RequiredFieldResponse.requiredFieldRequestHonored to indicate whether the pushProjection()
request is honored.
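The push-down contract described above can be sketched in a few lines of Python; here RequiredFieldList is reduced to a plain list of field indexes, and the function name is illustrative rather than Pig's API:

```python
# Sketch of projection push-down: a loader that honors the request
# returns only the requested field indexes; passing None models a
# loader that does not honor the request and returns full tuples.
def load_with_projection(rows, required_indexes=None):
    if required_indexes is None:
        return [tuple(r) for r in rows]
    return [tuple(r[i] for i in required_indexes) for r in rows]

rows = [("alice", 30, "NY"), ("bob", 25, "SF")]
projected = load_with_projection(rows, [0, 2])   # [('alice', 'NY'), ('bob', 'SF')]
```

Either way, a real implementation's getSchema() would still describe the entire tuple, as the paragraph above notes; only the loaded data is narrowed.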
@@ -764,7 +764,7 @@ has methods to push operations from Pig 
-<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup">LoadCaster</a>

+<li><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup">LoadCaster</a>

 has methods to convert byte arrays to specific types. A loader implementation should implement
this if casts (implicit or explicit) from DataByteArray fields to other types need to be supported.
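A LoadCaster is essentially a set of byte-to-type converters. A minimal Python sketch of the idea (function names are illustrative; Pig's real interface defines Java-side methods such as bytesToLong and bytesToCharArray):

```python
# Sketch of LoadCaster-style conversions: raw bytes produced by a
# loader are cast to typed values when a Pig script demands them.
def bytes_to_long(raw):
    return int(raw.decode("utf-8"))

def bytes_to_chararray(raw):
    return raw.decode("utf-8")

casted = (bytes_to_long(b"42"), bytes_to_chararray(b"pig"))   # (42, 'pig')
```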
@@ -906,10 +906,10 @@ public class SimpleTextLoader extends Lo
 <title> Store Functions</title>
-<p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup">StoreFunc</a>

+<p><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup">StoreFunc</a>

 abstract class has the main methods for storing data and for most use cases it should suffice
to extend it. There is an optional interface which can be implemented to achieve extended
functionality: </p>
-<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreMetadata.java?view=markup">StoreMetadata:</a>

+<li><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/StoreMetadata.java?view=markup">StoreMetadata:</a>

 This interface has methods to interact with metadata systems to store schema and store statistics.
This interface is truly optional and should only be implemented if metadata needs to be stored.
@@ -1327,13 +1327,13 @@ register 'test.py' using org.apache.pig.
 <p>A typical test.py looks like this:</p>
 def helloworld():  
-  return ('Hello, World')
+  return 'Hello, World'
-def complex(word):  
-  return (str(word),long(word)*long(word))
+def complex(word):
+  return str(word),len(word)
 def square(num):
@@ -1396,7 +1396,7 @@ def squareSchema(input):
   return input
 #Percent- Percentage
 def percent(num, total):
   return num * 100 / total
@@ -1404,12 +1404,12 @@ def percent(num, total):
 # String Functions #
 #commaFormat- format a number with commas, 12345-> 12,345
 def commaFormat(num):
   return '{:,}'.format(num)
 #concatMultiple- concat multiple words
 def concatMult4(word1, word2, word3, word4):
   return word1+word2+word3+word4
@@ -1418,7 +1418,7 @@ def concatMult4(word1, word2, word3, wor
 #collectBag- collect elements of a bag into other bag
 #This is useful UDF after group operation
 def collectBag(bag):
   outBag = []
   for word in bag:
