pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigLatin" by OlgaN
Date Tue, 06 Nov 2007 00:57:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigLatin

------------------------------------------------------------------------------
  
  The following table lists the various methods of referring to data.
  
- || Method of Referring to Data || Example || Value for example tuple =t= || Notes ||
+ || Method of Referring to Data || Example || Value for example tuple '''t''' || Notes ||
- || '''Constant''' || ''''1.0'''', or ''''apache.org'''', or ''''blah'''' || Value constant
irrespective of =t= || ||
+ || '''Constant''' || ''''1.0'''', or ''''apache.org'''', or ''''blah'''' || Value constant
irrespective of '''t''' || ||
  || '''Field referred to by position''' || '''$0''' || Data Atom '1' || '''In Pig, positions
start at 0 and not 1''' ||
- || '''Field referred to by name''' || *f2*|| Bag {<2,3>,<4,6>,<5,7>} ||
||
+ || '''Field referred to by name''' || '''f2'''|| Bag {<2,3>,<4,6>,<5,7>}
|| ||
  || '''Projection''' of another data item || '''f2.$0''' || Bag {<2>,<4>,<5>}
- the bag f2 projected to the first field || ||
  || '''Map Lookup''' against another data item || '''f3#'apache'''' || Data Atom 'search'
||* It is the user's responsibility to ensure that a lookup is written only against a data map; otherwise
a runtime error is thrown <br>   * If the key being looked up does not exist, a Data
Atom with an empty string is returned||
  || '''Function''' applied to another data item || '''SUM(f2.$0)''' || 2+4+5 = 11 || SUM
is a builtin Pig function. See PigFunctions for how to write your own functions ||
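The rows of the table above can be mimicked with a small Python sketch. This only models the semantics and is not how Pig implements them. The tuple layout, the helper names, and the assumption that the names f1, f2, f3 correspond to positions 0, 1, 2 are illustrative choices, not part of Pig.

```python
# Hypothetical Python stand-in for the example tuple t from the table.
# Assumption: field names f1, f2, f3 map to positions 0, 1, 2.
t = ("1", [(2, 3), (4, 6), (5, 7)], {"apache": "search"})
FIELD_NAMES = {"f1": 0, "f2": 1, "f3": 2}


def by_position(tup, index):
    """Field referred to by position ($0, $1, ...); positions start at 0."""
    return tup[index]


def by_name(tup, name):
    """Field referred to by name (f2, ...)."""
    return tup[FIELD_NAMES[name]]


def project(bag, *positions):
    """Projection of a bag onto the given positions (f2.$0)."""
    return [tuple(row[p] for p in positions) for row in bag]


def map_lookup(data_map, key):
    """Map lookup (f3#'apache'); a missing key yields an empty string."""
    if not isinstance(data_map, dict):
        raise TypeError("lookup is only valid against a data map")
    return data_map.get(key, "")


first_field = by_position(t, 0)                      # '1'
f2 = by_name(t, "f2")                                # the bag
f2_first = project(f2, 0)                            # [(2,), (4,), (5,)]
looked_up = map_lookup(by_name(t, "f3"), "apache")   # 'search'
total = sum(row[0] for row in f2_first)              # 2 + 4 + 5 = 11
```

As in the table's notes, a lookup against something that is not a map fails at run time, and a lookup of a missing key returns an empty string.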
@@ -74, +74 @@

  
  Suppose we want to refer to the 3 fields as f1, f2, and f3. We can load this relation using
the following command:
  
- <blockquote><verbatim>
+ {{{
  A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
- </verbatim></blockquote>
+ }}}
  
- <noautolink>
  Here, PigStorage is the name of a "storage function" that takes care of parsing the file
into a Pig relation. This storage function expects simple newline-separated records with delimiter-separated
fields; it has one parameter, namely the field delimiter(s).  
- </noautolink>
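As a rough illustration of what such a storage function does, the following Python sketch splits newline-separated records into delimiter-separated fields. The function name, the sample data, and the choice to skip blank lines are assumptions made for the example. The real PigStorage may behave differently in edge cases.

```python
def load_delimited(text, delimiter="\t"):
    """Parse newline-separated records into tuples of string fields.

    Hypothetical model of a storage function with one parameter,
    the field delimiter. Blank lines are skipped here for simplicity.
    """
    relation = []
    for line in text.split("\n"):
        if line == "":
            continue
        relation.append(tuple(line.split(delimiter)))
    return relation


# Hypothetical file contents; not the data used elsewhere on this page.
sample = "1\t2\t3\n4\t2\t1\n"
rows = load_delimited(sample)
# Every field is still a string at this point, e.g. ('1', '2', '3').
```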
  
  Future Pig Latin commands can refer to the alias "A" and will receive data that has been
loaded from "myfile.txt". A will contain this data:
  
@@ -107, +105 @@

  ==== FILTER: Getting rid of data you are not interested in  ====
  Very often, the first thing that you want to do with data is to get rid of tuples that you
are not interested in. This can be done with the FILTER statement. For example,
  
- <blockquote><verbatim>
+ {{{
  Y = FILTER A BY f1 == '8';
- </verbatim></blockquote>
+ }}}
  
  The result is Y =
  
@@ -129, +127 @@

        * '''matches''' for regular expression matching, e.g., $0 matches ".*apache.*". The
format of regular expressions is that supported by Java
(http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html).
  
  Thus, a somewhat more complicated condition can be:
- <blockquote><verbatim>
+ {{{
  Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
- </verbatim></blockquote>
+ }}}
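To see which tuples such a compound condition keeps, here is a minimal Python sketch of the same predicate. The input tuples are hypothetical, and the fields are treated as integers for simplicity, whereas the Pig statement compares f1 against the constant '8'.

```python
def keep(row):
    """Model of (f1 == '8') OR (NOT (f2 + f3 > f1)) on integer fields."""
    f1, f2, f3 = row
    return (f1 == 8) or (not (f2 + f3 > f1))


# Hypothetical input relation.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
Y = [row for row in A if keep(row)]
# (1, 2, 3) and (4, 3, 3) are dropped: f1 is not 8 and f2 + f3 exceeds f1.
```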
  
  Note:
     * If you want to get rid of specific columns or fields, rather than whole tuples, you
should use the [[#ForeachS][FOREACH]] statement and not the filter statement.
@@ -142, +140 @@

  
  We can group the tuples in A according to some specification. A simple specification is
to group according to the value of one of the fields, e.g. the first field. This is done as
follows:
  
- <blockquote><verbatim>
+ {{{
  X = GROUP A BY f1;
  X = GROUP A BY (f1, f2 ..);
- </verbatim></blockquote>
+ }}}
  
  The result of the group statement consists of one tuple for each group. The first field
of the tuple has name `group` and has the value on which the grouping has been performed,
and the second field has name A and is a bag containing the tuples belonging to that group.
Thus, X = :
  
@@ -170, +168 @@

  
  We can ''co-group'' A and B, which means that we jointly group the tuples from A and B,
using this command:
  
- <blockquote><verbatim>
+ {{{
  COGROUP A BY f1, B BY $0;
- </verbatim></blockquote>
+ }}}
  
  You can co-group by multiple columns the same way as for group.
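The effect of co-grouping can be sketched in Python as follows. This is a model of the semantics only. The helper name and the input data are hypothetical, and the order of the output groups here simply follows insertion order, which Pig does not promise.

```python
def cogroup(a, b, key_a, key_b):
    """Jointly group two relations.

    Returns one (group, bag_of_a, bag_of_b) tuple per distinct key.
    A key seen in only one input gets an empty bag for the other.
    """
    groups = {}
    for row in a:
        groups.setdefault(key_a(row), ([], []))[0].append(row)
    for row in b:
        groups.setdefault(key_b(row), ([], []))[1].append(row)
    return [(key, bag_a, bag_b) for key, (bag_a, bag_b) in groups.items()]


def first_field(row):
    return row[0]


# Hypothetical inputs.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4)]
B = [(2, 4), (8, 9), (1, 3)]
result = cogroup(A, B, first_field, first_field)
```

Co-grouping by multiple columns corresponds to a key function that returns a tuple, for example `lambda row: (row[0], row[1])`.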
  
@@ -190, +188 @@

  
  Note that some of the bags are empty, which indicates that no tuples from the corresponding
input belong to that group. If we only wish to see groups for which ''both''
inputs have at least one tuple, we can write:
  
- <blockquote><verbatim>
+ {{{
  C = COGROUP A BY $0 INNER, B BY $0 INNER;
- </verbatim></blockquote>
+ }}}
  
  The result is C = 
  
@@ -206, +204 @@

  
  In addition to using columns to group the data, an arbitrary expression can be used:
  
- <blockquote><verbatim>
+ {{{
  grunt> cat a	    
  r1	1	2
  r2	2	1
@@ -234, +232 @@

  (2.0, {(r1, 1, 2), (r2, 2, 1)})
  (16.0, {(r3, 2, 8), (r4, 4, 4)})
  grunt> 
- </verbatim></blockquote>
+ }}}
  
  Note: 
     * If we want all tuples to go to a single group, e.g., when doing aggregates across entire
relations, we can write `GROUP A ALL`.
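The three grouping variants (by field, by expression, and ALL) can be modelled with one small Python helper. The rows below are taken from the grunt output shown above. The grouping expression is assumed to be the product of the second and third fields, which is consistent with the keys 2.0 and 16.0 in that output. The constant key used for the ALL case is a placeholder.

```python
def group(relation, key):
    """Return one (group, bag) pair per distinct key value."""
    groups = {}
    for row in relation:
        groups.setdefault(key(row), []).append(row)
    return list(groups.items())


a = [("r1", 1, 2), ("r2", 2, 1), ("r3", 2, 8), ("r4", 4, 4)]

# Group by a single field (here the second one).
by_field = group(a, lambda row: row[1])

# Group by an arbitrary expression (assumed: product of fields 1 and 2).
by_product = group(a, lambda row: float(row[1] * row[2]))

# GROUP ... ALL: every tuple lands in one group under a constant key.
group_all = group(a, lambda row: "all")
```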
@@ -256, +254 @@

  
  To select a subset of columns from a relation, use this command:
  
- <blockquote><verbatim>
+ {{{
  X = FOREACH A GENERATE f1, f2;
- </verbatim></blockquote>
+ }}}
  
  X contains tuples from A, but with only the first and second fields present in each tuple.
For the value of A given above, X =
  
@@ -280, +278 @@

  
  As with SQL, asterisk (*) is shorthand for all columns. For example, with:
  
- <blockquote><verbatim>
+ {{{
  X = FOREACH A GENERATE *;
- </verbatim></blockquote>
+ }}}
  
  X is identical to A.
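In Python terms, the two forms of projection look roughly like this. The helper name and the sample relation are hypothetical. The sketch only illustrates the column-selection semantics.

```python
def foreach_generate(relation, positions=None):
    """Keep only the listed field positions; None stands for '*'."""
    if positions is None:
        return [tuple(row) for row in relation]
    return [tuple(row[p] for p in positions) for row in relation]


# Hypothetical relation with fields f1, f2, f3 at positions 0, 1, 2.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4)]
X_two_fields = foreach_generate(A, positions=(0, 1))  # GENERATE f1, f2
X_star = foreach_generate(A)                          # GENERATE *
```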
  
@@ -291, +289 @@

  
  If one of the fields in the input relation is a non-atomic field, we can perform projection
on that field. For example, 
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE group, B.$1;
- </verbatim></blockquote>
+ }}}
  
  The result is:
  
@@ -305, +303 @@

  
  Here is another example, in which multiple nested columns are retained:
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE group, A.(f1, f2);
- </verbatim></blockquote>
+ }}}
  
  The result is:
  
@@ -322, +320 @@

  
  Pig has a number of built-in functions. An example is the SUM() function, which takes the
sum of a set of numbers in a bag. For example:
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE group, SUM(A.f1);
- </verbatim></blockquote>
+ }}}
  
  gives:
  
@@ -341, +339 @@

  
  Sometimes we want to eliminate nesting. This can be accomplished via the FLATTEN keyword,
which can be placed before any valid data item. For example:
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE group, FLATTEN(A);
- </verbatim></blockquote>
+ }}}
  
  yields:
  
@@ -357, +355 @@

  
  As another example,
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE group, FLATTEN(A.f3);
- </verbatim></blockquote>
+ }}}
  
  yields:
  
@@ -373, +371 @@

  
  As a final example,
  
- <blockquote><verbatim>
+ {{{
  FOREACH C GENERATE FLATTEN(A.(f1, f2)), FLATTEN(B.$1);
- </verbatim></blockquote>
+ }}}
  
  yields:
  
@@ -396, +394 @@

  
  The equi-join of A and B on column 0 can be expressed as follows:
  
- <blockquote><verbatim>
+ {{{
  JOIN A BY $0, B BY $0;
- </verbatim></blockquote>
+ }}}
  
  which is equivalent to:
  
- <blockquote><verbatim>
+ {{{
  X = COGROUP A BY $0 INNER, B BY $0 INNER;
  FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
- </verbatim></blockquote>
+ }}}
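As an aside, the equivalence between the two formulations can be checked with a short Python sketch. The helpers and the input data are hypothetical. The sketch assumes that flattening two bags in the same GENERATE pairs every tuple of one bag with every tuple of the other, which is what the stated equivalence with JOIN implies.

```python
from itertools import product


def cogroup_inner(a, b, key_a, key_b):
    """Co-group, keeping only groups where both bags are non-empty."""
    groups = {}
    for row in a:
        groups.setdefault(key_a(row), ([], []))[0].append(row)
    for row in b:
        groups.setdefault(key_b(row), ([], []))[1].append(row)
    return [
        (key, bag_a, bag_b)
        for key, (bag_a, bag_b) in groups.items()
        if bag_a and bag_b
    ]


def flatten_both(groups):
    """FLATTEN(A), FLATTEN(B): cross product of the two bags per group."""
    return [
        row_a + row_b
        for _, bag_a, bag_b in groups
        for row_a, row_b in product(bag_a, bag_b)
    ]


def equi_join(a, b, key_a, key_b):
    """Direct nested-loop equi-join, used here only for comparison."""
    return [ra + rb for ra in a for rb in b if key_a(ra) == key_b(rb)]


def first_field(row):
    return row[0]


# Hypothetical inputs.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (8, 4, 3)]
B = [(8, 9), (1, 3), (2, 4), (8, 7)]
via_cogroup = flatten_both(cogroup_inner(A, B, first_field, first_field))
direct = equi_join(A, B, first_field, first_field)
```

Both lists contain the same five joined tuples: one for key 1 and four for key 8.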
  
  The result is:
  
@@ -491, +489 @@

  
  To compute the cross product (also known as "cartesian product") of two or more relations,
use:
  
- <blockquote><verbatim>
+ {{{
  X = CROSS A, B;
- </verbatim></blockquote>
+ }}}
  
  Based on the values of A and B given earlier in the document, the result is X =
  
@@ -518, +516 @@

  
  We can vertically glue together the contents of multiple aliases into a single alias with the
UNION command. For example,
  
- <blockquote><verbatim>
+ {{{
  X = UNION A, B;
- </verbatim></blockquote>
+ }}}
  
  The result is X =
  
@@ -554, +552 @@

  
  An example of a SPLIT statement is the following,
  
- <blockquote><verbatim>
+ {{{
  SPLIT A INTO X IF $0 < 7, Y IF ($0 > 2 AND $0 <> 7);
- </verbatim></blockquote>
+ }}}
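A point worth noting about these two conditions is that they overlap and do not cover every value, so a tuple may land in both outputs or in neither. The following Python sketch models that behaviour. The helper name and the input tuples are hypothetical.

```python
def split(relation, **conditions):
    """Route each tuple to every output whose condition it satisfies."""
    return {
        name: [row for row in relation if condition(row)]
        for name, condition in conditions.items()
    }


# Hypothetical input relation.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (7, 2, 5)]
outputs = split(
    A,
    X=lambda row: row[0] < 7,
    Y=lambda row: row[0] > 2 and row[0] != 7,
)
# (4, 2, 1) satisfies both conditions; (7, 2, 5) satisfies neither.
```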
  
  The output is 
  
@@ -590, +588 @@

  
  The syntax for doing the nested operations is very similar to the regular syntax and is
demonstrated by the following example:
  
- <blockquote><verbatim>
+ {{{
  W = LOAD '...' AS (url, outlink);
  G = GROUP W BY url;
  R = FOREACH G {
@@ -599, +597 @@

  	DW = DISTINCT PW;
  	GENERATE group, COUNT(DW);
  }
- </verbatim></blockquote>
+ }}}
  
  Notes:
     * Note the nested block within the FOREACH...GENERATE statement. The syntax is the same
as regular Pig Latin syntax.
