pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigLatin" by OlgaN
Date Wed, 07 Nov 2007 02:34:08 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigLatin

------------------------------------------------------------------------------
- [[Anchor(Introduction)]]
+ [[Anchor(Introduction_to_Pig_Latin)]]
  == Introduction to Pig Latin ==
- 
- [[TableOfContents]]
  
  So you want to learn Pig Latin. Welcome! Let's begin with the data types.
  
@@ -12, +10 @@

  
  Every piece of data in Pig has one of these four types:
  
-    * A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be
used as either a string or a number (see #Filter). Examples of data atoms are 'apache.org'
and '1.0'.
+    * A '''Data Atom''' is a simple atomic data value. It is stored as a string but can be
used as either a string or a number (see [#FILTER:_Getting_rid_of_data_you_are_not_interested_in_
Filters]). Examples of data atoms are 'apache.org' and '1.0'.
     * A '''Tuple''' is a data record consisting of a sequence of "fields". Each field is
a piece of data of any type (data atom, tuple or data bag). We denote tuples with < >
bracketing. An example of a tuple is <apache.org,1.0>.
     * A '''Data Bag''' is a set of tuples (duplicate tuples are allowed). You may think of
it as a "table", except that Pig does not require that the tuple field  types match, or even
that the tuples have the same number of fields! (It is up to you whether you want these properties.)
We denote bags by { } bracketing. Thus, a data bag could be {<apache.org,1.0>, <flickr.com,0.8>}
     * A '''Data Map''' is a map from keys that are string literals to values that can be any data type. Think of it as a !HashMap<String,X> where X can be any of the 4 Pig data types. A Data Map supports the expected get and put interface. We denote maps by [ ] bracketing, with ":" separating the key and the value, and ";" separating successive key-value pairs. Thus, a data map could be [ 'apache' : <'search', 'news'> ; 'cnn' : 'news' ]. Here, the key 'apache' is mapped to the tuple with 2 atomic fields 'search' and 'news', while the key 'cnn' is mapped to the data atom 'news'.
@@ -24, +22 @@

  {{{
  t = < 1, {<2,3>,<4,6>,<5,7>}, ['apache':'search']>
  }}}
- Thus, =t= has 3 fields. Let these fields have names f1, f2, f3. Field f1 is an atom with
value 1. Field f2 is a bag having 3 tuples. Field f3 is a data map having 1 key.
+ Thus, `t` has 3 fields. Let these fields have names f1, f2, f3. Field f1 is an atom with
value 1. Field f2 is a bag having 3 tuples. Field f3 is a data map having 1 key.
  
  The following table lists the various methods of referring to data.
  
- || Method of Referring to Data || Example || Value for example tuple '''t''' || Notes ||
+ || Method of Referring to Data || Example || Value for example tuple `t` || Notes ||
- || '''Constant''' || ''''1.0'''', or ''''apache.org'''', or ''''blah'''' || Value constant
irrespective of '''t''' || ||
+ || '''Constant''' || ''''1.0'''', or ''''apache.org'''', or ''''blah'''' || Value constant
irrespective of `t` || ||
  || '''Field referred to by position''' || '''$0''' || Data Atom '1' || '''In Pig, positions
start at 0 and not 1''' ||
  || '''Field referred to by name''' || '''f2'''|| Bag {<2,3>,<4,6>,<5,7>}
|| ||
  || '''Projection''' of another data item || '''f2.$0''' || Bag {<2>,<4>,<5>}
- the bag f2 projected to the first field || ||
  || '''Map Lookup''' against another data item || '''f3#'apache'''' || Data Atom 'search'
||* User's responsibility to ensure that a lookup is written only against a  data map, otherwise
a runtime error is thrown <br>   * If the key being looked up does not exist, a Data
Atom with an empty string is returned||
  || '''Function''' applied to another data item || '''SUM(f2.$0)''' || 2+4+5 = 11 || SUM
is a builtin Pig function. See PigFunctions for how to write your own functions ||
  || '''Infix Expression''' of other data items || '''COUNT(f2) + f1 / '2.0'''' || 3 + 1 /
2.0 = 3.5 ||  ||
- || '''Bincond''', i.e., the value of the data item is chosen according to some condition
|| '''(f1 = =  '1' ? '2' : COUNT(f2))''' || '2' since f1=='1' is true. If f1 were != '1',
then the value of this data item for t would be COUNT(f2)=3 || See [[#CondS][Conditions]]
for what the format of the condition in the bincond can be ||
+ || '''Bincond''', i.e., the value of the data item is chosen according to some condition || '''(f1 == '1' ? '2' : COUNT(f2))''' || '2' since f1=='1' is true. If f1 were != '1', then the value of this data item for t would be COUNT(f2)=3 || See [#Specifying_Conditions Conditions] for what the format of the condition in the bincond can be ||
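  Several of these referencing methods can be combined in a single expression. The following is a hedged sketch against the example tuple `t` above (the alias `T` for a relation containing `t` is assumed; FOREACH is covered in detail below):
  
  {{{
  X = FOREACH T GENERATE f2.$0, f3#'apache', COUNT(f2) + f1 / '2.0';
  }}}
  
  Per the table, for tuple `t` these three data items evaluate to the bag {<2>,<4>,<5>}, the atom 'search', and the value 3.5.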
  
  
- ===Pig Latin Statements ===
+ === Pig Latin Statements ===
  
  A Pig Latin statement is a command that produces a '''Relation'''. A relation is simply
a data bag with a name. That name is called the relation's '''alias'''. The simplest Pig Latin
statement is LOAD, which reads a relation from a file in the file system. Other Pig Latin
statements process one or more input relations, and produce a new relation as a result.
  
@@ -48, +46 @@

  Examples:
  
  {{{
- grunt> A = load 'mydoc' using PigStorage()
+ grunt> A = load 'mydata' using PigStorage()
  as (a, b, c);
  grunt> B = group A by a;
  grunt> C = foreach B {
@@ -57, +55 @@

  }
  grunt> 
  }}}
-  
- [[Anchor(Load)]]
+ 
+ [[Anchor(LOAD:_Loading_data_from_a_file)]]
  ==== LOAD: Loading data from a file ====
  
  Before you can do any processing, you first need to load the data. This is done by the LOAD
statement. Suppose we have a tab-delimited file called "myfile.txt" that contains a relation,
whose contents are:
@@ -78, +76 @@

  A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
  }}}
  
- Here, PigStorage is the name of a "storage function" that takes care of parsing the file
into a Pig relation. This storage function expects simple newline-separated records with delimiter-separated
fields; it has one parameter, namely the field delimiter(s).  
+ Here, !PigStorage is the name of a "storage function" that takes care of parsing the file
into a Pig relation. This storage function expects simple newline-separated records with delimiter-separated
fields; it has one parameter, namely the field delimiter(s).  
  
  Future Pig Latin commands can refer to the alias "A" and will receive data that has been
loaded from "myfile.txt". A will contain this data:
  
@@ -99, +97 @@

     * If your records are stored in some special format that our functions can't parse, you
can of course write your own storage function (see PigFunctions).
     * In Pig, relations are ''unordered'', which means we do not guarantee that tuples are
processed in any particular order. (In fact, processing may be parallelized, in which case
tuples are not processed according to ''any'' total ordering.)
     * If you pass a directory name to LOAD, it will load all files within the directory.
-    * You can use hadoop supported globbing to specify a file or list of files to load. 
See http://lucene.apache.org/hadoop/api/org/apache/hadoop/fs/FileSystem.html#globPaths(org.apache.hadoop.fs.Path)][
the hadoop glob documentation for details on globbing syntax.  Globs can be used at the file
system or directory levels.  (This functionality is available as of pig 1.1e.)
+    * You can use Hadoop-supported globbing to specify a file or list of files to load. See [http://lucene.apache.org/hadoop/api/org/apache/hadoop/fs/FileSystem.html#globPaths(org.apache.hadoop.fs.Path) the hadoop glob documentation] for details on globbing syntax. Globs can be used at the file system or directory levels. (This functionality is available as of Pig 1.1e.)
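  For example, a glob-based load might look as follows (the path is hypothetical):
  
  {{{
  A = LOAD '/data/mylogs-*.txt' USING PigStorage('\t') AS (f1, f2, f3);
  }}}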
    
- [[Anchor(Filter)]]
+ 
+ [[Anchor(FILTER:_Getting_rid_of_data_you_are_not_interested_in_)]]
  ==== FILTER: Getting rid of data you are not interested in  ====
  Very often, the first thing that you want to do with data is to get rid of tuples that you
are not interested in. This can be done by the filter statement. For example,
  
@@ -116, +115 @@

  <8, 4, 3>
  }}}
  
- [[Anchor(Condition)]]
+ [[Anchor(Specifying_Conditions)]]
  ===== Specifying Conditions =====
  The condition following the keyword BY can be much more general than shown above. 
     * The logical connectives AND, OR and NOT can be used to build a condition from various
atomic conditions. 
-    * Each atomic condition can be of the form `&lt;Data Item&gt; &lt;compOp&gt;
&lt;Data Item&gt;` (see [[#DataItems][Data Items]] for what the format of data items
can be). 
+    * Each atomic condition can be of the form `&lt;Data Item&gt; &lt;compOp&gt;
&lt;Data Item&gt;` (see [#Data_Items Data Items] for what the format of data items
can be). 
     * The comparison operator compOp can be one of 
        * '''==, <nop>!=, >, >=, <, or <=''' for '''numerical''' comparisons.
'''Note that if these operators are used on non-numeric data, a runtime error will be thrown'''.
        * '''eq, neq, gt, gte, lt, or lte''' for string comparisons
-       * '''matches''' for regular expression matching, e.g., $0 matches "*apache*". The
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html][format of regular expressions
is that supported by Java.
+       * '''matches''' for regular expression matching, e.g., $0 matches ".*apache.*". The [http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html format] of regular expressions is that supported by Java.
  
  Thus, a somewhat more complicated condition can be
  {{{
@@ -132, +131 @@

  }}}
  
  Note:
-    * If you want to get rid of specifc columns or fields, rather than whole tuples, you
should use the [[#ForeachS][FOREACH]] statement and not the filter statement.
+    * If you want to get rid of specific columns or fields, rather than whole tuples, you should use the [#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH] statement and not the filter statement.
     * If the builtin comparison operators are not sufficient for your needs, you can write
your own '''filter function''' (see PigFunctions for details). Suppose you wrote a new equality
function (say myEquals). Then the first example above can be written as `Y = FILTER A BY myEquals(f1,'8');`
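  As a further illustration, here is a sketch of a filter combining a logical connective, a string comparison, and regular expression matching (the field names and values are illustrative):
  
  {{{
  Y = FILTER A BY NOT (f1 eq 'apache.org') AND f2 matches ".*news.*";
  }}}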
  
- [[Anchor(Cogroup)]]
+ [[Anchor(COGROUP:_Getting_the_relevant_data_together)]]
  ==== COGROUP: Getting the relevant data together ====
  
  We can group the tuples in A according to some specification. A simple specification is
to group according to the value of one of the fields, e.g. the first field. This is done as
follows:
@@ -241, +240 @@

    * If the criteria on which the grouping has to be performed is more complicated than just the values of some fields, you can write your own Group Function, say myGroupFunc. Then we can write `GROUP A by myGroupFunc(*)`. Here "*" is a shorthand for all fields in the tuple. See PigFunctions for details.
     * A Group function can return multiple values for a tuple, i.e., a single tuple can belong
to multiple groups. 
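  COGROUP extends grouping to multiple relations at once; a minimal sketch with two assumed aliases A and B:
  
  {{{
  C = COGROUP A BY f1, B BY $0;
  }}}
  
  Roughly, each output tuple then contains a group value followed by one bag of matching tuples from A and one from B.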
  
- [[Anchor(Foreach)]]
+ 
+ [[Anchor(FOREACH_..._GENERATE:_Applying_transformations_to_the_data)]]
  ==== FOREACH ... GENERATE: Applying transformations to the data ====
- The FOREACH statement is used to apply transformations to the data and to generate new [[#DataItems][data
items]]. The basic syntax is
+ The FOREACH statement is used to apply transformations to the data and to generate new [#Data_Items
data items]. The basic syntax is
  
  `<output-alias> = FOREACH <input-alias> GENERATE <data-item 1>, <data-item
2>, ... ;`
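  For instance, assuming B was produced by a GROUP statement as in the earlier grunt example, a sketch would be:
  
  {{{
  X = FOREACH B GENERATE $0, COUNT($1);
  }}}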
  
@@ -419, +419 @@

  
  <i>Note:</i> On flattening, we might end up with fields that have the same name but which came from different tables. They are disambiguated by prepending `<alias>::` to their names. See PigLatinSchemas.
  
- [[Anchor(Order)]]
+ [[Anchor(ORDER:_Sorting_data_according_to_some_fields)]]
  ==== ORDER: Sorting data according to some fields ====
  We can sort the contents of any alias according to any set of columns. For example,
  
- <blockquote>
  {{{
  X = ORDER A BY $2;
  }}}
- </blockquote>
  
  One possible output (since ties are resolved arbitrarily) is X =
  {{{
@@ -441, +439 @@

  
  Notes:
    * From the point of view of the Pig data model, A and X contain the same thing (since we mentioned earlier that relations are logically unordered). If you process X further, there is no guarantee that tuples will be processed in order.
-    * However, the only guarantee is that if we retrieve the contents of X (see [[#RetrievingR][Retreiving
Results]]), they are guaranteed to be in order of $2 (the third field).
+    * However, if we retrieve the contents of X (see [#Retrieving_Results Retrieving Results]), they are guaranteed to be in order of $2 (the third field).
     * To sort according to the combination of all columns, you can write `ORDER A by *` 
  
- [[Anchor(Distinct)]]
+ [[Anchor(DISTINCT:_Eliminating_duplicates_in_data)]]
  ==== DISTINCT: Eliminating duplicates in data ====
  We can eliminate duplicates in the contents of any alias. For example, suppose we first
say
  
@@ -465, +463 @@

  
  Now, if we say
  
- <blockquote>
  {{{
  Y = DISTINCT X;
  }}}
- </blockquote>
  
  The output is Y =
  
@@ -481, +477 @@

  
  Notes:
    * Note that the original order is not preserved (another illustration of the fact that Pig relations are unordered). In fact, to eliminate duplicates, the input will first be sorted.

-    * You can '''not''' request for distinct on a subset of the columns. This can be done
by [[#ProjectS][projection]] followed by the DISTINCT statement as in the above example.
+    * You can '''not''' request distinct on a subset of the columns directly. Instead, use [#Projection projection] followed by the DISTINCT statement, as in the above example.
  
  
- [[Anchor(Cross)]]
+ [[Anchor(CROSS:_Computing_the_cross_product_of_multiple_relations)]]
  ==== CROSS: Computing the cross product of multiple relations ====
  
  To compute the cross product (also known as "cartesian product") of two or more relations,
use:
@@ -511, +507 @@

  Notes:
    * This is an expensive operation and should not usually be necessary.
  
- [[Anchor(Union)]]
+ [[Anchor(UNION:_Computing_the_union_of_multiple_relations)]]
  ==== UNION: Computing the union of multiple relations ====
  
  We can vertically glue together contents of multiple aliases into a single alias by the
UNION command. For example,
@@ -545, +541 @@

        * be able to handle the different kinds of tuples while processing the result of the
union.
     * UNION does not eliminate duplicate tuples.
  
- [[Anchor(Split)]]
+ [[Anchor(SPLIT:_Separating_data_into_different_relations)]]
  ==== SPLIT: Separating data into different relations ====
  The SPLIT statement, in some sense, is the converse of the UNION statement. It is used to
partition the contents of a relation into multiple relations based on desired conditions.

  
@@ -577, +573 @@

     * This construct is useful if you want to logically output multiple things from your
function. You can then attach a field to the output of your function, and later split on that
field to get the multiple outputs.
     * One tuple can go to multiple partitions, e.g., the <4, 2, 1> tuple above.
     * A tuple might also go to none of the partitions, if it doesn't satisfy any of the conditions,
e.g., the <7, 2, 5> tuple above.
-    * [[#CondS][Conditions]] can be specified as mentioned in the Filter statement.
+    * [#Specifying_Conditions Conditions] can be specified as mentioned in the Filter statement.
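  A minimal sketch of the statement form (the conditions are illustrative, not taken from the example above):
  
  {{{
  SPLIT A INTO X IF f1 < '7', Y IF f3 == '5';
  }}}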
  
  
  [[Anchor(Nested_Operations_in_FOREACH...GENERATE)]]
  ==== Nested Operations in FOREACH...GENERATE ====
  If one of the fields in the input relation is a data bag, the nested data bag can be treated as an '''inner''' or a '''nested relation'''. Consequently, in a FOREACH...GENERATE statement, we can perform many of the operations on this nested relation that we can on a regular relation.

  
- The specific operations that we can do on the nested relations are [[#FilterS][FILTER]],
[[#OrderS][ORDER]], and [[#DistinctS][DISTINCT]]. Note that we do not allow FOREACH...GENERATE
on the nested relation, since that leads to the possibility of arbitrary number of nesting
levels. 
+ The specific operations that we can do on the nested relations are [#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER], [#ORDER:_Sorting_data_according_to_some_fields ORDER], and [#DISTINCT:_Eliminating_duplicates_in_data DISTINCT]. Note that we do not allow FOREACH...GENERATE on the nested relation, since that leads to the possibility of an arbitrary number of nesting levels. 
  
  The syntax for doing the nested operations is very similar to the regular syntax and is
demonstrated by the following example:
  
@@ -605, +601 @@

    * Within the nested block, one can do nested filtering, projection, sorting, and duplicate elimination.
  
  
- [[Anchor(Increasing_parallelism)]]
+ [[Anchor(Increasing_the_parallelism)]]
  === Increasing the parallelism ===
  
  To increase the parallelism of a job, include the PARALLEL clause in any of your Pig Latin statements.
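  For example (the degree of parallelism shown is arbitrary):
  
  {{{
  B = GROUP A BY f1 PARALLEL 20;
  }}}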
@@ -634, +630 @@

    * In the current (1.2) and earlier releases, storage functions are case sensitive. This will be changed in future releases.
     * !PigStorage can only store flat tuples, i.e., tuples having atomic fields. If you want
to store nested data, use !BinStorage instead.
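  A typical invocation, with a hypothetical output path, would be:
  
  {{{
  STORE A INTO 'myoutput' USING PigStorage('\t');
  }}}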
  
- [[Anchor(Experimenting)]]
+ [[Anchor(Experimenting_with_Pig_Latin_syntax)]]
  === Experimenting with Pig Latin syntax ===
  
  To experiment with the Pig Latin syntax, you can use the !StandAloneParser. Invoke it by
the following command:
  
- <blockquote>
  {{{
  java -cp pig.jar org.apache.pig.StandAloneParser
  }}}
- </blockquote>
  
  
  Example usage:
@@ -658, +652 @@

  ---- Query parsed successfully ---
  Current aliases: A->null, 
  > D = FOREACH C blah blah blah;
- Parse error: org.apache..pig.impl.logicalLayer.parser.ParseException: Encountered "blah"
at line 1, column 15.
+ Parse error: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered "blah"
at line 1, column 15.
  Was expecting one of:
      "generate" ...
      "{" ...
