In this example A is a relation or bag of tuples. You can think of this bag as an outer bag.

-A = LOAD 'data' as (f1:int, f2:int, f3;int);
+A = LOAD 'data' as (f1:int, f2:int, f3:int);

DUMP A;
(1,2,3)
(4,2,1)

@@ -1629,7 +1629,7 @@ a: Schema for a unknown

Having a deterministic schema is very powerful; however, sometimes it comes at the cost of performance. Consider the following example:

-A = load âinputâ as (x, y, z);
+A = load 'input' as (x, y, z);
B = foreach A generate x+y;

@@ -5120,7 +5120,7 @@ X = FILTER A BY (f1 matches '.*apache.*'
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
-store B into âresultsâ;
+store B into 'results';

Input (students):
joe smith 20 3.5

@@ -5138,7 +5138,7 @@ Output (results):
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate {(name, age)}, {name, age};
-store B into âresultsâ;
+store B into 'results';

Input (students):
joe smith 20 3.5

@@ -5156,7 +5156,7 @@ Output (results):
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate [name, gpa];
-store B into âresultsâ;
+store B into 'results';

Input (students):
joe smith 20 3.5

@@ -7522,7 +7522,7 @@ DUMP A;
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
-A = LOAD 'myfile.txt' USING PigStorage(â\tâ) AS (f1:int, f2:int, f3:int);
+A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

DESCRIBE A;
a: {f1: int,f2: int,f3: int}

@@ -8922,8 +8922,8 @@ B = FOREACH A GENERATE myFunc(\$0);
-
- REGISTER
+ REGISTER (a jar/script)

Registers a JAR file so that the UDFs in the file can be used.

@@ -8995,7 +8995,222 @@ register jars/*.jar
-
+
+ REGISTER (an artifact)
+

+ Instead of figuring out the dependencies manually, downloading them, and registering each jar with the register command above, you can specify the artifact's coordinates and let Pig automatically resolve, download, and register the required dependencies.

+ + +
+ Syntax +

+ To download an artifact (and its dependencies), you need to specify its group, module and version, following the syntax shown below. This command downloads the specified jar and all its dependencies and loads it into the classpath.

+ + + + +
 + REGISTER ivy://group:module:version?querystring +
+
+ + +
+ Terms

+ group
+ Which module group the module comes from. Translates directly to a Maven groupId or an Ivy Organization.

+ module
+ The name of the module to load. Translates directly to a Maven artifactId or an Ivy artifact.

+ version
+ The version of the module to use. You can specify a specific version or use "+" or "*" to use the latest version.

+ querystring
+ "&" separated key-value pairs, used for example to exclude all or specific dependencies.
+ +
+ Usage + +

+ The register artifact command is an extension of the register command above, used to register a jar. In addition to registering a jar from the local system or from HDFS, you can now specify the coordinates of the artifact and Pig will download the artifact (and its dependencies if needed) from the configured repository.

+ +
+ Parameters Supported in the Query String + +
+
• + Transitive +

+ Transitive specifies whether you need the dependencies along with the jar being registered. By setting transitive to false in the querystring, you tell Pig to register only the artifact, without its dependencies: only the specified artifact is downloaded. The default value of transitive is true.

+ Syntax + + + + +  + REGISTER ivy://org:module:version?transitive=false +
+
• +
• + Exclude +

+ While registering an artifact, if you wish to exclude some dependencies you can specify them using the exclude key. For example, if you want to use a specific version of a dependent jar that does not match the automatically fetched version, you can exclude such dependencies by specifying a comma-separated list, and register the dependent jar separately.

+ Syntax + + + + +  + REGISTER ivy://org:module:version?exclude=org:mod,org:mod,... +
+
• +
• + Classifier +

+ Some Maven dependencies need classifiers in order to resolve. You can specify them using the classifier key.

+ Syntax + + + + +  + REGISTER ivy://org:module:version?classifier=value +
+
• +
+
+ +
+ Other properties + +
+
• +

+
• + +
• +

+ This command can be used alongside, or in place of, the register jar command wherever it is used, including in macros.

+

+
• + +
• +

+ Group/Organization and Version are optional fields; in such cases you can leave them blank.

+

+
• + +
• +

+ The repositories can be configured using an ivysettings file. Pig searches for an ivysettings.xml file in the following locations, in order: PIG_CONF_DIR > PIG_HOME > classpath.

+

+
• +
+
+
+ + +
+ Examples + +
+
• +

Registering an Artifact and all its dependencies.

+ + -- Both are the same

+ REGISTER ivy://org.apache.avro:avro:1.5.1

+ REGISTER ivy://org.apache.avro:avro:1.5.1?transitive=true +
• + +
• +

Registering an artifact without getting its dependencies.

+ + REGISTER ivy://org.apache.avro:avro:1.5.1?transitive=false +
• + +
• +

Registering the latest artifact.

+ + -- Both of the following syntaxes work.

+ REGISTER ivy://org.apache.avro:avro:+

+ REGISTER ivy://org.apache.avro:avro:* +
• + +
• +

Registering an artifact by excluding specific dependencies.

+ + REGISTER ivy://org.apache.pig:pig:0.10.0?exclude=commons-cli:commons-cli,commons-codec:commons-codec +
• + +
• +

Specifying a classifier

+ + REGISTER ivy://net.sf.json-lib:json-lib:2.4?classifier=jdk15 +
• + +
• +

Registering an artifact without a group or organization. Just skip them.

+ + REGISTER ivy://:module: +
• +
+
+
+
+
Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/func.xml?rev=1733627&r1=1733626&r2=1733627&view=diff
==============================================================================
--- pig/branches/spark/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/branches/spark/src/docs/src/documentation/content/xdocs/func.xml Fri Mar 4 18:17:39 2016
@@ -167,10 +167,10 @@ DUMP C;

AVG

-

long

+

double

-

long

+

double

double

@@ -294,6 +294,87 @@ team_parkyearslist = FOREACH (GROUP team +
+ Bloom +

Bloom filters are a common way to select a limited set of records before moving data for a join or other heavyweight operation.

+ +
+ Syntax + + + + + + + +
 + BuildBloom(String hashType, String mode, String vectorSize, String nbHash) + + Bloom(String filename) +
+ +
+ Terms + + + + + + + + + + + + + + + + + + + + + +
 hashType
 The type of hash function to use. Valid values are 'jenkins' and 'murmur'.

 mode
 Ignored, though by convention it should be "fixed" or "fixedsize".

 vectorSize
 The number of bits in the bloom filter.

 nbHash
 The number of hash functions used in constructing the bloom filter.

 filename
 File containing the serialized Bloom filter.
+

See Bloom Filter for + a discussion of how to select the number of bits and the number of hash + functions. +

+
+ +
+ Usage +

Bloom filters are a common way to select a limited set of records before moving data for a join or other heavyweight operation. For example, to join a very large data set L with a smaller set S, when it is known that the number of keys in L that will match with S is small, building a bloom filter on S and then applying it to L before the join can greatly reduce the number of records from L that have to be moved from the map to the reduce, thus speeding up the join.

+

+
+
+
+ Examples

 define bb BuildBloom('jenkins', 'fixed', '128', '3');
 small = load 'S' as (x, y, z);
 grpd = group small all;
 fltrd = foreach grpd generate bb(small.x);
 store fltrd into 'mybloom';
 exec;
 define bloom Bloom('mybloom');
 large = load 'L' as (a, b, c);
 flarge = filter large by bloom(a);
 joined = join small by x, flarge by a;
 store joined into 'results';
+
@@ -634,8 +715,8 @@ SSN = load 'ssn.txt' using PigStorage()
 SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long, name:chararray);

-/* do a left outer join of SSN with SSN_Name */
-X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn;
+/* do a cogroup of SSN with SSN_Name */
+X = COGROUP SSN by ssn, SSN_NAME by ssn;

 /* only keep those ssn's for which there is no name */
 Y = filter X by IsEmpty(SSN_NAME);

@@ -915,7 +996,8 @@ DUMP X;
PluckTuple -

Allows the user to specify a string prefix, and then filter for the columns in a relation that begin with that prefix.

+

Allows the user to specify a string prefix and then filter for the columns in a relation that begin with that prefix or match that regex pattern. Optionally, include the flag 'false' to filter for columns that do not match the prefix or regex pattern.

Syntax @@ -923,6 +1005,7 @@ DUMP X;

DEFINE pluck PluckTuple(expression1)

+

DEFINE pluck PluckTuple(expression1,expression3)

pluck(expression2)

@@ -937,7 +1020,7 @@ DUMP X;

expression1

-

A prefix to pluck by

+

A prefix or a regex pattern to pluck by

@@ -948,6 +1031,14 @@ DUMP X;

The fields to apply the pluck to, usually '*'

+ + +

expression3

+ + +

A boolean flag to indicate whether to include or exclude matching columns

+ +
@@ -964,6 +1055,10 @@ describe c;
 c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
 describe d;
 d: {plucked::a::x: bytearray,plucked::a::y: bytearray}
+DEFINE pluckNegative PluckTuple('a::','false');
+d = foreach c generate FLATTEN(pluckNegative(*));
+describe d;
+d: {plucked::b::x: bytearray,plucked::b::y: bytearray}
@@ -1369,23 +1464,20 @@ DUMP B;

To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.

-A = load âmyinput.gzâ;
-store A into âmyoutput.gzâ;
+A = load 'myinput.gz';
+store A into 'myoutput.gz';

To work with bzip compressed files, the input/output files need to have a .bz or .bz2 extension. Because the compression is block-oriented, bzipped files can be split across multiple maps.

-A = load âmyinput.bzâ;
-store A into âmyoutput.bzâ;
+A = load 'myinput.bz';
+store A into 'myoutput.bz';
-

Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED FILES generated in this manner:

+

Note: PigStorage and TextLoader correctly read compressed files as long as they are NOT CONCATENATED bz/bz2 FILES generated in this manner:

• -

cat *.gz > text/concat.gz

-
• -
• cat *.bz > text/concat.bz

• @@ -1393,7 +1485,7 @@ store A into âmyoutput.bzâ;

-

If you use concatenated gzip or bzip files with your Pig jobs, you will NOT see a failure but the results will be INCORRECT.

+

If you use concatenated bzip files with your Pig jobs, you will NOT see a failure but the results will be INCORRECT.

@@ -1525,7 +1617,7 @@ dump X;

@@ -1648,11 +1740,11 @@ STORE X INTO 'output' USING PigDump();

'options'

@@ -1928,7 +2022,7 @@ STORE A INTO 'hbase://users_table' USING
- JsonLoader( [âschemaâ] )
+ JsonLoader( ['schema'] )

- A string that contains space-separated options (âoptionA optionB optionCâ)
+ A string that contains space-separated options ('optionA optionB optionC')

 Currently supported options are:
- (âschemaâ) - Stores the schema of the relation using a hidden JSON file.
- (ânoschemaâ) - Ignores a stored schema during the load.
+ ('schema') - Stores the schema of the relation using a hidden JSON file.
+ ('noschema') - Ignores a stored schema during the load.
 ('tagsource') - (deprecated, use tagPath instead) Add a first column that indicates the input file of the record.
 ('tagPath') - Add a first column that indicates the input path of the record.
 ('tagFile') - Add a first column that indicates the input file name of the record.

@@ -1863,6 +1955,8 @@ A = LOAD 'data' USING TextLoader();
 less than this value
 -timestamp=timestamp Return cell values that have a creation timestamp equal to this value
+ -includeTimestamp= Record will include the timestamp after the rowkey on store (rowkey, timestamp, ...)
+ -includeTombstone= Record will include a tombstone marker on store after the rowKey and timestamp (if included) (rowkey, [timestamp,] tombstone, ...)
- Avrostorage(['schema|record name'], ['options'])
+ AvroStorage(['schema|record name'], ['options'])
@@ -5934,7 +6028,7 @@ In this example, student names (type cha
 A = load 'students' as (name:chararray, age:int, gpa:float);
 B = foreach A generate TOMAP(name, gpa);
-store B into âresultsâ;
+store B into 'results';

 Input (students)
 joe smith 20 3.5

@@ -6042,6 +6136,74 @@ bottomResults = FOREACH D {
+Hive UDF +

Pig can invoke all types of Hive UDFs, including UDF, GenericUDF, UDAF, GenericUDAF and GenericUDTF. Depending on the Hive UDF you want to use, you declare it in Pig with HiveUDF (handles UDF and GenericUDF), HiveUDAF (handles UDAF and GenericUDAF), or HiveUDTF (handles GenericUDTF).

+
+ Syntax +

HiveUDF, HiveUDAF, HiveUDTF share the same syntax.

+ + + + +
 + HiveUDF(name[, constant parameters]) +
+
+
+ Terms + + + + + + + + + +
+ name
+ Hive UDF name. This can be the fully qualified class name of the Hive UDF/UDTF/UDAF class, or a short name registered in the Hive FunctionRegistry (as most Hive builtin UDFs are).

+ constant parameters
+ Optional tuple representing constant parameters of a Hive UDF/UDTF/UDAF. If a Hive UDF requires a constant parameter, there is no other way Pig can pass that information to Hive, since the Pig schema does not carry whether a parameter is constant or not. A null item in the tuple means the field is not a constant; a non-null item represents a constant field. The data type of the item is determined by the Pig constant parser.
+
+
+ Example +

HiveUDF

+ +define sin HiveUDF('sin'); +A = LOAD 'student' as (name:chararray, age:int, gpa:double); +B = foreach A generate sin(gpa); + +

HiveUDTF

+ +define explode HiveUDTF('explode'); +A = load 'mydata' as (a0:{(b0:chararray)}); +B = foreach A generate flatten(explode(a0)); + +

HiveUDAF

+ +define avg HiveUDAF('avg'); +A = LOAD 'student' as (name:chararray, age:int, gpa:double); +B = group A by name; +C = foreach B generate group, avg(A.age); + +
+

HiveUDAF with constant parameter

+ +define in_file HiveUDF('in_file', '(null, "names.txt")'); +A = load 'student' as (name:chararray, age:long, gpa:double); +B = foreach A generate in_file(name, 'names.txt'); + +

In this example, we pass (null, "names.txt") to the constructor of the UDF in_file, meaning the first parameter is regular and the second is a constant. names.txt can be double quoted (unlike other Pig syntax), or quoted in \'. Note that we need to pass 'names.txt' again in line 3. This looks redundant, but it is needed to bridge the semantic gap between Pig and Hive: we must also pass the constant through the data pipeline in line 3, as with a regular Pig UDF. Initialization code in a Hive UDF takes an ObjectInspector, which captures both the data type and whether or not the parameter is a constant. Initialization code in Pig takes a schema, which only captures the former, so we need an additional mechanism (the constructor parameter) to convey the latter.

+

Note: A few Hive 0.14 UDFs contain bugs that affect Pig; these are fixed in Hive 1.0. Here is the list: compute_stats, context_ngrams, count, ewah_bitmap, histogram_numeric, collect_list, collect_set, ngrams, case, in, named_struct, stack, percentile_approx.

+
Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/perf.xml
URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/perf.xml?rev=1733627&r1=1733626&r2=1733627&view=diff
==============================================================================
--- pig/branches/spark/src/docs/src/documentation/content/xdocs/perf.xml (original)
+++ pig/branches/spark/src/docs/src/documentation/content/xdocs/perf.xml Fri Mar 4 18:17:39 2016
@@ -47,10 +47,11 @@
Automatic parallelism

Just like MapReduce, if the user specifies "parallel" in a Pig statement, or defines default_parallel in Tez mode, Pig will honor it (the only exception is that if the user specifies a parallelism which is clearly too low, Pig will override it)
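As a quick illustration (relation and file names here are hypothetical), parallelism can be set per operator with PARALLEL or script-wide with default_parallel, and Pig honors either:

```pig
-- Script-wide default for reduce-side parallelism
SET default_parallel 10;

A = LOAD 'myinput' AS (t, u, v);
-- Per-operator setting; overrides default_parallel for this GROUP only
B = GROUP A BY t PARALLEL 18;
```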

-

If user specify neither "parallel" or "default_parallel", Pig will use automatic parallelism. In MapReduce, Pig submit one MapReduce job a time and before submiting a job, Pig has chance to automatically set reduce parallelism based on the size of input file. On the contrary, Tez submit a DAG as a unit and automatic parallelism is managed in two parts

+

If the user specifies neither "parallel" nor "default_parallel", Pig uses automatic parallelism. In MapReduce, Pig submits one MapReduce job at a time, and before submitting a job Pig has the chance to automatically set the reduce parallelism based on the size of the input file. In contrast, Tez submits a DAG as a unit, and automatic parallelism is managed in three parts

• Before submitting a DAG, Pig statically estimates the parallelism of each vertex based on the input file size of the DAG and the complexity of the pipeline of each vertex
• -
• At runtime, Tez adjust vertex parallelism dynamically based on the input data volume of the vertex. Note currently Tez can only decrease the parallelism dynamically not increase. So in step 1, Pig overestimate the parallelism
• +
• As the DAG progresses, Pig adjusts the parallelism of vertices with the best knowledge available at that moment (Pig grace parallelism)
• +
• At runtime, Tez adjusts vertex parallelism dynamically based on the input data volume of the vertex. Note that currently Tez can only decrease the parallelism dynamically, not increase it, so in steps 1 and 2 Pig overestimates the parallelism

The following parameters control the behavior of automatic parallelism in Tez (shared with MapReduce):

@@ -492,7 +493,7 @@ Gtab = .... aggregation function STORE Gtab INTO '/user/vxj/finalresult2'; -

To make the script works, add the exec statement.

+

To make the script work, add the exec statement.

A = LOAD '/user/xxx/firstinput' USING PigStorage(); @@ -517,6 +518,11 @@ Ftab = group .... Gtab = .... aggregation function STORE Gtab INTO '/user/vxj/finalresult2'; + +

If the STORE and LOAD both use exactly matching file paths, Pig will recognize the implicit dependency and launch two different MapReduce jobs/Tez DAGs, with the second job depending on the output of the first. exec is not required to be specified in that case.
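A minimal sketch of that case, reusing the input path from the example above (the intermediate path is hypothetical): because the intermediate STORE and LOAD paths match exactly, Pig infers the dependency and no exec is needed:

```pig
A = LOAD '/user/xxx/firstinput' USING PigStorage();
B = FILTER A BY $0 is not null;
STORE B INTO '/user/xxx/intermediate' USING PigStorage();

-- Same literal path as the STORE above, so Pig chains the two jobs
C = LOAD '/user/xxx/intermediate' USING PigStorage();
D = GROUP C BY $0;
STORE D INTO '/user/xxx/finalresult' USING PigStorage();
```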

+
@@ -978,11 +984,11 @@ B = GROUP A BY t PARALLEL 18;

In this example all the MapReduce jobs that get launched use 20 reducers.

SET default_parallel 20;
-A = LOAD âmyfile.txtâ USING PigStorage() AS (t, u, v);
+A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
B = GROUP A BY t;
C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
D = ORDER C BY mycount;
-STORE D INTO âmysortedcountâ USING PigStorage();
+STORE D INTO 'mysortedcount' USING PigStorage();

@@ -1286,9 +1292,9 @@ C = JOIN A BY a1, B BY b1, C BY c1 USING
• Data must come directly from either a Load or an Order statement.
• There may be filter statements and foreach statements between the sorted data source and the join statement. The foreach statement should meet the following conditions:
-
• There should be no UDFs in the foreach statement.
• The foreach statement should not change the position of the join keys.
• -
• There should be no transformation on the join keys which will change the sort order.
• +
• There should be no transformation on the join keys which will change the sort order.
• +
• UDFs also have to adhere to the previous condition and should not transform the JOIN keys in a way that would change the sort order.
• Data must be sorted on join keys in ascending (ASC) order on both sides.
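A minimal sketch of a merge join that satisfies the conditions above (file and field names are hypothetical; both inputs are assumed to be pre-sorted ascending on the join key):

```pig
-- Both inputs already sorted ascending on key k
A = LOAD 'sorted_left' AS (k:int, u:chararray);
B = LOAD 'sorted_right' AS (k:int, w:chararray);
-- A filter between the load and the join is allowed,
-- as long as it does not transform the join key
A1 = FILTER A BY k > 0;
C = JOIN A1 BY k, B BY k USING 'merge';
```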
•

Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/pig-index.xml
URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/pig-index.xml?rev=1733627&r1=1733626&r2=1733627&view=diff
==============================================================================
--- pig/branches/spark/src/docs/src/documentation/content/xdocs/pig-index.xml (original)
+++ pig/branches/spark/src/docs/src/documentation/content/xdocs/pig-index.xml Fri Mar 4 18:17:39 2016
@@ -819,7 +819,7 @@

REGEX_EXTRACT_ALL function

-

REGISTER statement

+

REGISTER statement

regular expressions. See pattern matching

Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/start.xml URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/start.xml?rev=1733627&r1=1733626&r2=1733627&view=diff ============================================================================== --- pig/branches/spark/src/docs/src/documentation/content/xdocs/start.xml (original) +++ pig/branches/spark/src/docs/src/documentation/content/xdocs/start.xml Fri Mar 4 18:17:39 2016 @@ -249,7 +249,7 @@ grunt> A = load 'passwd' using PigStorage(':'); -- load the passwd file B = foreach A generate \$0 as id; -- extract the user IDs -store B into âid.outâ; -- write the results to a file name id.out +store B into 'id.out'; -- write the results to a file name id.out

Local Mode

Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/tabs.xml
URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/tabs.xml?rev=1733627&r1=1733626&r2=1733627&view=diff
==============================================================================
--- pig/branches/spark/src/docs/src/documentation/content/xdocs/tabs.xml (original)
+++ pig/branches/spark/src/docs/src/documentation/content/xdocs/tabs.xml Fri Mar 4 18:17:39 2016
@@ -32,6 +32,6 @@
-->
-
+

Modified: pig/branches/spark/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/pig/branches/spark/src/docs/src/documentation/content/xdocs/udf.xml?rev=1733627&r1=1733626&r2=1733627&view=diff
==============================================================================
--- pig/branches/spark/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ pig/branches/spark/src/docs/src/documentation/content/xdocs/udf.xml Fri Mar 4 18:17:39 2016
@@ -1752,7 +1752,7 @@
function helloworld() {
 return 'Hello, World';
}
-complex.outputSchema = "word:chararray,num:long";
+complex.outputSchema = "(word:chararray,num:long)";
function complex(word){
 return {word:word, num:word.length};
}
@@ -1760,8 +1760,8 @@ function complex(word){

This Pig script registers the JavaScript UDF (udf.js).