pig-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "PigLatinSchemas" by OlgaN
Date Wed, 07 Nov 2007 19:50:35 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:

New page:
=== Pig Latin Schemas ===

==== Defining a schema in a LOAD statement ====
The basic grammar for schema definition is taken from the JSON/Python
tuple/list/map definition, and is as follows:
field1 = Atom alias
name : (f1, f2, ...) = Tuple alias and schema
name : [f1, f2, ...] = Bag alias and schema of the tuples it contains

So the schema:
(time, query : (display, normalized), results :  [url, title, summary])

would define a Tuple where the first field is an Atom called "time",  
the second field is a Tuple called "query" with the Atom fields  
"display" and "normalized", and the third field is a Bag called  
"results", which contains tuples that have three Atom fields  
"url", "title" and "summary".
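
The reading of the example schema above can be sketched in a few lines of Python. This is only a toy model, not Pig code: the representation (plain strings for Atoms, `(name, tuple)` pairs for Tuples, `(name, list)` pairs for Bags) and the helper name `kind` are assumptions made for illustration.

```python
# Toy model of the schema notation (illustrative assumption, not Pig's implementation):
#   "name"              -> Atom alias
#   ("name", (f1, f2))  -> Tuple alias with a nested schema
#   ("name", [f1, f2])  -> Bag alias whose tuples have the nested schema

def kind(field):
    """Return (name, kind) for one field of the toy schema."""
    if isinstance(field, str):
        return field, "atom"
    name, nested = field
    return name, "bag" if isinstance(nested, list) else "tuple"

# (time, query : (display, normalized), results : [url, title, summary])
schema = (
    "time",
    ("query", ("display", "normalized")),
    ("results", ["url", "title", "summary"]),
)

for field in schema:
    print(kind(field))
```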

The "AS" keyword on a LOAD statement allows you to define a schema for a particular alias.
For example,

A = load 'input1' as (tstamp, cookie, query);
B = load 'input2' as (query, url, rank);

associates schemas with A and B.

==== Schema Propagation ====

The system will do its best to infer the schema for a derived alias based on the schemas of
the input aliases. 

Continuing with our running example, suppose we have
C = cogroup A by query, B by query;
Then C will be assigned the schema `(group, A: [tstamp, cookie, query], B: [query, url, rank])`.
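
The cogroup rule in this example can be modelled roughly as follows. The function name `cogroup_schema` and the dict-based representation are hypothetical; the sketch only mirrors the example (a `group` field followed by one bag per input alias) and is not how Pig itself is implemented.

```python
def cogroup_schema(inputs):
    """Toy schema inference for COGROUP.

    inputs maps each input alias to its list of field names
    (dict insertion order is the order of the cogroup inputs).
    """
    # The first field is the group key; every input becomes a bag
    # named after its alias and carrying that input's schema.
    return ["group"] + [(alias, list(fields)) for alias, fields in inputs.items()]

schemas = {
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
}
c_schema = cogroup_schema(schemas)
print(c_schema)
```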

==== Referring to Nested Fields, i.e., Nested Projection ====
You can refer to fields up to one level down in the nesting. Thus, in the above example, you
can say,
foreach C generate group, A.cookie

==== Name Ambiguity Resolution ====
Sometimes, when using FLATTEN, there might be name ambiguities in schemas from two different
inputs. Thus, if in the above example we write

D = foreach C generate flatten(A), flatten(B);

there will be a name ambiguity, since both flatten(A) and flatten(B) have the field `query`.
To avoid ambiguity in such cases, fields can be referred to by `<outer-alias>::fieldName`.
Thus for D, we can refer to either `A::query` or `B::query` but not to `query`.

However, unambiguous fields can be accessed either by their names or by `<outer-alias>::fieldName`.
Thus for D, both `url` and `B::url` will access the same field.
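
The resolution rule can be illustrated with a small Python sketch for the flattened alias D. The helpers `flatten_schema` and `resolve` are hypothetical names, and storing every field under its fully qualified name is an assumption of this sketch, not a statement about Pig's internals.

```python
def flatten_schema(bags):
    """bags: list of (alias, field_names). Returns qualified names such as 'A::query'."""
    return [f"{alias}::{name}" for alias, fields in bags for name in fields]

def resolve(schema, ref):
    """Return the position of the field that ref names; raise if it is ambiguous or unknown."""
    if "::" in ref:
        # Qualified reference: must match the full '<outer-alias>::fieldName'.
        matches = [i for i, q in enumerate(schema) if q == ref]
    else:
        # Bare reference: compare against the part after '::'.
        matches = [i for i, q in enumerate(schema) if q.split("::", 1)[1] == ref]
    if len(matches) != 1:
        raise ValueError(f"{ref!r} matches {len(matches)} fields")
    return matches[0]

d_schema = flatten_schema([
    ("A", ["tstamp", "cookie", "query"]),
    ("B", ["query", "url", "rank"]),
])

print(resolve(d_schema, "url") == resolve(d_schema, "B::url"))
```

In this model, `resolve(d_schema, "query")` raises a `ValueError`, while `A::query` and `B::query` each resolve to a single field.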

==== Assigning Names to Individual Items in GENERATE ====
Just like in SQL where you can give names to individual items in the select list, we can name
individual items in the generate clause using AS. Thus, in our example,
E = foreach D generate (cookie eq 'null' ? 'null' : url ) as nullifiedUrl, rank as myRank;
This will assign a schema `(nullifiedUrl, myRank)` to E.

==== Schemas of Functions ====
Eval functions can specify their own output schema by overriding the outputSchema() method.
The builtin function SUM specifies that its output is called `sum`. Thus,
F = foreach C generate group, SUM(A.tstamp);
F gets assigned the schema `(group, sum)`. This can of course be overridden, e.g., `generate
group, SUM(A.tstamp) as sumTstamp`.
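
The naming rule for generate items can be summarised in a short Python sketch: each item has a default name (for a function, the name its output schema declares) unless an AS clause supplies another. The function `generate_schema` and the `(default_name, as_name)` pair representation are assumptions made for illustration only.

```python
def generate_schema(items):
    """Toy naming rule for the items of a GENERATE clause.

    items: list of (default_name, as_name) pairs, where as_name is None
    when the item carries no AS clause.
    """
    return [as_name if as_name is not None else default_name
            for default_name, as_name in items]

# generate group, SUM(...)               -> SUM's own output name is kept
print(generate_schema([("group", None), ("sum", None)]))
# generate group, SUM(...) as sumTstamp  -> the AS clause wins
print(generate_schema([("group", None), ("sum", "sumTstamp")]))
```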

==== Last Resort: Overriding system-inferred schemas ====
Sometimes the system cannot infer a schema (e.g., binconds, eval functions that don't specify
one). In these cases, and also in others when you want to override the system-inferred schema,
you can do so using the AS clause. Thus, you could say:
C = (cogroup A by query, B by query) as (group, foo, bar);
and C would be assigned the schema `(group, foo, bar)`.
