Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by PiSong:
http://wiki.apache.org/pig/PigOLAPSketch
New page:
= Pig OLAP Support =
== Background ==
Online analytical data processing is a powerful method for business data analysis. However,
the preprocessing of full OLAP cubes involves multiplications and aggregations of large volume
of data, thus being computationally expensive in nature. By allowing the process to be executed
on grid data processing framework such as MapReduce, this data analysis technique can be more
accessible for most business users.
== Goals ==
* We will allow population of full OLAP cubes
* We will allow tabular projection operations (1D  as Bag, 2D  as crosstab Bag)
The projection operator will be provided just for experimenting with data at this stage. Pig
currently doesn't have indexing infrastructure support so it can be too slow in practice.
My primary aim is to use Pig for cube processing then export to external system/frontend for
browsing.
== Implementation ==
Cube generation algorithm will be based on paper from Microsoft Research http://paul.rutgers.edu/~aminabdu/cs541/cube_op.pdf
but converted to MapReduce version.
We only work on flattened star, snowflake schema as generally do.
==== Map Phase ====
Map:( k1 , v1 ) > list( k2 , v2 )
{{{
Map Input: ((x1,x2,x3, ..., xn), (measure1, measure2))
Map Output:
Cn,1
((MAGICKEY,x2,x3, ..., xn), (measure1, measure2)1)
((x1,MAGICKEY,x3, ..., xn), (measure1, measure2)1)
...
((x1, x2, x3, ..., MAGICKEY), (measure1, measure2)1)
Cn,2
((MAGICKEY,MAGICKEY,x3, ..., xn1, xn), (measure1, measure2)1)
((MAGICKEY, x2, MAGICKEY, ..., xn1, xn), (measure1, measure2)1)
...
((x1, x2, x3, ...,MAGICKEY, MAGICKEY), (measure1, measure2)1)
...
Cn,n1
}}}
There will be 2^n1 output records from an input
where n = number of dimensions
==== Reduce Phase ====
Reduce: ( k2 , list ( v2 )) > list( v2 )
Basically just apply aggregate function(s) on all the measures. The logical view of output
(and obviously also a point in hypercube) will look like:
 aggregate1aggregate2aggregate3
measure1output1output2output3
measure2output4output5output6
measure3output7output8output9
== Language Design ==
We will allow the preprocessing of full OLAP cube. This will result in a new type called "Cube"
Analytical cube creation syntax
{{{
Name = CUBE Name
Dimension (ColumnName (,ColumnName)* )
MEASURE (ColumnName (,ColumnName)* )
AGGREGATE (FuncName (,FuncName)* )
[ MAGICKEY (Number) ]?
}}}
Neglecting "MAGICKEY" clause will result in automatically picked MAGICKEY (will be hardcoded)
Sample:
{{{
Y = CUBE X
DIMENSION (D1, D2, D3)
MEASURE (M1,M2)
AGGREGATE (SUM, AVG, UDF1)
MAGICKEY (2.718281828459045) ;
}}}
=== Cube browsing syntax ===
Only 1D and 2D can be displayed and understood easily by normal users so we are only interested
in 1D projection, and 2D pivoting of hypercubes.
==== Normal Projection ====
We model 1D cube browsing as a process of traversing a dimension/dimensionmember tree
For example,
D1 = {Sydney, Melbourne, Brisbane}
D2 = { Pi, Jon, Lap }
D3 = { Jan, Feb }
{{{
Root
/  \
D1 D2 D3 *
/  \
Syd Mel Bris
/ \
D2 D3 *
/  \
Pi Jon Lap

D1 *
/ \
Jan Feb
}}}
''Note'': Only levels with "*" can be viewed
/D1 would give:
Syd 1000
Mel 1200
Bris 1500
/D1/Syd/D2 would give:
(Under D1=Syd)
Pi  100
Jon  200
Lap  300
So, the syntax might look like:
{{{
Z = CUBEVIEW X AT /D1/Syd/D2 ;
}}}
where X is a cube
This command will result in a bag containing the aggregated data
==== Pivoting ====
Basically we just want to be able to define two dimensions for pivoting. Dimensions appeared
up in the dimension tree are not available in the lower level.
{{{
Z = CUBEVIEW X AT /D1/Syd/(D2,D3) ;
}}}
Output will look like this (in Bag):
(Under D1=Syd)
 Jan Feb
Pi 50 50
Jon 150 50
Lap 100 200
== Limitations ==
Initially, there will be no support for:
* Nondiscrete dimension support. This will require histogram generation support in Pig.
* Hierarchical dimension. Hierarchical dimension that forms a lattice.
* No indexing support on cube browsing
