pig-dev mailing list archives

From "Prasanth J (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2167) CUBE operation in Pig
Date Wed, 21 Mar 2012 23:46:22 GMT

    [ https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235214#comment-13235214 ]

Prasanth J commented on PIG-2167:
---------------------------------

Hello everyone

I am Prasanth Jayachandran, a graduate student at The Ohio State University. I am working with
Prof. Arnab Nandi on providing CUBE operator support in Pig. I am submitting the initial
version of the CUBE operator implementation (a naive version of cube materialization). As this
is my first patch submission, I am really excited about it and hope to continue contributing
to Apache Pig. Please review the attached patch and provide feedback for improving it.

The following explains the design decisions and gives some initial performance numbers (experiments
performed on a single-node pseudo-distributed Hadoop setup).

*Pig syntax for Cubing*
CUBE rel BY (a,b,c);

*SQL/Oracle syntax for Cubing*
GROUP BY CUBE (a,b,c);

*CUBE operator internals*
The CUBE operator injects the logical plan for the following operators:
x = FOREACH rel GENERATE FLATTEN(CubeDimensions(a,b,c));
y = GROUP x BY (a,b,c);
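
As a rough illustration of what these two operators compute, the sketch below simulates them in Python rather than Pig. It assumes that CubeDimensions emits one tuple per subset of the dimensions, with None (Pig's null) standing for "all", as the issue description suggests. The function names cube_dimensions and cube are hypothetical and are not taken from the patch.

```python
from collections import defaultdict
from itertools import product


def cube_dimensions(dims):
    """Yield 2^n tuples: each dimension value is either kept or replaced by None ("all")."""
    for mask in product((True, False), repeat=len(dims)):
        yield tuple(v if keep else None for v, keep in zip(dims, mask))


def cube(rel):
    """Simulate FOREACH ... FLATTEN(CubeDimensions(...)) followed by GROUP ... BY."""
    # Step 1: flatten every input tuple into all of its dimension combinations.
    x = [t for row in rel for t in cube_dimensions(row)]
    # Step 2: group the flattened tuples by their (possibly None) dimension values.
    y = defaultdict(list)
    for t in x:
        y[t].append(t)
    return dict(y)


rel = [("x", 1), ("x", 2), ("y", 1)]
y = cube(rel)
for group, bag in sorted(y.items(), key=lambda kv: repr(kv[0])):
    print(group, len(bag))
```

With n dimensions every input tuple is expanded into 2^n tuples before grouping, which is why this is the naive form of cube materialization.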

*What is the output schema of CUBE operator?*
{group: tuple(a,b,c), cube: bag{(dimensions::a,dimensions::b,dimensions::c)}}

*Why syntactically different from SQL/Oracle?*
- Easier to implement as it does not modify or break the existing GROUP BY operator implementation
- CUBE operator might require separate flags for the following
	- Switching between BUC and STAR cubing (future optimization)
	- HAVING clause for monotonic operations
	- ROLLUP/DRILLDOWN operations 
	- Hinting the location of a partially computed CUBE
	- User-specified inputs (for example, the user can specify an algebraic attribute for
converting a holistic measure to a partially algebraic measure)
- Some operations applicable to the GROUP operator are not applicable to CUBE
	- Constant expression evaluation 
	- Duplicate column projection
- Follows the Pig language design principle of procedural simplicity

*Corner case handling*
Constant expressions can be provided in the GROUP BY operator. Support for constant expressions
has been removed from the CUBE BY operator grammar. If constant expressions are used with CUBE BY,
a FrontEndException is thrown.
Duplicate column projection is supported in GROUP BY. Duplicate columns will be eliminated
while generating the logical plan. The current implementation ignores duplicate dimensions in CUBE
BY. This can also be changed to throw an exception if the user repeats a dimension.
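
The two possible behaviours for repeated dimensions can be sketched as follows. This is a Python illustration under assumptions, not the patch's Java code: the helper name normalize_dimensions and its strict flag are hypothetical, and ValueError only stands in for whatever exception the front end would throw.

```python
def normalize_dimensions(dims, strict=False):
    """Return dims with repeats dropped (first occurrence wins, order preserved).

    With strict=True a repeated dimension raises instead of being ignored.
    """
    seen = set()
    unique = []
    for d in dims:
        if d in seen:
            if strict:
                raise ValueError(f"duplicate cube dimension: {d}")
            continue  # current behaviour described above: silently ignore the repeat
        seen.add(d)
        unique.append(d)
    return unique


print(normalize_dimensions(["a", "b", "a", "c"]))
```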

If the cube dimensions are a subset of the columns in the input schema, the remaining columns in the
input schema will be pushed to the "cube" bag.
For example:
inp = LOAD '/pig/data/input' AS (a,b,c,d);
x = CUBE inp BY (a,b);
The schema of x will be {group: tuple(a,b), cube: bag{(dimensions::a,dimensions::b,c,d)}}
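
A small Python simulation of this case, assuming the non-dimension columns are carried along unchanged next to each expanded dimension tuple and that None stands for "all". The names expand and cube_by and the sample data are hypothetical.

```python
from itertools import product


def expand(row, n_dims):
    """Expand the first n_dims columns into all kept/None combinations; keep the rest as-is."""
    dims, rest = row[:n_dims], row[n_dims:]
    for mask in product((True, False), repeat=n_dims):
        yield tuple(v if keep else None for v, keep in zip(dims, mask)) + rest


def cube_by(rows, n_dims):
    """Group the expanded tuples by their dimension columns; bags hold the full tuples."""
    groups = {}
    for row in rows:
        for t in expand(row, n_dims):
            groups.setdefault(t[:n_dims], []).append(t)
    return groups


# Columns are (a, b, c, d); a and b are the cube dimensions.
inp = [(1, 2, "c1", "d1"), (1, 3, "c2", "d2")]
x = cube_by(inp, 2)
print(x[(1, None)])
```

Each bag tuple has four fields, matching the (dimensions::a, dimensions::b, c, d) layout in the schema above.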

*Performance*
*Apache Pig Test Environment*
OS: Ubuntu 11.04 running as guest OS in Virtual Box
CPU Cores: 2 (4 Threads)
Memory: 8GB
HDD: 100GB 
Mode: Single-node pseudo-distributed setup running Hadoop 0.20.2
Configuration: Default Hadoop configuration

*SQL Server 2008 R2 Test Environment*
OS: Windows 7
CPU Cores: 4 (8 Threads)
Memory: 16GB
HDD: 500GB
!Pig-Cubing-Performance.png!

*Acknowledgements*
Professor Arnab Nandi, Department of Computer Science and Engineering, The Ohio State University,
Columbus for guidance and assistance throughout the course of this initial implementation.
Chaitanya Solarpurikar, Graduate Student, Department of Computer Science and Engineering,
The Ohio State University, Columbus for setting up the SQL Server test environment and running
the performance comparison experiments.
                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>              Labels: gsoc2012
>         Attachments: Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- which in addition to all dim1-2-3, produces aggregations for just dim1, just dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, and work up from there.
> This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

