pig-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)
Date Tue, 04 Sep 2012 03:57:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447471#comment-13447471 ]

Prasanth J commented on PIG-2831:

One more thing I forgot to mention: since statistics gathering is now implemented as a
runtime operation, we can no longer fall back to naive cubing when we detect a small dataset.
Earlier, we estimated the total number of rows at compile time and used that estimate to
choose between the mrcube approach and the naive approach. We need to give the user a way to
disable the mrcube approach for smaller datasets, because naive cubing on a small dataset is
much faster than mrcube: mrcube takes 4 MR jobs, whereas naive cubing can be done in a single
job. Should we provide a Pig property for enabling/disabling the mrcube approach?

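
To make the question concrete, here is a minimal sketch of how such a switch might behave. It is written in Python purely for illustration and is not Pig code. The property name `pig.cube.mrcube.enabled` and the default of "enabled" are assumptions; the thread does not name a property or a default. The job counts are the ones quoted in the comment above.

```python
# Illustrative sketch only: not Pig's implementation.
# ASSUMPTION: the property name below is hypothetical; the thread names none.
MRCUBE_ENABLED_KEY = "pig.cube.mrcube.enabled"

# MR job counts as stated in the comment (4 for mrcube, 1 for naive cubing).
JOBS_PER_STRATEGY = {"mrcube": 4, "naive": 1}


def parse_bool(value, default):
    """Interpret a property string as a boolean, falling back to a default."""
    if value is None:
        return default
    return str(value).strip().lower() in ("true", "1", "yes")


def choose_cube_strategy(properties):
    """Pick the cubing strategy from user-supplied properties.

    ASSUMPTION: mrcube stays on unless the user explicitly disables it.
    """
    enabled = parse_bool(properties.get(MRCUBE_ENABLED_KEY), default=True)
    return "mrcube" if enabled else "naive"


if __name__ == "__main__":
    for props in ({}, {MRCUBE_ENABLED_KEY: "false"}):
        strategy = choose_cube_strategy(props)
        print(props, "->", strategy, "jobs:", JOBS_PER_STRATEGY[strategy])
```

With a switch like this, the decision that used to come from the compile-time row estimate is handed to the user, who presumably knows when the input is small.
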
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, PIG-2831.3.git.patch,
>                      PIG-2831.4.git.patch, PIG-2831.5.git.patch
> Implementing distributed cube materialization for holistic measures, based on the MR-Cube
> approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (can be detected automatically in a few cases; if
>    automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube algorithm and
>    generates an annotated cube lattice (contains large group partitioning information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of actual cube
>    materialization job
> 7) OOM exception handling
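
For readers unfamiliar with the terms in the steps above, the following is a small, single-process sketch of naive cubing with a holistic measure. It is an illustration in Python, not the code from the attached patches. The sample rows, the column names, the choice of distinct count as the measure, and the use of `None` as the "all values" marker are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def cube_regions(dimensions):
    """Return every subset of the dimensions: the 2^n regions of the cube lattice."""
    return [subset
            for size in range(len(dimensions) + 1)
            for subset in combinations(dimensions, size)]


def naive_cube_distinct(rows, dimensions, measure):
    """Naive cubing: each row contributes to one group in every region.

    The measure is a distinct count, used here as an example of a holistic
    measure. None marks a rolled-up ("all values") dimension (an assumption
    of this sketch).
    """
    regions = cube_regions(dimensions)
    seen = defaultdict(set)
    for row in rows:
        for region in regions:
            key = tuple(row[d] if d in region else None for d in dimensions)
            seen[key].add(row[measure])
    return {key: len(values) for key, values in seen.items()}


# Hypothetical sample data, invented for illustration.
ROWS = [
    {"country": "US", "city": "NYC", "user": "a"},
    {"country": "US", "city": "NYC", "user": "b"},
    {"country": "US", "city": "SF", "user": "a"},
    {"country": "IN", "city": "BLR", "user": "c"},
]
DIMENSIONS = ("country", "city")

if __name__ == "__main__":
    result = naive_cube_distinct(ROWS, DIMENSIONS, "user")
    for key in sorted(result, key=str):
        print(key, result[key])
```

In this sample, the group `("US", None)` has 2 distinct users, while its children `("US", "NYC")` and `("US", "SF")` have 2 and 1. The parent cannot be derived by adding the children, which is why such a measure cannot simply be combined from partial aggregates and why large groups need the partitioning information mentioned in step 3.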

