pig-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)
Date Tue, 04 Sep 2012 02:38:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447460#comment-13447460 ]

Prasanth J commented on PIG-2831:
---------------------------------

Hi Dmitriy,

I have implemented the new inter storage with statistics gathering and a new sample loader, as
per your idea on RB. Attached is the new patch, which contains the following changes:
1) Added a new RichInterStorage, which implements the StoreMetadata and LoadMetadata interfaces for
storing and loading the statistics of intermediate data. RichInterStorage uses RichRecordReader and
RichInputFormat for reading intermediate data, and RichRecordWriter and RichOutputFormat for storing
it. RichRecordWriter and RichOutputFormat are the same as InterRecordWriter and InterOutputFormat;
the main difference is in RichRecordReader and RichInputFormat. RichInputFormat wraps all the
splits into one logical split so that only one mapper is used for loading the sample dataset.
2) CubeSampleLoader uses the underlying RichRecordReader to get random samples of data.
RichRecordReader opens at most 100 inner splits and chooses a random split each time it reads a tuple.
3) Changes to PigOutputCommitter for storing statistics. Statistics are stored at the end
of every commitTask(), once for each output partition. RichInterStorage takes care of loading
the statistics for the different partitions and aggregating them. The statistics hold
numberOfRows and avgInMemTupleSize for each partition (only these two values are required
for holistic cubing).
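
To make items 2 and 3 concrete, here is a rough Python sketch of the two ideas: reading each sample tuple from a randomly chosen open split, and combining per-partition statistics. It is not the code in the patch. The names PartitionStats, aggregate_stats and sample_from_splits are hypothetical, taking the first 100 splits is an assumption (the comment does not say how the splits are chosen), and the row-weighted average is one reasonable way to combine avgInMemTupleSize across partitions.

```python
import random
from dataclasses import dataclass
from typing import Any, List, Sequence

MAX_OPEN_SPLITS = 100  # cap on inner splits mentioned in item 2


@dataclass
class PartitionStats:
    """Statistics for one output partition (hypothetical container)."""
    number_of_rows: int
    avg_in_mem_tuple_size: float  # bytes per tuple


def aggregate_stats(parts: Sequence[PartitionStats]) -> PartitionStats:
    """Combine per-partition statistics into one summary.

    The average tuple size is weighted by row count, so a large
    partition contributes more than a small one.
    """
    total_rows = sum(p.number_of_rows for p in parts)
    if total_rows == 0:
        return PartitionStats(0, 0.0)
    total_bytes = sum(p.number_of_rows * p.avg_in_mem_tuple_size for p in parts)
    return PartitionStats(total_rows, total_bytes / total_rows)


def sample_from_splits(splits: Sequence[Sequence[Any]],
                       sample_size: int,
                       rng: random.Random) -> List[Any]:
    """Draw up to sample_size tuples, picking a random open split per read.

    Assumption: only the first MAX_OPEN_SPLITS splits are opened.
    """
    open_readers = [iter(s) for s in splits[:MAX_OPEN_SPLITS]]
    sample: List[Any] = []
    while open_readers and len(sample) < sample_size:
        idx = rng.randrange(len(open_readers))
        try:
            sample.append(next(open_readers[idx]))
        except StopIteration:
            open_readers.pop(idx)  # this split is exhausted; stop using it
    return sample
```

For example, aggregating a partition of 100 rows at 40 bytes per tuple with one of 300 rows at 80 bytes per tuple gives 400 rows at an average of 70 bytes, not the unweighted 60.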

This patch is considerably larger, mainly because most of the changes (at the logical layer)
come from an old formatting issue that I fixed in this patch. Sorry about that.

I have also updated the patch in RB. Please review it and let me know your feedback. I have
also left open some of the issues from your earlier review comments, since they need your
input.

                
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch, PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch
>
>
> Implementing distributed cube materialization for holistic measures, based on the MR-Cube approach described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify whether the measure is holistic
> 2) Determine the algebraic attribute (it can be detected automatically in a few cases; if automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify the MRPlan to insert a sampling job that executes the naive cube algorithm and generates the annotated cube lattice (which contains large-group partitioning information)
> 4) Modify the plan to distribute the annotated cube lattice to all mappers using the distributed cache
> 5) Execute the actual cube materialization on the full dataset
> 6) Modify the MRPlan to insert a post-processing job that combines the results of the actual cube materialization job
> 7) OOM exception handling
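
As an illustration of step 3 above, the following Python sketch runs a naive cube over a sample and annotates each region of the lattice with a partition factor. It is a simplification under stated assumptions, not the Pig implementation or the exact procedure from the MR-Cube paper: the function names, the max_rows_per_reducer parameter, and the rule "partition factor = estimated largest group size divided by reducer capacity, rounded up" are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations
from math import ceil
from typing import Dict, List, Sequence, Tuple

Region = Tuple[str, ...]


def cube_regions(dimensions: Sequence[str]) -> List[Region]:
    """Return all 2^n group-by subsets (regions), most detailed first."""
    n = len(dimensions)
    return [combo
            for k in range(n, -1, -1)
            for combo in combinations(dimensions, k)]


def annotate_lattice(sample: Sequence[Dict[str, str]],
                     dimensions: Sequence[str],
                     total_rows: int,
                     max_rows_per_reducer: int) -> Dict[Region, int]:
    """Estimate the largest group per region and derive a partition factor.

    Group sizes seen in the sample are scaled up to the full dataset.
    A factor of 1 means no group in the region is expected to overload
    a reducer; a larger factor means large groups should be split.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    scale = total_rows / len(sample)
    lattice: Dict[Region, int] = {}
    for region in cube_regions(dimensions):
        # Naive cube on the sample: count rows per group in this region.
        counts = Counter(tuple(row[d] for d in region) for row in sample)
        estimated_largest = max(counts.values()) * scale
        lattice[region] = max(1, ceil(estimated_largest / max_rows_per_reducer))
    return lattice
```

In this sketch the empty region (the grand total) always gets the highest factor, because every row falls into its single group. The resulting dictionary plays the role of the annotated lattice that step 4 would ship to the mappers.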

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
