hadoop-pig-dev mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: [jira] Updated: (PIG-1309) Map-side Cogroup
Date Fri, 03 Sep 2010 11:28:59 GMT

Condition (1) refers only to explicit (user-specified) statements, right?
Not to an implicit project introduced by Pig to conform to the schema?
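
To make the question concrete, here is a hypothetical script (the FILTER
step and the alias A1 are illustrative assumptions, not taken from the
release note). As I read condition (1), the explicit FILTER between the load
and the cogroup would rule out the merge cogroup here:

```
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
-- explicit, user-specified operation between load and cogroup
A1 = FILTER A by id > 0;
C = COGROUP A1 by id, B by id using 'merge';
```

The case I am asking about is the other one: the script contains no such
statement, but Pig itself adds a project after the load.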


Regards,
Mridul


On Saturday 21 August 2010 12:59 AM, Ashutosh Chauhan (JIRA) wrote:
>
>       [ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Ashutosh Chauhan updated PIG-1309:
> ----------------------------------
>
>      Release Note:
> With this patch, it is now possible to perform a map-side cogroup if the data is sorted
> and the loader implements certain interfaces. The primary algorithm is based on sort-merge
> join, with additional restrictions.
>
> Following preconditions must be met to use this feature:
> 1) No other operations can be done between load and cogroup statements.
> 2) Data must be sorted on join keys for all tables in ASC order.
> 3) Nulls are considered smaller than everything. So, if data contains null keys, they
> should occur before anything else.
> 4) Left-most loader must implement {CollectableLoader} interface as well as {OrderedLoadFunc}.
> 5) All other loaders must implement IndexableLoadFunc.
> 6) Type information must be provided in schema for all the loaders.
>
> Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.
>
> Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.
>
> Example:
> A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
> B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
> C = COGROUP A by id, B by id using 'merge';
>
>
>    was:
> With this patch, it is now possible to perform a map-side cogroup if the data is sorted
> and the loader implements certain interfaces. The primary algorithm is based on sort-merge
> join, with additional restrictions.
>
> Following preconditions must be met to use this feature:
> 1) No other operations can be done between load and join statements.
> 2) Data must be sorted on join keys for all tables in ASC order.
> 3) Nulls are considered smaller than everything. So, if data contains null keys, they
> should occur before anything else.
> 4) Left-most loader must implement {CollectableLoader} interface as well as {OrderedLoadFunc}.
> 5) All other loaders must implement IndexableLoadFunc.
> 6) Type information must be provided in schema for all the loaders.
>
> Note that the Zebra loader satisfies all of these conditions, so it can be used out of the box.
>
> Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as well.
>
> Example:
> A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
> B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
> C = COGROUP A by id, B by id using 'merge';
>
>
>
>> Map-side Cogroup
>> ----------------
>>
>>                  Key: PIG-1309
>>                  URL: https://issues.apache.org/jira/browse/PIG-1309
>>              Project: Pig
>>           Issue Type: Bug
>>           Components: impl
>>             Reporter: Ashutosh Chauhan
>>             Assignee: Ashutosh Chauhan
>>              Fix For: 0.7.0, 0.8.0
>>
>>          Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch
>>
>>
>> In the never-ending quest to make Pig go faster, we want to parallelize as many relational
>> operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845,
>> PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup
>> in Pig. Details to follow.
>

