pig-dev mailing list archives

From "Rohini Palaniswamy (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PIG-4344) Add a testcase with CustomPartitioner that tests ordering within a reducer
Date Wed, 26 Nov 2014 17:26:12 GMT
Rohini Palaniswamy created PIG-4344:
---------------------------------------

             Summary: Add a testcase with CustomPartitioner that tests ordering within a reducer
                 Key: PIG-4344
                 URL: https://issues.apache.org/jira/browse/PIG-4344
             Project: Pig
          Issue Type: Bug
            Reporter: Rohini Palaniswamy


Some of our users use a CustomPartitioner with join or group by because they know their
data and know which keys to partition on. Since MapReduce delivers data sorted by key within
each reducer, they rely on that to have the data ordered as well.

For example:
grouped = group mydata by (hour, sortkey1, sortkey2, sortkey3) partition by MyCustomPartitioner
PARALLEL 24;

The custom partitioner sends hours 0-23 to partitions 0-23, which ensures that the data in
each partition is also sorted without having to do a separate order by.
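
A test for this behaviour has to check two things: every record for a given hour lands in the same partition, and the keys inside each partition arrive in sorted order. The sketch below simulates that in plain Java with no Hadoop or Pig dependency. The names HourPartitionSketch, Key, partitionFor and shuffle are hypothetical and do not come from the Pig code base; a real partitioner would instead extend org.apache.hadoop.mapreduce.Partitioner and be exercised through a Pig script.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HourPartitionSketch {

    /** Simplified stand-in for the composite group key (hour, sortkey1). */
    public static final class Key {
        public final int hour;
        public final String sortKey;

        public Key(int hour, String sortKey) {
            this.hour = hour;
            this.sortKey = sortKey;
        }

        @Override
        public String toString() {
            return "(" + hour + ", " + sortKey + ")";
        }
    }

    /** Assumed partitioning rule: hour h goes to partition h (mod numPartitions). */
    public static int partitionFor(int hour, int numPartitions) {
        return Math.floorMod(hour, numPartitions);
    }

    /** Routes each key to its partition, then sorts every partition by the full key. */
    public static Map<Integer, List<Key>> shuffle(List<Key> keys, int numPartitions) {
        Map<Integer, List<Key>> partitions = new TreeMap<>();
        for (Key k : keys) {
            int p = partitionFor(k.hour, numPartitions);
            partitions.computeIfAbsent(p, unused -> new ArrayList<>()).add(k);
        }
        Comparator<Key> keyOrder = Comparator
                .comparingInt((Key k) -> k.hour)
                .thenComparing(k -> k.sortKey);
        for (List<Key> bucket : partitions.values()) {
            bucket.sort(keyOrder);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Key> keys = List.of(
                new Key(1, "b"), new Key(0, "z"), new Key(1, "a"), new Key(0, "c"));
        // Print each partition with its keys in the order a reducer would see them.
        shuffle(keys, 24).forEach((p, bucket) -> System.out.println(p + " -> " + bucket));
    }
}
```

The sort step stands in for the ordering that the MapReduce shuffle provides; the test case requested here would assert the same property on the actual reducer output.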

With HCatStorer, this pattern will be used more, i.e.,
grouped = group mydata by (hour) partition by MyCustomPartitioner PARALLEL 24;
store grouped into 'mydb.mytable' using HCatStorer();
    instead of
store mydata into 'mydb.mytable' using HCatStorer();

where hour is the partition column. The extra group by above ensures that 1 file is created
per partition, instead of 24 files per partition that would have to be concatenated later to
save namespace.
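
The file-count argument can be illustrated with a small simulation. It assumes, as a simplification, that every task writes one output file for each distinct hour it sees; the class OutputFileCountSketch and the method filesForHour are hypothetical names used only for this sketch.

```java
import java.util.List;

public class OutputFileCountSketch {

    /**
     * Counts the files written for one hour, assuming each task writes one file
     * per distinct hour it handles. taskHours.get(t) lists the hour of every
     * record processed by task t.
     */
    public static int filesForHour(List<List<Integer>> taskHours, int hour) {
        int files = 0;
        for (List<Integer> hours : taskHours) {
            if (hours.contains(hour)) {
                files++;
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Without the group by: three tasks, each holding a mix of hours.
        List<List<Integer>> ungrouped =
                List.of(List.of(0, 1), List.of(1, 0), List.of(0, 1));
        // With the group by and partitioner: each hour is routed to exactly one task.
        List<List<Integer>> grouped =
                List.of(List.of(0, 0, 0), List.of(1, 1, 1), List.<Integer>of());

        System.out.println("ungrouped, files for hour 0: " + filesForHour(ungrouped, 0));
        System.out.println("grouped, files for hour 0: " + filesForHour(grouped, 0));
    }
}
```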




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
