pig-dev mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4504) Enable Secondary key sort feature in spark mode
Date Fri, 15 May 2015 13:01:00 GMT

     [ https://issues.apache.org/jira/browse/PIG-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-4504:
-----------------------------
       Resolution: Fixed
    Fix Version/s: spark-branch
           Status: Resolved  (was: Patch Available)

Committed to Spark branch. Thanks, Liyun!

> Enable Secondary key sort feature in spark mode
> -----------------------------------------------
>
>                 Key: PIG-4504
>                 URL: https://issues.apache.org/jira/browse/PIG-4504
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4504.patch, PIG-4504_2.patch, PIG-4504_3.patch, PIG-4504_4.patch,
PIG-4504_5.patch, PIG-4504_6.patch, PIG-4504_7.patch, SecondaryKeySort_design_doc (1).docx,
Why_need_split_PoLocalRearrange_POGlobalRearrange_POPackage_into_two_SparkNodes_in_sparkPlan.docx
>
>
> *Background on secondary key sort:*
> The MapReduce framework automatically sorts the keys generated by mappers. Before the
reducers start, all intermediate (key, value) pairs generated by mappers are sorted by key
(and not by value). The values passed to each reducer are not sorted and can arrive in any
order. To sort the values as well, each (key, value) pair is turned into a ((key, value), null)
pair. Here (key, value) is called the compound key: key is the first key and value is the
secondary key. During the shuffle, pairs with the same first key are sent to the same
partition by setting PartitionerClass in the JobConf. Pairs with the same first key but
different secondary keys are sorted during the shuffle by setting SortComparatorClass in the
JobConf, and they are passed to the same reduce function by setting GroupingComparatorClass
in the JobConf.
> *How does Pig implement secondary key sort in MapReduce mode?*
> In MR mode, Pig implements secondary key sort by setting GroupingComparatorClass, PartitionerClass,
and SortComparatorClass in [JobControlCompiler#getJob|https://github.com/kellyzly/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L915]
> *An example that uses secondary key sort:*
> TestAccumulator#testAccumWithSort
> Currently, the secondary key sort feature is not implemented in Spark mode.
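
To make the three roles described in the issue concrete, here is a minimal plain-Java sketch of the compound-key idea. It does not use the Hadoop or Pig APIs: the class SecondarySortSketch and its members partition, SORT, GROUP and shuffle are hypothetical names that only stand in for the partitioner, sort comparator and grouping comparator, and the secondary key is assumed to be an int for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of secondary key sort; every name here is hypothetical
// and none of it is taken from Pig or Hadoop.
class SecondarySortSketch {

    // Compound key: the original key is the first key,
    // the original value is the secondary key.
    static final class CompoundKey {
        final String first;
        final int secondary;

        CompoundKey(String first, int secondary) {
            this.first = first;
            this.secondary = secondary;
        }
    }

    // Role of the partitioner: route by first key only, so that all pairs
    // sharing a first key land in the same partition.
    static int partition(CompoundKey key, int numPartitions) {
        return Math.floorMod(key.first.hashCode(), numPartitions);
    }

    // Role of the sort comparator: order by first key, then by secondary key.
    static final Comparator<CompoundKey> SORT =
            Comparator.comparing((CompoundKey k) -> k.first)
                      .thenComparingInt(k -> k.secondary);

    // Role of the grouping comparator: keys with the same first key
    // compare as equal, so they end up in the same reduce call.
    static final Comparator<CompoundKey> GROUP =
            Comparator.comparing((CompoundKey k) -> k.first);

    // Simulates the reduce side of one partition: sort the compound keys,
    // then start a new group whenever the grouping comparator sees a change.
    static Map<String, List<Integer>> shuffle(List<CompoundKey> keys) {
        List<CompoundKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompoundKey previous = null;
        List<Integer> current = null;
        for (CompoundKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.first, current);
            }
            current.add(key.secondary);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompoundKey> keys = Arrays.asList(
                new CompoundKey("b", 3), new CompoundKey("a", 2),
                new CompoundKey("b", 1), new CompoundKey("a", 1));
        System.out.println(shuffle(keys)); // prints {a=[1, 2], b=[1, 3]}
    }
}
```

Each group reaches its simulated reduce call with the secondary keys already in order, which is the property the feature provides. A real backend would express the same three roles through its own partitioning and comparator hooks rather than an in-memory sort.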



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
