hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-450) PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.
Date Tue, 23 Sep 2008 21:19:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-450:
---------------------------

    Attachment: PIG-450.patch

This patch adds a combiner step to distinct that removes duplicate values, so that less
data is carried across from map to reduce.  Here are the resulting run times (all times
in seconds):

||Num records||Num keys||Num reducers||1.4 || 2.0 || 2.0 with this patch ||
| 200M | 60 | 1 | 2547 | 1388 | 142 |
| 200M | 16M | 50 | 384 | 227 | 231 |

The main benefit is with a small number of keys, but there does not appear to be a penalty
with a larger number of keys.
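
To illustrate why the combiner helps most when there are few keys, here is a minimal
sketch in plain Python. It is not the code from PIG-450.patch and does not use the Hadoop
or Pig APIs. The function names (map_phase, combine, reduce_phase, run_distinct) and the
sample splits are hypothetical. It assumes each map task emits the record as the key with
an empty tuple as the value, and that the combiner runs once over each map task's output.

```python
EMPTY = ()  # stands in for the empty tuple passed along with each key


def map_phase(records):
    """Emit (record, EMPTY): the record itself is the key, the value is empty."""
    return [(rec, EMPTY) for rec in records]


def combine(map_output):
    """Keep only the first empty tuple per key within one map task's output."""
    seen = set()
    combined = []
    for key, value in map_output:
        if key not in seen:
            seen.add(key)
            combined.append((key, value))
    return combined


def reduce_phase(shuffled):
    """Emit each distinct key exactly once."""
    return sorted({key for key, _ in shuffled})


def run_distinct(splits, use_combiner):
    """Return (distinct keys, number of pairs carried from map to reduce)."""
    shuffled = []
    for split in splits:  # each split models the input of one map task
        out = map_phase(split)
        if use_combiner:
            out = combine(out)
        shuffled.extend(out)
    return reduce_phase(shuffled), len(shuffled)


# Hypothetical input: two map tasks, three distinct keys overall.
splits = [["a", "b", "a", "a"], ["b", "b", "c", "a"]]

result_plain, shuffled_plain = run_distinct(splits, use_combiner=False)
result_comb, shuffled_comb = run_distinct(splits, use_combiner=True)

print(result_plain, shuffled_plain)  # ['a', 'b', 'c'] 8
print(result_comb, shuffled_comb)    # ['a', 'b', 'c'] 5
```

In this sketch the result is the same either way, but the combiner bounds what each map
task sends to the reducers by the number of distinct keys it saw rather than by its number
of records. When nearly every record is a distinct key, the combiner has little to remove.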



> PERFORMANCE:  Distinct should make use of combiner to remove duplicate values from keys.
> ----------------------------------------------------------------------------------------
>
>                 Key: PIG-450
>                 URL: https://issues.apache.org/jira/browse/PIG-450
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: types_branch
>
>         Attachments: PIG-450.patch
>
>
> In 2.0 distinct was improved by removing values in the map and just passing an empty
> tuple along with the key.  This can be further improved by adding a combiner step that passes
> along only the first empty tuple instead of all of them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

