hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1264) Skewed join sampler misses out the key with the highest frequency
Date Fri, 05 Mar 2010 19:38:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841984#action_12841984 ]

Richard Ding commented on PIG-1264:

Somehow, during the many merges between trunk and the LSR branch, the random sampler ended up
being used in place of the Poisson sampler for skewed join. This was corrected with PIG-1273.

> Skewed join sampler misses out the key with the highest frequency
> -----------------------------------------------------------------
>                 Key: PIG-1264
>                 URL: https://issues.apache.org/jira/browse/PIG-1264
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Sriranjan Manjunath
>            Assignee: Richard Ding
>             Fix For: 0.7.0
> I am noticing two issues with the sampler used in skewed join:
> 1. It does not allocate multiple reducers to the key with the highest frequency.
> 2. It seems to be allocating the same number of reducers to every key (8 in this case).
> Query:
> a = load 'studenttab10k' using PigStorage() as (name, age, gpa);
> b = load 'votertab10k' as (name, age, registration, contributions);
> e = join a by name right, b by name using "skewed" parallel 8;
> store e into 'SkewedJoin_9.out';
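
For readers unfamiliar with the behavior the reporter expected, here is a rough sketch of frequency-proportional reducer allocation. It is an illustration only, not Pig's actual sampler or partitioner: the function name `allocate_reducers`, the `max_per_reducer` threshold, and the sample data are all hypothetical. It shows the two properties the report says were missing: the most frequent key gets several reducers, and different keys get different reducer counts.

```python
import math
from collections import Counter


def allocate_reducers(sample_keys, parallel, max_per_reducer):
    """Map each skewed key to a reducer count (hypothetical helper).

    sample_keys:     join keys seen in a sample of the skewed input
    parallel:        total reducers available (the PARALLEL value)
    max_per_reducer: assumed number of sampled rows one reducer can hold
    """
    counts = Counter(sample_keys)
    allocation = {}
    for key, n in counts.items():
        needed = math.ceil(n / max_per_reducer)
        # Keys that fit on a single reducer need no special handling.
        if needed > 1:
            # Never hand out more reducers than the job has.
            allocation[key] = min(needed, parallel)
    return allocation


# Made-up sample: one very hot key, one mildly hot key, one rare key.
sample = ["alice"] * 90 + ["bob"] * 35 + ["carol"] * 5
print(allocate_reducers(sample, parallel=8, max_per_reducer=20))
# prints {'alice': 5, 'bob': 2}
```

In this sketch a sampler that misses the hottest key simply leaves it out of `counts`, so it never receives extra reducers, which matches the first symptom in the report.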

