pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
Date Mon, 03 Sep 2012 02:00:07 GMT

    [ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447086#comment-13447086 ]

Dmitriy V. Ryaboy commented on PIG-2661:
----------------------------------------

Ok, for TestSkewedJoin, I think I know what's going on but not how to fix it.

Here's the explain plan for the sampler job after this patch:

{code}
MapReduce node scope-24
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-27
|   |
|   Constant(all) - scope-26
|
|---New For Each(true,true)[tuple] - scope-25
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-22
    |   |
    |   |---Project[tuple][*] - scope-21
    |
    |---A: New For Each(false,false,false)[bag] - scope-55
        |   |
        |   Project[bytearray][0] - scope-52
        |   |
        |   Project[bytearray][1] - scope-53
        |   |
        |   Project[bytearray][2] - scope-54
        |
        |---Load(hdfs://localhost:58995/user/dmitriy/SkewedJoinInput1.txt:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.builtin.PigStorage','100')) - scope-23--------
{code}

Here are the corresponding bits prior to the patch:

{code}

MapReduce node scope-18
Map Plan
Store(hdfs://localhost:59383/tmp/temp220048876/tmp99560328:org.apache.pig.impl.io.InterStorage) - scope-20
|
|---A: New For Each(false,false,false)[bag] - scope-7
    |   |
    |   Project[bytearray][0] - scope-1
    |   |
    |   Project[bytearray][1] - scope-3
    |   |
    |   Project[bytearray][2] - scope-5
    |
    |---A: Load(hdfs://localhost:59383/user/dmitriy/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-0--------
Global sort: false
----------------

MapReduce node scope-24
Map Plan
Local Rearrange[tuple]{tuple}(false) - scope-27
|   |
|   Constant(all) - scope-26
|
|---New For Each(true,true)[tuple] - scope-25
    |   |
    |   Project[bytearray][0] - scope-14
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-22
    |   |
    |   |---Project[tuple][*] - scope-21
    |
    |---Load(hdfs://localhost:59383/tmp/temp220048876/tmp99560328:org.apache.pig.impl.builtin.PoissonSampleLoader('org.apache.pig.impl.io.InterStorage','100'))

{code}

What's happening is that the foreach that generates the first 3 columns, which Pig now adds
to make sure types etc. work, sits between the sample loader and the GetMemNumRows UDF.
PoissonSampleLoader appends a couple of extra columns to the last tuple it outputs, carrying
some stats about the dataset it saw. With the projection between the loader and
GetMemNumRows, those extra columns get dropped. GetMemNumRows then breaks down completely: it
assumes that each sample occurs 0 times, and the whole skewed join just turns into a regular
join. We have to either get rid of the foreach, or add the columns that PoissonSampleLoader
appends to the foreach.
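
To make the failure mode concrete, here is a toy model in Python. It is only a sketch and not Pig's code: the marker string, the tuple layout, and all function names are assumptions made up for illustration. It shows a loader that appends stats to its last tuple, a 3-column projection that silently drops them, and a variant of the projection that passes the trailing columns through (the second of the two fixes mentioned above).

```python
# Toy model only: the marker string, tuple layout and function names are
# assumptions for illustration, not Pig's actual implementation.
ROW_COUNT_MARKER = "__row_count_marker__"  # hypothetical sentinel value


def sample_loader(rows):
    """Yield every row; append (marker, total row count) to the last one."""
    rows = list(rows)
    for i, row in enumerate(rows):
        if i == len(rows) - 1:
            yield row + (ROW_COUNT_MARKER, len(rows))
        else:
            yield row


def project_first_three(tuples):
    """Stand-in for the inserted foreach: keep only columns 0..2."""
    for t in tuples:
        yield t[:3]


def project_keep_stats(tuples):
    """Alternative foreach: keep columns 0..2 plus any trailing stats columns."""
    for t in tuples:
        yield t[:3] + t[3:]


def read_row_count(tuples):
    """Stand-in for the stats-reading UDF: return the count, or 0 if the marker is gone."""
    last = None
    for last in tuples:
        pass  # walk to the final tuple
    if last is not None and len(last) >= 2 and last[-2] == ROW_COUNT_MARKER:
        return last[-1]
    return 0


rows = [("a", "1", "x"), ("b", "2", "y"), ("c", "3", "z")]
direct = read_row_count(sample_loader(rows))
projected = read_row_count(project_first_three(sample_loader(rows)))
fixed = read_row_count(project_keep_stats(sample_loader(rows)))
print(direct, projected, fixed)
```

In this model the count survives when the reader sits directly on the loader or behind the stats-preserving projection, and comes back as 0 behind the plain 3-column projection, which mirrors the "each sample occurs 0 times" symptom.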
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch, PIG-2661.3.patch, PIG-2661.4.patch, PIG-2661.5.patch, PIG-2661.6.patch, PIG-2661.7.patch, PIG-2661.8.patch, PIG-2661.plan.txt
>
>
> See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
