pig-dev mailing list archives

From "Cheolsoo Park (JIRA)" <j...@apache.org>
Subject [jira] [Created] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number
Date Sun, 31 Aug 2014 22:11:20 GMT
Cheolsoo Park created PIG-4148:
----------------------------------

             Summary: Tez order-by is often skewed because FindQuantiles UDF is called with small number
                 Key: PIG-4148
                 URL: https://issues.apache.org/jira/browse/PIG-4148
             Project: Pig
          Issue Type: Sub-task
          Components: tez
            Reporter: Cheolsoo Park
            Assignee: Cheolsoo Park
             Fix For: 0.14.0


In Tez, the FindQuantiles UDF is called with fewer samples than in MR, which results in
skewed range partitions.

For example, I have a job that runs sampling with a parallelism of 300. Since each task samples
100 records, the total sample should be 30K records. But the FindQuantiles UDF is called with
only 300 records:
{code}
# Plan on vertex
POValueOutputTez - scope-282    ->   [scope-283]
|
|---New For Each(false)[tuple] - scope-281
    |   |
    |   POUserFunc(org.apache.pig.backend.hadoop.executionengine.tez.FindQuantilesTez)[tuple] - scope-280
    |   |
    |   |---Project[tuple][*] - scope-279
    |
    |---New For Each(false,false)[tuple] - scope-278
        |   |
        |   Constant(300) - scope-277 <--- 300 should be 30K!
        |   |
        |   Project[bag][1] - scope-275
        |
        |---Package(Packager)[tuple]{bytearray} - scope-274
{code}
This is because we set the number of samples to the parallelism of the sampling vertex.
{code}
// We temporarily set it to rp and will adjust it at runtime, because the final degree of
// parallelism is unknown until we are ready to submit it. See PIG-2779.
rpce.setValue(rp);
{code}
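
The arithmetic behind the example can be sketched as follows. This is only an illustration of the expected versus actual sample count, not the Pig code or the eventual fix; the class and method names are hypothetical, and the values 300 and 100 are taken from the job described above.

```java
// Hypothetical illustration only; none of these names exist in Pig.
class SampleCountSketch {

    // Total number of records the quantile UDF should see if every
    // sampling task contributes samplesPerTask records.
    static long expectedTotalSamples(int samplingParallelism, int samplesPerTask) {
        return (long) samplingParallelism * samplesPerTask;
    }

    public static void main(String[] args) {
        int samplingParallelism = 300; // parallelism of the sampling vertex in the example
        int samplesPerTask = 100;      // records sampled by each task in the example

        long expected = expectedTotalSamples(samplingParallelism, samplesPerTask);
        // What the plan actually passes: Constant(300), i.e. the parallelism itself.
        long passedToUdf = samplingParallelism;

        System.out.println("expected=" + expected
                + " passed=" + passedToUdf
                + " ratio=" + (expected / passedToUdf));
    }
}
```

With the numbers from this report, the value passed to the UDF is smaller than the expected sample count by a factor equal to the per-task sample count (100).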



