pig-dev mailing list archives

From "Jie Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
Date Tue, 26 Jun 2012 21:14:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401687#comment-13401687 ]

Jie Li commented on PIG-2661:
-----------------------------

An interesting problem:

Previously, for order-by, Pig would force any preceding pipeline to finish and write its output
to disk first, and then sample that data and sort it, so the sampler saw the same data that was
going to be sorted. Now we want to merge the preceding map-only pipeline into both the sampler
and the order-by. The sampler will sample the data before that pipeline and pass the samples
through the pipeline to generate the partition file. See the query:

{code}
a = load 'data' as (x,y);
b = filter a by udf(x,y);
c = foreach b generate udf(x,y);
d = order c by x;
{code}

Here a->b->c is the pipeline before order-by. Previously, Pig would write c to disk first,
and then the sampler would get samples from c. Now we want to avoid writing c to disk, so
the sampler will load a to get samples and pass them through b and c to generate the
partition file. Here b and c can be projections, filters, or any other non-blocking operators.
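
The new flow can be sketched roughly as follows. This is only an illustration in Python, not
Pig's sampler code: the function name, the per-record Bernoulli sampling, the quantile-based
boundaries, and the lambdas standing in for the UDFs are all assumptions made for the example.

```python
import random

def build_partition_boundaries(records, keep, transform, key,
                               sample_rate, num_reducers, seed=0):
    """Sample raw input, run the samples through the pipeline, pick range boundaries."""
    rng = random.Random(seed)
    # Sample(a): keep each raw record with probability sample_rate (assumed scheme).
    sampled = [r for r in records if rng.random() < sample_rate]
    # Pass only the sampled records through b (filter) and c (foreach).
    piped = [transform(r) for r in sampled if keep(r)]
    keys = sorted(key(r) for r in piped)
    if not keys:
        return []
    # num_reducers - 1 cut points split the sampled keys into roughly equal ranges.
    return [keys[(i * len(keys)) // num_reducers] for i in range(1, num_reducers)]

# Hypothetical data standing in for 'data' with fields (x, y).
a = [(i, i % 7) for i in range(10_000)]
boundaries = build_partition_boundaries(
    a,
    keep=lambda r: r[0] % 2 == 0,      # stands in for: filter a by udf(x,y)
    transform=lambda r: (r[0] * 2,),   # stands in for: foreach b generate udf(x,y)
    key=lambda r: r[0],                # stands in for: order c by x
    sample_rate=0.1,
    num_reducers=4,
)
```

The point of the sketch is that c is never materialized: only the sampled records of a go
through b and c before the boundaries are computed.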

One concern is whether the new way of sampling would still capture the distribution of the
data to be sorted.

||What we want||What we have now||What we'll have||
|Distribution(a->b->c)|Distribution(Sample(a->b->c))|Distribution(Sample(a)->b->c)|

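Assuming Sample picks records uniformly at random, and b and c process each record
independently (as filters and projections do), sampling before or after the pipeline keeps
the original distribution, so the three distributions in the table should be equivalent up
to sampling error.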
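
A quick way to sanity-check this is a small simulation. The sketch below is in Python and
rests on assumptions that are not taken from Pig: uniform random input, a per-record Bernoulli
sampler at 5%, and a made-up filter and projection standing in for b and c. It compares the
quartiles of the full pipeline output with both sampling orders.

```python
import random

def pipeline(records):
    # b: filter, c: per-record projection (both hypothetical stand-ins).
    return [x * x for x in records if x > 0.5]

def bernoulli_sample(records, rate, rng):
    # Keep each record independently with probability rate.
    return [r for r in records if rng.random() < rate]

def quantiles(values, parts):
    # parts - 1 cut points taken from the sorted values.
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // parts] for i in range(1, parts)]

rng = random.Random(42)
a = [rng.random() for _ in range(100_000)]

full = quantiles(pipeline(a), 4)                                # Distribution(a->b->c)
old_way = quantiles(bernoulli_sample(pipeline(a), 0.05, rng), 4)  # Sample(a->b->c)
new_way = quantiles(pipeline(bernoulli_sample(a, 0.05, rng)), 4)  # Sample(a)->b->c

for name, q in (("full", full), ("old", old_way), ("new", new_way)):
    print(name, [round(v, 3) for v in q])
```

Under these assumptions the three sets of quartiles should differ only by sampling noise.
The argument would not carry over to an operator that looks at more than one record at a time.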

Another concern is performance. With the patch, the sampler does a full scan of the table
before the filter, which might be slower than before if the filter is very selective.
This might be acceptable considering that the sampler only parses a small percentage of the
data. I will run some benchmarks.
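
To make the trade-off concrete, here is a back-of-the-envelope record count in Python. It is
not a benchmark: the function name and the numbers (1M records, 1% selectivity, 1% sample
rate) are hypothetical, and the model ignores I/O, parsing, and UDF cost differences.

```python
def sampler_scan_cost(total_records, selectivity, sample_rate):
    """Rough record counts for the sampler under the old and new plans."""
    filtered = round(total_records * selectivity)
    return {
        # Old plan: the sampler scans the materialized output c.
        "old_scanned": filtered,
        # New plan: the sampler scans the raw input a ...
        "new_scanned": total_records,
        # ... but only the sampled records go through b and c.
        "new_piped": round(total_records * sample_rate),
    }

cost = sampler_scan_cost(total_records=1_000_000, selectivity=0.01, sample_rate=0.01)
print(cost)
```

With a very selective filter the new sampler scans far more records than the old one, but the
old plan also paid for an extra job to write c to disk, which this count leaves out.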

                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch
>
>
> See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155


        
