pig-dev mailing list archives

From "Dmitriy V. Ryaboy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2661) Pig uses an extra job for loading data in Pigmix L9
Date Mon, 03 Sep 2012 17:11:08 GMT

    [ https://issues.apache.org/jira/browse/PIG-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447354#comment-13447354 ]

Dmitriy V. Ryaboy commented on PIG-2661:
----------------------------------------

Ok, some fresh thoughts rolling in after sleeping on this.

Why do we have this foreach in the first place? It's inserted to achieve the following goals:
* pad nulls (in PIG-2824, Jie saw perf problems from that, and I suggested we get rid of the
foreach altogether, getting POLoad to do the null padding instead).
* coerce tuples generated by the loader into schemas specified in the "load as.." statement
* drop unneeded columns

(please let me know if this list is incomplete)

For padding nulls, I believe we can achieve the same effect much more cheaply, and without
the side effect that's biting us here, by making basic modifications to POLoad.

For coercing into schemas, we can do the same thing -- copy all the fields from the incoming
tuple (including excess ones), and only convert the ones we know something about. This can
also be done directly in POLoad, and only needs to be triggered if the loader doesn't already
tell us what schema it's returning, or if the schemas don't match type-wise.
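
To make the padding and coercion ideas above concrete, here is a rough sketch in plain Java.
It does not touch the real POLoad class: LoadFixupSketch, FieldType and padAndCoerce are
hypothetical names, rows are modelled as plain lists rather than Pig tuples, and turning a
failed conversion into null is an assumption about the behaviour we'd want.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not Pig's POLoad code. */
public class LoadFixupSketch {

    /** Stand-in for the declared field types of a "load ... as" schema. */
    public enum FieldType { INT, LONG, DOUBLE, CHARARRAY, UNKNOWN }

    /**
     * Pads a loaded row with nulls up to the declared schema width and converts
     * only the fields whose type is declared. Excess fields are copied as-is.
     */
    public static List<Object> padAndCoerce(List<Object> row, List<FieldType> declared) {
        int width = Math.max(row.size(), declared.size());
        List<Object> out = new ArrayList<>(width);
        for (int i = 0; i < width; i++) {
            // Null padding: positions missing from the loaded row become null.
            Object value = i < row.size() ? row.get(i) : null;
            // Fields beyond the declared schema have no known type.
            FieldType type = i < declared.size() ? declared.get(i) : FieldType.UNKNOWN;
            out.add(coerce(value, type));
        }
        return out;
    }

    private static Object coerce(Object value, FieldType type) {
        if (value == null || type == FieldType.UNKNOWN) {
            return value; // nothing known about this field, pass it through
        }
        String text = value.toString();
        try {
            switch (type) {
                case INT:       return Integer.valueOf(text);
                case LONG:      return Long.valueOf(text);
                case DOUBLE:    return Double.valueOf(text);
                case CHARARRAY: return text;
                default:        return value;
            }
        } catch (NumberFormatException e) {
            return null; // assumption: a value that cannot be converted becomes null
        }
    }
}
```

The point of the sketch is that excess fields survive untouched, which is what a sample
loader that appends marker columns would need.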

This leaves dropping columns. Since in that case the whole point is to not carry along unwanted
columns, this use case is clearly in conflict with the way the PoissonSampleLoader (PSL) wants
to work, by inserting extra columns and sneaking them through to the UDF linked to it. Moreover,
if we go the route of putting the plan that sits between load and skewed join in between the
sample loader and the GetMemNumRows UDF, other things may also break the sampling -- for
example, filters that happen to filter out the specially marked tuples by accident. This is
telling us that messing with the tuples PSL returns is problematic.

What if instead we created a UDF that was fed all the tuples from a regular loader, with the
rest of the pipeline that gets inserted, but was able to signal to its consumers when it's
done -- thus effectively recreating PoissonSampleLoader's functionality in addition to
GetMemNumRows? It would output sample tuples or nulls, and we can add a null filter right
above it. I believe that gives us everything we are looking for and simplifies the pipeline
a fair bit. We'd have to add the capability for UDFs to terminate early, of course. That's
already been done for Accumulative UDFs in PIG-2066, and I think it should be straightforward
to do for regular UDFs.
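
A rough sketch of the shape I have in mind, in plain Java with no Pig dependencies. Everything
here is hypothetical: EveryNthSampler, isFinished and run are made-up names, rows are plain
lists, and taking every Nth row is a deliberately simple stand-in for the Poisson sampling
that PoissonSampleLoader does today.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in Pig. */
public class SamplingSketch {

    /** Returns a sample row or null for each input, and reports when it is done. */
    public static class EveryNthSampler {
        private final int interval;
        private final int maxSamples;
        private long rowsSeen = 0;
        private int samplesEmitted = 0;

        public EveryNthSampler(int interval, int maxSamples) {
            if (interval < 1 || maxSamples < 0) {
                throw new IllegalArgumentException("interval must be >= 1, maxSamples >= 0");
            }
            this.interval = interval;
            this.maxSamples = maxSamples;
        }

        /** Emits every interval-th row until maxSamples rows were emitted, else null. */
        public List<Object> exec(List<Object> row) {
            rowsSeen++;
            if (isFinished() || rowsSeen % interval != 0) {
                return null;
            }
            samplesEmitted++;
            return row;
        }

        /** The early-termination signal: consumers can stop feeding rows. */
        public boolean isFinished() {
            return samplesEmitted >= maxSamples;
        }

        public long rowsSeen() {
            return rowsSeen;
        }
    }

    /** Stand-in for the pipeline: feed rows, drop nulls, stop once the sampler is done. */
    public static List<List<Object>> run(Iterator<List<Object>> input, EveryNthSampler sampler) {
        List<List<Object>> samples = new ArrayList<>();
        while (!sampler.isFinished() && input.hasNext()) {
            List<Object> out = sampler.exec(input.next());
            if (out != null) { // the null filter sitting right above the UDF
                samples.add(out);
            }
        }
        return samples;
    }
}
```

Because the sampler sees ordinary tuples after the rest of the pipeline has run, upstream
filters or column pruning can't strip out special marker tuples; there aren't any.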

Thoughts?
                
> Pig uses an extra job for loading data in Pigmix L9
> ---------------------------------------------------
>
>                 Key: PIG-2661
>                 URL: https://issues.apache.org/jira/browse/PIG-2661
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Jie Li
>            Assignee: Jie Li
>         Attachments: PIG-2661.0.patch, PIG-2661.1.patch, PIG-2661.2.patch, PIG-2661.3.patch,
>                      PIG-2661.4.patch, PIG-2661.5.patch, PIG-2661.6.patch, PIG-2661.7.patch,
>                      PIG-2661.8.patch, PIG-2661.plan.txt
>
>
> See https://issues.apache.org/jira/browse/PIG-200?focusedCommentId=13260155&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13260155

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
