hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-211) Replicating small tables for joins
Date Sat, 19 Apr 2008 15:18:22 GMT

    [ https://issues.apache.org/jira/browse/PIG-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590708#action_12590708 ]

Pi Song commented on PIG-211:

These might be useful to you:

1) What really happens in our Pig MapReduce execution engine is that all the records on both
sides are separated into a number of buckets based on the sort key. Each bucket is then sorted
locally as part of Reduce. (We can do it this way because at the moment we only support
equi-joins.) Statistically, the data in each bucket should not be too big, though data skew
can still leave some buckets oversized. One possible remedy is a second bucketing function
that slices those buckets into smaller ones. A parameterized partitioner could be used as
well, but I don't think Hadoop currently supports it :(
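
To make the bucketing idea concrete, here is a minimal sketch in plain Python. It is not Pig or Hadoop code, and every name in it (bucket_of, partition, local_sort_join, bucketed_join, the level argument used as a hash salt) is hypothetical. It assumes records are (key, value) pairs, hash-partitions both sides, and re-buckets an oversized bucket with a differently salted hash before joining it with a local sort.

```python
from collections import defaultdict


def bucket_of(key, num_buckets, level):
    # Salting the hash with the level gives a different bucketing function
    # at each level (an assumption of this sketch, not Hadoop's partitioner).
    return hash((level, key)) % num_buckets


def partition(records, num_buckets, level):
    buckets = defaultdict(list)
    for key, value in records:
        buckets[bucket_of(key, num_buckets, level)].append((key, value))
    return buckets


def local_sort_join(left, right):
    # Equi-join of one bucket: sort both sides by key, then merge.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit all pairs.
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def bucketed_join(left, right, num_buckets=4, max_bucket_size=1000,
                  level=0, max_levels=2):
    lb = partition(left, num_buckets, level)
    rb = partition(right, num_buckets, level)
    out = []
    # Only buckets present on both sides can produce join results.
    for b in lb.keys() & rb.keys():
        l, r = lb[b], rb[b]
        if len(l) + len(r) > max_bucket_size and level < max_levels:
            # Bucket still too big: slice it again with the next hash.
            out.extend(bucketed_join(l, r, num_buckets, max_bucket_size,
                                     level + 1, max_levels))
        else:
            out.extend(local_sort_join(l, r))
    return out
```

Note that re-bucketing only separates different keys that happened to share a bucket. All records with the same key still land together, so a single very frequent key stays in one bucket no matter how many levels are used.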

2) An easy way to do what you've suggested is to use a UDF that reads from the small table
file. The file can be shipped to all the processing nodes using a mechanism similar to the
one we've got in Pig Streaming (see Pig Streaming SHIP in the Pig Wiki). I'm really starting
to think that the SHIP construct should not be limited to Streaming.
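
As a rough illustration of the replicated approach, the sketch below is plain Python rather than a Pig UDF, and the names load_small_table and replicated_join are hypothetical. It assumes the small table arrives as tab-separated "key<TAB>value" lines (for example, an open file that was shipped to the node) and fits in memory. The big table is then streamed past the in-memory lookup, so it never needs to be sorted or shuffled.

```python
from collections import defaultdict


def load_small_table(lines, sep="\t"):
    # Build an in-memory lookup from the small table. `lines` can be any
    # iterable of strings, such as an open file object.
    table = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split(sep, 1)
        table[key].append(value)
    return table


def replicated_join(big_records, small_table):
    # Stream the big table once; each record is joined by a dictionary
    # lookup, and input order is preserved in the output.
    for key, value in big_records:
        for small_value in small_table.get(key, ()):
            yield (key, value, small_value)
```

Each node would build the lookup once and reuse it for every record it processes, for example `list(replicated_join(big, load_small_table(open(path))))` where `big` and `path` are placeholders for the node's share of table A and the shipped copy of table B.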

This is part of the optimization work, which hasn't started yet, but it's good that we've
started a discussion. What's your opinion? Please keep giving us your ideas!

> Replicating small tables for joins
> ----------------------------------
>                 Key: PIG-211
>                 URL: https://issues.apache.org/jira/browse/PIG-211
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>            Priority: Minor
> Joining a table A with a small table B can be disproportionately expensive if A must
> be sorted before the join, and the result must be sorted again. This effort can often be reduced
> or eliminated if table B is replicated in whole to all nodes.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.