hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-241) Sharding and joins
Date Fri, 30 May 2008 12:45:45 GMT

    [ https://issues.apache.org/jira/browse/PIG-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601122#action_12601122
] 

Pi Song commented on PIG-241:
-----------------------------

I think we would have to work below the abstraction provided by Hadoop in order to achieve
such optimization.  This would mean Hadoop has to support direct control over physical file
placement in its APIs.

My suggestion:-

One possible optimization from distributed database textbooks is fragment-aware relational
algebra. Files on HDFS are small chunks which are already natural fragments. If we could :-
 - Cluster same or close keys in the same set of chunks.
 - Map sets of chunks to sets of key ranges using Metadata.
Then we should be able to save a fair amount of unnecessary processing.

> Sharding and joins
> ------------------
>
>                 Key: PIG-241
>                 URL: https://issues.apache.org/jira/browse/PIG-241
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>
> Many large distributed systems for storage and computing over tables divide these tables
into smaller _shards,_ such that all rows with the same (primary) key will appear in the same
shard. If two tables are consistently sharded, then they can be joined shard-by-shard. If
corresponding shards are stored on the same hosts (or racks), then joins can be performed
locally on those hosts without copying the rows of the tables over the network; this can produce
significant speedups.
> Pig does not currently provide application-controlled sharding and the associated shard
placement and computation placement. The performance of joins therefore suffers in many scenarios;
rows are passed over the network multiple times when performing a join. If Pig (and Hadoop)
could provide the ability for the application to shard tables consistently, according to an
application-controlled policy, joins could be completely local operations and could in many
cases perform much better.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message