hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1218) Use distributed cache to store samples
Date Wed, 10 Feb 2010 23:20:31 GMT

     [ https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Richard Ding updated PIG-1218:

    Attachment: PIG-1218.patch

This patch uses the Hadoop DistributedCache to cache the sample files used by order by and
skewed join, as well as the side files used in FR join.

When an HDFS file is added to the DistributedCache, Pig generates a symlink to the file and,
at runtime, uses this symlink to open the file from the local working directory of the
task. To avoid symlink collisions, the symlink name is not the file name; it is generated
from a combination of the hash code of the file path and the current timestamp.
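
To illustrate the naming scheme, here is a minimal sketch, not the code from the patch. The
class name, the method names, the "pigsample_" prefix, and the example path are all
hypothetical. It only shows how a name built from the path's hash code and a timestamp can be
attached to a cache URI as a fragment, which is the form the DistributedCache uses to name a
symlink in the task's working directory.

```java
import java.net.URI;

public class SymlinkNames {

    // Hypothetical helper: builds a symlink name from the hash code of the
    // file path and a timestamp, so two cached files with the same base
    // name do not collide. The "pigsample_" prefix is an assumption.
    static String symlinkName(String hdfsPath, long timestampMillis) {
        // Integer.toHexString renders the hash as unsigned hex, so a
        // negative hash code does not put a '-' into the name.
        return "pigsample_" + Integer.toHexString(hdfsPath.hashCode())
                + "_" + timestampMillis;
    }

    // Hypothetical helper: appends the symlink name as the URI fragment.
    // Assumes hdfsPath is already a legal URI string (no unescaped spaces).
    static URI cacheUri(String hdfsPath, String symlink) {
        return URI.create(hdfsPath + "#" + symlink);
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode:8020/tmp/pig/sample-0001"; // made-up path
        String link = symlinkName(path, System.currentTimeMillis());
        URI uri = cacheUri(path, link);
        System.out.println("cache URI:    " + uri);
        System.out.println("symlink name: " + uri.getFragment());
    }
}
```

In a real job, a URI of this form would be registered with the DistributedCache, and the task
would then open the file by the symlink name relative to its working directory.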

The replication factor for the sample file in HDFS is not changed by this patch, for two
reasons: it is not clear what the right replication factor would be, and implementing the
change in Pig is not trivial.

> Use distributed cache to store samples
> --------------------------------------
>                 Key: PIG-1218
>                 URL: https://issues.apache.org/jira/browse/PIG-1218
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>         Attachments: PIG-1218.patch
> Currently, in the case of skew join and order by, we use a sample that is just written to
the DFS (not the distributed cache) and, as a result, gets opened and copied around more than
necessary. This impacts query performance and also places unnecessary load on the name node.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
