hadoop-pig-dev mailing list archives

From "Pi Song (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-129) need to create temp files in the task's working directory
Date Sun, 02 Mar 2008 13:00:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574215#action_12574215 ]

Pi Song commented on PIG-129:

I want to clarify a bit more about what I think, and *I really need your opinion* on this bit.
Temp file creation due to DataBag spills can happen in 2 places:
- In Hadoop Map Reduce execution engine
- In Local execution engine

I agree with you that the working dir mechanism in Hadoop is already good and that you're trying
to adopt it, *BUT* what about the local execution engine?

Even though most people pay more attention to the Hadoop backend, which is where Pig started,
I think the local engine still has its uses.

A sample use case: I have a big data file on my hard disk (so it cannot be too big),
and I just download Pig and quickly write a Pig script to process it
on my local machine using the local execution engine (without running Hadoop).

A good local engine implementation will help improve usability of Pig!!!

Can we handle this issue in 2 different ways? One for the Hadoop backend, one for the local engine.
I'm willing to implement what I've proposed in the last comment for the local engine.

> need to create temp files in the task's working directory
> ---------------------------------------------------------
>                 Key: PIG-129
>                 URL: https://issues.apache.org/jira/browse/PIG-129
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Amir Youssefi
> Currently, Pig creates temp data such as spilled bags in the directory specified by java.io.tmpdir.
> The problem is that this directory is usually shared by all tasks and can easily run out of
> space.
> A better approach would be to create these files in the temp dir inside the task's working
> directory, as these locations usually have much more space and can also be hosted on different
> disks, so the performance could be better.
> There are 2 parts to this fix:
> (1) Change org.apache.pig.data.DataBag to check whether the temp directory exists and create it
> if it does not, before trying to create the temp file. This is somewhere around line 390 in the code.
> (2) Change mapred.child.java.opts in hadoop-site.xml to include a new value for the tmpdir
> property that points to ./tmp. For instance:
> <property>
>         <name>mapred.child.java.opts</name>
>         <value>-Xmx1024M -Djava.io.tmpdir="./tmp"</value>
>         <description>arguments passed to child jvms</description>
> </property>
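
Part (1) of the quoted fix, checking that the temp directory exists before creating a spill file in it, could look roughly like the sketch below. This is not Pig's actual DataBag code: the class name SpillFileSketch, the method createSpillFile, and the "pigbag" file prefix are all hypothetical names chosen for illustration. Only java.io.File and the java.io.tmpdir system property come from the standard library.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical illustration only; not the real org.apache.pig.data.DataBag.
class SpillFileSketch {

    // Creates a temp file in tmpDir, creating the directory first if needed.
    static File createSpillFile(File tmpDir) throws IOException {
        // mkdirs() returns false when the directory already exists, so
        // re-check isDirectory() to tolerate a concurrent creator.
        if (!tmpDir.isDirectory() && !tmpDir.mkdirs() && !tmpDir.isDirectory()) {
            throw new IOException("Cannot create temp directory: " + tmpDir);
        }
        // "pigbag" is an assumed prefix, used here only for the example.
        File spill = File.createTempFile("pigbag", ".tmp", tmpDir);
        spill.deleteOnExit();
        return spill;
    }

    public static void main(String[] args) throws IOException {
        // With -Djava.io.tmpdir=./tmp this resolves relative to the
        // process working directory, which may not exist yet.
        File tmpDir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(createSpillFile(tmpDir).getAbsolutePath());
    }
}
```

The directory check matters with the hadoop-site.xml setting above because ./tmp is relative to the working directory and is not guaranteed to exist when the first bag spills.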

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
