hadoop-common-dev mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2735) Setting default tmp directory for java createTempFile (java.io.tmpdir)
Date Wed, 06 Feb 2008 18:39:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566261#action_12566261 ]

Allen Wittenauer commented on HADOOP-2735:
------------------------------------------

For the most part, I agree with Pi's comments.  

Koji and I just had a quick discussion about this, and I think we've come up with a good idea.
Now we want to toss it to the wolves. :)

Quick summary of the issue as I understand it:

1) We have applications that depend on the java.io.tmpdir property being set.

2) These applications may be independently/inadvertently writing data to the same place.
If this data is large, the disk may fill up.  On UNIX, this may have dire consequences
(/tmp is typically either on / or backed by swap).

3) Hard coding is generally bad, as it makes assumptions about task behavior and file system
layout.  In particular, ./tmp is bad because it assumes that the task hasn't changed its
cwd.
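
To illustrate point 1, here is a rough sketch (the class and method names are made up for this example, not taken from Hadoop or Pig). File.createTempFile falls back to java.io.tmpdir whenever no directory is passed, which is how unrelated tasks can all end up writing to the same place:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop or Pig.
class TmpDirDemo {
    // The directory the JVM uses when createTempFile is given no directory.
    static File defaultTmpDir() {
        return new File(System.getProperty("java.io.tmpdir"));
    }

    // Creates a temp file in 'dir', or in java.io.tmpdir when 'dir' is null.
    static File spill(File dir) {
        try {
            File f = File.createTempFile("spill", ".tmp", dir);
            f.deleteOnExit();
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // No directory given: every JVM on the box shares this location.
        System.out.println("shared default: " + spill(null).getParent());

        // Explicit directory: the caller decides where the data lands.
        File own = new File(defaultTmpDir(), "task-private-tmp");
        own.mkdirs();
        own.deleteOnExit();
        System.out.println("explicit dir:   " + spill(own).getParent());
    }
}
```

An application that only uses the two-argument form of createTempFile has no way to pick the directory itself, so the only knob left is the java.io.tmpdir property passed at JVM launch.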

So this is what we propose:

We create a new Hadoop property called mapred.child.tmp.  This property accepts one of three kinds of value:

default == we leave java.io.tmpdir alone

dynamic == we dynamically calculate the full path of our mapred task directory's tmp dir
(the end result would be the equivalent of ./tmp, except that instead of depending upon '.',
it would be the actual path to where mapred normally cwd's to: mapred.local.dir/blah/blah/blah/.../tmp)

anything else == a path provided by the user
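
A minimal sketch of how the three cases above might be resolved (this is not the attached patch; the class name, method name, and the example paths are all hypothetical, and it assumes the caller already knows the task's working directory):

```java
import java.io.File;

// Hypothetical helper illustrating the proposal; not actual Hadoop code.
class ChildTmpResolver {
    /**
     * Maps a mapred.child.tmp setting to the value to pass as java.io.tmpdir.
     * Returns null when the JVM default should be left alone.
     */
    static String resolve(String setting, File taskWorkDir) {
        if (setting == null || setting.equals("default")) {
            return null;                       // leave java.io.tmpdir untouched
        }
        if (setting.equals("dynamic")) {
            // Absolute path, so it stays valid even if the task changes its cwd.
            return new File(taskWorkDir, "tmp").getAbsolutePath();
        }
        return setting;                        // anything else: a user-provided path
    }

    public static void main(String[] args) {
        File workDir = new File("/example/mapred/local/some-task/work"); // made-up path
        for (String s : new String[] {"default", "dynamic", "/scratch/tmp"}) {
            String tmp = resolve(s, workDir);
            System.out.println(s + " -> "
                + (tmp == null ? "(JVM default)" : "-Djava.io.tmpdir=" + tmp));
        }
    }
}
```

The point of the dynamic case is that the path is computed once, up front, from the directory the framework itself chose, so it does not depend on what '.' means later in the task's life.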

With this type of change, we can cover a wide variety of cases: applications that assume
that java.io.tmpdir is the same across all tasks, applications that require a separate
java.io.tmpdir for each task, ops wanting to 'spread the load' across multiple drives, etc.

Thoughts?

> Setting default tmp directory for java createTempFile (java.io.tmpdir)
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-2735
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2735
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Koji Noguchi
>            Assignee: Amareshwari Sri Ramadasu
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: patch-2735.txt
>
>
> On our cluster, we've seen Pig (http://incubator.apache.org/pig/) filling up /tmp and failing.
> (This is also inefficient, since all the local tasks were spilling to the same disk.)
> Pig is simply using the Java API createTempFile,
> http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File
> Can we add -Djava.io.tmpdir="./tmp" somewhere?
> so that, 
> 1) Tasks can utilize all disks when using tmp
> 2) Any undeleted tmp files will be deleted by the tasktracker when task(job?) is done.
> The easiest way is to set it inside mapred.child.java.opts in the config, but this can
> be overwritten if the users set their own task heapsize.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

