hadoop-pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-116) pig leaves temp files behind
Date Mon, 25 Feb 2008 22:37:51 GMT

    [ https://issues.apache.org/jira/browse/PIG-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572283#action_12572283 ]

Olga Natkovich commented on PIG-116:
------------------------------------

Pi, I agree that the temp space solution is the right way to go. In fact, I have already discussed
this with the Hadoop team, and they agreed in principle that this is the right thing to do, though
it is not clear when that might happen.

For the short term, we also agreed on a better solution: they would give the application a way
to register cleanup functions that are guaranteed to run before DFS shuts down.
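
To illustrate the idea of registered cleanup functions, here is a minimal sketch in plain Java. It is not the Hadoop API: the class name CleanupRegistry and its methods are hypothetical, and the actual interface had not been defined at the time of this comment. The point is only that cleanups run, in registration order, while the file system is still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not a Hadoop or Pig API: a stand-in for a file system
// handle that runs registered cleanup functions before it closes.
class CleanupRegistry {
    private final List<Runnable> cleanups = new ArrayList<>();
    private boolean closed = false;

    // Register a function that must run while the file system is still open.
    public synchronized void register(Runnable cleanup) {
        if (closed) {
            throw new IllegalStateException("already closed");
        }
        cleanups.add(cleanup);
    }

    // Run every cleanup in registration order, then mark the system closed.
    public synchronized void close() {
        if (closed) {
            return;
        }
        for (Runnable cleanup : cleanups) {
            try {
                cleanup.run();
            } catch (RuntimeException e) {
                // One failing cleanup should not prevent the others from running.
                System.err.println("cleanup failed: " + e.getMessage());
            }
        }
        closed = true; // a real implementation would release connections here
    }

    public synchronized boolean isClosed() {
        return closed;
    }
}
```

With something like this, the application would register its temp-file deletion once and would no longer depend on the order in which JVM shutdown hooks happen to run.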

> pig leaves temp files behind
> ----------------------------
>
>                 Key: PIG-116
>                 URL: https://issues.apache.org/jira/browse/PIG-116
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Olga Natkovich
>
> Currently, Pig creates temp dirs via a call to FileLocalizer.getTemporaryPath. They are
> created on the client and are mainly used to store data between two M-R jobs. Pig then attempts
> to clean them up in the client's shutdown hook.
> The problem with this approach is that, because there is no way to order the shutdown
> hooks, in some cases the DFS is already closed when we try to delete the files, in which case
> a substantial amount of data can be left in DFS. I see this issue more frequently with Hadoop
> 0.16, perhaps because I had to add an extra shutdown hook to handle HOD disconnects.
> For the short term, I would like to propose the approach below:
> (1) If trash is configured on the cluster, use the trash location to create a temp directory
> that will expire in 7 days. The hope is that most jobs don't run longer than 7 days. The user
> can specify a longer interval via a command-line switch.
> (2) If trash is not enabled on the cluster, the location that we use now will be used.
> (3) In the shutdown hook, we will attempt to clean up. If the attempt fails and trash
> is enabled, we let trash handle it; otherwise we provide the list of locations to the user
> to clean. (I realize that this is not ideal but could not figure out a better way.)
> Longer term, I am talking with the Hadoop team about better temp file support: https://issues.apache.org/jira/browse/HADOOP-2815
> Comments? Suggestions?
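
The three-step proposal in the quoted description can be sketched roughly as follows. This is only an illustration in plain Java, not Pig code: the class TempDirPolicy, its method names, and the paths passed in are all hypothetical, and the delete operation is supplied by the caller as a predicate rather than being a real DFS call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed policy; none of these names exist in Pig.
class TempDirPolicy {
    // Expiry from step (1); the proposal lets the user raise it via a switch.
    public static final int DEFAULT_EXPIRY_DAYS = 7;

    // Steps (1) and (2): use the trash location when trash is enabled,
    // otherwise fall back to the location used today.
    public static String chooseTempRoot(boolean trashEnabled,
                                        String trashRoot,
                                        String currentRoot) {
        return trashEnabled ? trashRoot : currentRoot;
    }

    // Step (3): try to delete each temp dir. If deletion fails and trash is
    // enabled, trash expiry handles it; otherwise the dir is returned so it
    // can be reported to the user for manual cleanup.
    public static List<String> cleanup(List<String> tempDirs,
                                       boolean trashEnabled,
                                       Predicate<String> tryDelete) {
        List<String> needsManualCleanup = new ArrayList<>();
        for (String dir : tempDirs) {
            boolean deleted;
            try {
                deleted = tryDelete.test(dir);
            } catch (RuntimeException e) {
                deleted = false; // e.g. the DFS was already closed
            }
            if (!deleted && !trashEnabled) {
                needsManualCleanup.add(dir);
            }
        }
        return needsManualCleanup;
    }
}
```

The shutdown hook would call cleanup and print whatever list it returns; an empty list means either everything was deleted or trash will expire the remainder.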

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

