hadoop-mapreduce-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Mon, 02 Nov 2009 18:25:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772585#action_12772585 ]

Doug Cutting commented on MAPREDUCE-323:

> Per-hour directories look like over-kill. On the average, For each user, there would
> be 10 jobs finished in an hour.

The maximum is more important than the average, no?  Couldn't there be a user who submits
a job every minute, or even more often?  But, as I stated later, I think breaking directories
up by job-id makes lookup simpler and gives us more explicit limits on directory sizes.  So
I'd prefer that to time-based directories.
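
To make the job-id idea concrete, here is a minimal sketch, not a proposed patch.  It assumes
job ids of the form job_<jobtracker-id>_<sequence>.  The names JobHistoryPaths, historyDir and
JOBS_PER_DIR, the /history root, and the cap of 1000 jobs per directory are all hypothetical.

```java
public class JobHistoryPaths {
    // Assumed cap on history files per leaf directory (illustrative value only).
    static final int JOBS_PER_DIR = 1000;

    // Maps a job id such as "job_200911021825_0042" to the directory that
    // would hold its history file: <root>/<jobtracker-id>/<bucket>.
    static String historyDir(String root, String jobId) {
        String[] parts = jobId.split("_");
        if (parts.length != 3 || !parts[0].equals("job")) {
            throw new IllegalArgumentException("Unexpected job id: " + jobId);
        }
        int sequence = Integer.parseInt(parts[2]);
        // Consecutive sequence numbers share a bucket, so no leaf directory
        // ever holds more than JOBS_PER_DIR jobs, however fast jobs arrive.
        int bucket = sequence / JOBS_PER_DIR;
        return String.format("%s/%s/%06d", root, parts[1], bucket);
    }

    public static void main(String[] args) {
        System.out.println(historyDir("/history", "job_200911021825_0042"));
        System.out.println(historyDir("/history", "job_200911021825_12345"));
    }
}
```

The directory is computed from the job id alone, so a lookup needs no directory listing, and
the size limit holds regardless of how many jobs finish in a given hour.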

> This looks fine. But when we have permissions in place, inserting user becomes difficult.

I'm not sure what you mean by "permissions in place" and "inserting user".  It seems that
the intent is for users to be able to read their own job history files directly from HDFS.
It also seems that we generally don't want users to be able to read others' job history files.
So, if we keep all job history files in a single tree, we'd want the directories in
that tree to be world-readable, but the log files to be owned by, and readable only by, the
job's submitter.  Or, if we have per-user directories, we could make those readable only by
that user, providing greater privacy.  Is this what you mean?
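
As a rough illustration of the two options, the sketch below models them with POSIX-style
mode strings using the JDK's java.nio.file classes.  This is only a model: HDFS has its own
permission classes, and the class name HistoryPermissions and the specific modes are my
assumptions, not something settled in this issue.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class HistoryPermissions {
    // Option 1: one shared tree. Directories are world-readable,
    // history files are readable only by the submitting user.
    static final Set<PosixFilePermission> SHARED_DIR =
        PosixFilePermissions.fromString("rwxr-xr-x");
    static final Set<PosixFilePermission> PRIVATE_FILE =
        PosixFilePermissions.fromString("rw-------");

    // Option 2: per-user directories that only the owner can enter.
    static final Set<PosixFilePermission> PRIVATE_DIR =
        PosixFilePermissions.fromString("rwx------");

    // Listing a directory requires both read and execute for "others".
    static boolean othersCanList(Set<PosixFilePermission> dirPerms) {
        return dirPerms.contains(PosixFilePermission.OTHERS_READ)
            && dirPerms.contains(PosixFilePermission.OTHERS_EXECUTE);
    }

    static boolean othersCanRead(Set<PosixFilePermission> filePerms) {
        return filePerms.contains(PosixFilePermission.OTHERS_READ);
    }

    public static void main(String[] args) {
        System.out.println("shared dir listable by others:   " + othersCanList(SHARED_DIR));
        System.out.println("history file readable by others: " + othersCanRead(PRIVATE_FILE));
        System.out.println("per-user dir listable by others: " + othersCanList(PRIVATE_DIR));
    }
}
```

In the shared tree other users can still see which history files exist, though not their
contents; with per-user directories even the file names are hidden.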

When we call Cluster.getJobHistoryUrl() we'll know the user's ID, so I don't see a
top-level directory per user changing things much, if that's what you're worried about.
The more important decision, it seems to me, is how we break things into directories within
that.  Using job ids seems more scalable than using time-of-day.  Do you agree?

> Improve the way job history files are managed
> ---------------------------------------------
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amareshwari Sriramadasu
>            Priority: Critical
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
> problems when there is a need to search the history folder (job-recovery etc). It would be
> nice if we group all the jobs under a _user_ folder. So all the jobs for user _amar_ will
> go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
> date, jobname_ etc but using _username_ will make the search much more efficient and also
> will not result into namespace explosion.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
