hadoop-mapreduce-issues mailing list archives

From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Mon, 02 Nov 2009 06:32:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772432#action_12772432 ]

Amareshwari Sriramadasu commented on MAPREDUCE-323:

bq. Nick Rettinghouse, Tim Williamson, and Rajiv Chittajallu all suggested a preference for
per-hour directories, in particular, USER/YYYY/MM/DD/HH, an option you did not list. Should
we perhaps err on the side of a deeper structure, to ensure that we don't have to re-structure
things again?
Per-hour directories look like overkill. On average, there would be 10 jobs finished per
hour for each user.
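
To make the trade-off concrete, here is a minimal sketch of how the two layouts would place a finished job. The function name `history_dir` and the per-day layout USER/YYYY/MM/DD are assumptions for illustration only; the thread names only the per-hour layout USER/YYYY/MM/DD/HH.

```python
from datetime import datetime, timezone

# Figure quoted in the comment above: about 10 finished jobs per user per hour.
JOBS_PER_USER_PER_HOUR = 10


def history_dir(user: str, finished: datetime, per_hour: bool = False) -> str:
    """Return a relative history directory for a job (hypothetical helper).

    per_hour=False -> USER/YYYY/MM/DD     (assumed shallower alternative)
    per_hour=True  -> USER/YYYY/MM/DD/HH  (layout suggested in the quote)
    """
    fmt = "%Y/%m/%d/%H" if per_hour else "%Y/%m/%d"
    return f"{user}/{finished.strftime(fmt)}"


finished = datetime(2009, 11, 2, 6, 32, 59, tzinfo=timezone.utc)
print(history_dir("amar", finished))                 # amar/2009/11/02
print(history_dir("amar", finished, per_hour=True))  # amar/2009/11/02/06

# Entries per leaf directory at the quoted job rate.
print(JOBS_PER_USER_PER_HOUR)       # per-hour leaf: 10
print(JOBS_PER_USER_PER_HOUR * 24)  # per-day leaf: 240
```

At the quoted rate, a per-hour leaf would hold roughly 10 entries and a per-day leaf roughly 240, which is the basis for calling the extra level overkill.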

bq. However implementing Cluster.getJobHistoryUrl() would be expensive for archived jobs,
since the jobtracker must search the entire directory tree.
Here, the JobTracker need not search the entire directory tree. If the JobTracker does not
have the job in its cache, the job client itself can do the search.
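
The division of work described here can be sketched as below. This is only an illustration under stated assumptions: a local directory stands in for the history folder on the file system, the cache is a plain dict, history file names are assumed to begin with the job ID, and `lookup` and `find_history_file` are hypothetical names, not Hadoop APIs.

```python
import os
import tempfile
from typing import Dict, Optional


def find_history_file(root: str, job_id: str) -> Optional[str]:
    """Client-side search: walk the tree under root for the job's history file.

    Assumes the file name is the job ID, optionally followed by '_' and
    further fields.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == job_id or name.startswith(job_id + "_"):
                return os.path.join(dirpath, name)
    return None


def lookup(cache: Dict[str, str], root: str, job_id: str) -> Optional[str]:
    """Answer from the (JobTracker-side) cache if possible, else search."""
    if job_id in cache:
        return cache[job_id]
    return find_history_file(root, job_id)


# Small demonstration with a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    leaf = os.path.join(root, "amar", "2009", "11", "02")
    os.makedirs(leaf)
    path = os.path.join(leaf, "job_200911020632_0001_amar_wordcount")
    open(path, "w").close()
    print(lookup({}, root, "job_200911020632_0001") == path)  # True
```

The cost of the full walk falls on whichever process calls `find_history_file`, so moving that call to the client keeps the expensive case off the JobTracker.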

bq. Perhaps the directory structure should instead be based purely on the job ID? E.g., something
like: jobtrackerstarttime/00/00/00
This looks fine, but once permissions are in place, inserting the user into the path becomes difficult.
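
For illustration, one way a job ID could map onto jobtrackerstarttime/00/00/00 is sketched below. The thread does not say how the digits are split, so the scheme here is an assumption: the job sequence number is zero-padded to 8 digits and the first six digits become three two-digit levels, so each leaf covers 100 consecutive jobs. The helper name `job_id_dir` is hypothetical.

```python
def job_id_dir(job_id: str) -> str:
    """Map 'job_<jobtracker start time>_<sequence>' to a nested directory.

    Assumed scheme: pad the sequence number to 8 digits and use the first
    six digits as three two-digit directory levels.
    """
    prefix, start_time, seq = job_id.split("_")
    if prefix != "job":
        raise ValueError(f"not a job ID: {job_id!r}")
    padded = f"{int(seq):08d}"
    if len(padded) > 8:
        raise ValueError(f"sequence number too large for this sketch: {seq}")
    return "/".join([start_time, padded[0:2], padded[2:4], padded[4:6]])


print(job_id_dir("job_200911020632_0001"))    # 200911020632/00/00/00
print(job_id_dir("job_200911020632_123456"))  # 200911020632/00/12/34
```

A path built this way can be computed from the job ID alone, with no search. It carries no user component, though, which is why per-user permissions are hard to fit into it.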

> Improve the way job history files are managed
> ---------------------------------------------
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amareshwari Sriramadasu
>            Priority: Critical
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
> problems when there is a need to search the history folder (job recovery, etc.). It would be
> nice if we grouped all the jobs under a _user_ folder, so all the jobs for user _amar_ would
> go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
> date, jobname_, etc., but using _username_ will make the search much more efficient and also
> will not result in namespace explosion.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
