hadoop-mapreduce-issues mailing list archives

From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Fri, 30 Oct 2009 05:59:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771818#action_12771818 ]

Amareshwari Sriramadasu commented on MAPREDUCE-323:
---------------------------------------------------

bq. I would also request for removing jobname from the history filename. 
This is done as part of MAPREDUCE-157. I will port the change to the Yahoo! distribution with
this patch.

Options for the directory structure of the history files are:
# {$hadoop.log.dir}/history/done/YYYY-MM-DD/
# {$hadoop.log.dir}/history/done/YYYY-MM-DD/USER
# {$hadoop.log.dir}/history/done/USER/YYYY-MM-DD/
# {$hadoop.log.dir}/history/done/USER/YYYY/MM/DD
# {$hadoop.log.dir}/history/done/YYYY/MM/DD/USER

For the directory structure, I would go with option #1, because it is easy to maintain. We
can add more when needed.
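
As a rough illustration of option #1 combined with the <jobid>_<user>.log file name discussed in this comment, the path could be assembled as sketched below. The class name, method names, sample log directory and sample job id are hypothetical, and the sketch assumes some per-job date is used for the directory (the comment does not say whether that is the submit or the finish date). It uses java.time for brevity, not anything from the actual patch.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hadoop: builds paths for layout option #1.
class HistoryDirLayout {

    // Option #1: {$hadoop.log.dir}/history/done/YYYY-MM-DD/
    static String doneDir(String hadoopLogDir, LocalDate day) {
        // ISO_LOCAL_DATE formats a date as YYYY-MM-DD.
        return hadoopLogDir + "/history/done/"
                + day.format(DateTimeFormatter.ISO_LOCAL_DATE) + "/";
    }

    // History file name without the job name: <jobid>_<user>.log
    static String historyFileName(String jobId, String user) {
        return jobId + "_" + user + ".log";
    }

    public static void main(String[] args) {
        // Sample values are made up for the illustration.
        LocalDate day = LocalDate.of(2009, 10, 30);
        System.out.println(doneDir("/var/log/hadoop", day)
                + historyFileName("job_200910300559_0001", "amar"));
    }
}
```

Keeping a single date level means cleanup of old history only has to list and delete whole directories by name.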

We can have a cache in JobTracker to look up the history location for each jobid (it can be moved
to HistoryServer when we move history to a separate server). We can have JT maintain the cache
for the last 20 days of history (configurable).
Now, the file name of the history log file is <jobid>_<user>.log. The job
id is about 20 characters long, and if the user name is about 25 characters, the jobhistory file
name is about 50 bytes long. For a given jobid, the cache entry in JT will be
at most 100 bytes. 50,000 such entries would make it 5MB.
We can have a configuration to limit the number of entries in the cache, with a default value of
50,000.
Thus, the cache is controlled by the number of days for which the cache is maintained
and is also capped by the number of entries in the cache.
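
A minimal sketch of such a doubly bounded cache is given below, assuming plain strings for the job id and the history location. All class and method names are hypothetical, and this is not the JobTracker implementation. It also assumes entries are added with non-decreasing timestamps, so the oldest entry is always first. With the estimate above, 50,000 entries x 100 bytes = 5,000,000 bytes, or about 5MB.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: jobid -> history location, bounded by age and by entry count.
class HistoryLocationCache {

    // Value stored per job: the location plus the time it was cached.
    private static final class CachedLocation {
        final String location;
        final long addedAtMillis;

        CachedLocation(String location, long addedAtMillis) {
            this.location = location;
            this.addedAtMillis = addedAtMillis;
        }
    }

    private final int maxEntries;
    private final long maxAgeMillis;
    private final LinkedHashMap<String, CachedLocation> entries;

    HistoryLocationCache(int maxEntries, long maxAgeMillis) {
        this.maxEntries = maxEntries;
        this.maxAgeMillis = maxAgeMillis;
        // Insertion-ordered map; the eldest entry is dropped once the cap is exceeded.
        this.entries = new LinkedHashMap<String, CachedLocation>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedLocation> eldest) {
                return size() > HistoryLocationCache.this.maxEntries;
            }
        };
    }

    synchronized void put(String jobId, String location, long nowMillis) {
        entries.remove(jobId); // re-insert so map order follows insertion time
        entries.put(jobId, new CachedLocation(location, nowMillis));
    }

    // Returns the cached location, or null if it is absent or has aged out.
    synchronized String get(String jobId, long nowMillis) {
        expire(nowMillis);
        CachedLocation hit = entries.get(jobId);
        return hit == null ? null : hit.location;
    }

    synchronized int size() {
        return entries.size();
    }

    // Drops entries older than maxAgeMillis, starting from the oldest.
    private void expire(long nowMillis) {
        Iterator<CachedLocation> it = entries.values().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().addedAtMillis <= maxAgeMillis) {
                break; // remaining entries are newer
            }
            it.remove();
        }
    }
}
```

With the defaults proposed here, the cache would be created with maxEntries = 50,000 and maxAgeMillis equal to 20 days.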

If the history location is not present in the JT cache, the JT history web UI does not show the
page.
An interested user can call the API Cluster.getJobHistoryUrl(JobID, boolean getFromDFS) to
get the URL from the DFS if it is not present in JT.
We can add *bin/hadoop job -historyurl <jobid>* to get the history URL for the jobid
from the JT cache. We can add another argument to the command to get the history URL from DFS
if it is not present in the JT cache.
Then, HistoryViewer can be used to view the history on the command line.
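
The lookup order proposed above (JT cache first, DFS only when asked) could look roughly like the sketch below. It is only an illustration: the resolver class is hypothetical, the JT cache is stood in for by a plain Map, and the DFS search is passed in as a function because how the DFS would be searched is not specified in this comment. Only the method name and the getFromDFS flag come from the proposal.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the proposed lookup order; not the real Cluster API.
class HistoryUrlResolver {

    private final Map<String, String> jtCache;                  // jobid -> history URL
    private final Function<String, Optional<String>> dfsLookup; // assumed DFS search

    HistoryUrlResolver(Map<String, String> jtCache,
                       Function<String, Optional<String>> dfsLookup) {
        this.jtCache = jtCache;
        this.dfsLookup = dfsLookup;
    }

    // Mirrors the shape of the proposed getJobHistoryUrl(JobID, boolean getFromDFS).
    Optional<String> getJobHistoryUrl(String jobId, boolean getFromDFS) {
        String cached = jtCache.get(jobId);
        if (cached != null) {
            return Optional.of(cached); // served from the JT cache
        }
        // Cache miss: consult the DFS only if the caller asked for it.
        return getFromDFS ? dfsLookup.apply(jobId) : Optional.empty();
    }
}
```

Making the DFS search opt-in keeps the common path cheap, since a miss in the JT cache does not trigger a directory scan unless the caller requests one.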

Thoughts?

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
problems when there is a need to search the history folder (job-recovery etc). It would be
nice if we group all the jobs under a _user_ folder. So all the jobs for user _amar_ will
go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
date, jobname_ etc but using _username_ will make the search much more efficient and also
will not result in namespace explosion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

