hadoop-mapreduce-issues mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Tue, 08 Jun 2010 08:16:16 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876594#action_12876594 ]

Amar Kamat commented on MAPREDUCE-323:

A few comments:
# With respect to your [comment | http://tinyurl.com/2aado36], we could use the finishtime
of the job. It is already published in the job summary, stored in the job status cache
within the JobTracker, and later archived to the completed-job-status-store. Maybe we can reuse
these features (i.e., the job status cache and the status store).
# We should log jobhistory activities such as
  ## the jobhistory folder regex used
  ## jobid to foldername mappings
  Logging will help in debugging and post-mortem analysis.
# Formats can change across runs. How do we plan to take care of that? One thing we can do
is to have a unique folder per pattern for storing the files. The (unique) folder name should
be based on the jobhistory structure pattern. This mapping of jobhistory folder regex to
folder name should be logged.
  Clients that need really old jobhistory files analyzed will dig up the jobhistory folder
format, map it to the folder, and provide the _username_, _jobid_ and _finishtime_ to get the
file. The client can get the _username_ and _finishtime_ by querying the JobTracker for the
job status (via the completed-job-status-store). See _Future steps #1_.
# How about keeping _N_ items in the top-level directory and moving them to the appropriate
place only when the total item count crosses _N_?
  Example (assume /done/%user/%jobid as the format and N=5)
  ## The first job gets added to /done/job1
  ## The 5th job gets added to /done/job5
  ## The 6th job gets added to /done/job6 and /done/job1 gets moved to /done/user1/job1
  ## and so on
So the movement happens only on overflow. The benefit of this change is that, without any indexing,
we can show the recent N jobs on the jobhistory webui. This pattern can be enabled for all
subfolders also. So if the jobhistory format specified is %user/, then queries like '_give
the recent 5 items for all the users_' can also be answered quickly.
# The webui should provide 2 views
   ## top/recent few (show jobs from the topmost-level folder)
   ## a browsable view where YYYY/MM/DD etc. is shown as is. This can be configurable and
turned off for complicated structures like 00/00/00-99 etc., which the users might not be able
to make sense of. Also, there should be some kind of widget in JobHistory that, given _username_,
_jobid_ and _finishtime_, provides the complete jobhistory filename. See _Future steps #2_.
# bq. .... He raised the issue that a practical cluster has more distinct users than we would
want to create DFS directories, especially if the directory structure is further split on
I would prefer username to be one of the configuration options. Since it's configurable, it
can be turned off for clusters that have lots of users.
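
The overflow idea in comment #4 can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not JobTracker code: the class name `DoneFolder` and its methods are hypothetical, the paths are plain strings rather than DFS operations, and the format is assumed to be /done/%user/%jobid as in the example above.

```python
from collections import deque


class DoneFolder:
    """Toy model of the 'move only on overflow' scheme (hypothetical, not Hadoop code)."""

    def __init__(self, n):
        self.n = n              # maximum number of items kept at the top level
        self.top = deque()      # (user, jobid) pairs at the top level, oldest first
        self.archived = {}      # jobid -> path after the item has been moved

    def add(self, user, jobid):
        """Add a finished job at the top level; move the oldest item on overflow."""
        self.top.append((user, jobid))
        if len(self.top) > self.n:
            old_user, old_jobid = self.top.popleft()
            # Assumed format: /done/%user/%jobid
            self.archived[old_jobid] = f"/done/{old_user}/{old_jobid}"

    def recent(self):
        """Paths of the most recent jobs, newest first, without any index."""
        return [f"/done/{jobid}" for _, jobid in reversed(self.top)]


done = DoneFolder(n=5)
for i in range(1, 7):
    done.add("user1", f"job{i}")

print(done.recent())    # top-level items, newest first
print(done.archived)    # items that were moved on overflow
```

With N=5, adding the 6th job is the first call that moves anything, which matches the example: job1 leaves the top level and the five most recent jobs stay directly under /done.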

Future steps :
# As of today, we have jobhistory files directly dumped in the done folder. We might want
to move these files into the format we want (for a good user experience). Maybe some kind of
offline admin tool can help here (maybe under mradmin?). It might make sense to name the final
jobhistory file (leaf level) as $username_$jobid_$finishtime. This will enable us to restructure
job history files across various formats.
# There should be some way to find out which regex/format was used, given the jobtracker start
time (which is one of the components in jobid). To make it easier for clients, maybe the log
files related to jobhistory updates can be published, or the JobTracker should be in a position
to answer this.
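
A rough sketch of how a client (or the webui widget from comment #5) might combine the two future steps: look up the format that was active for a job from the jobtracker start time embedded in its jobid, then build the file path from _username_, _jobid_ and _finishtime_. Everything here is an assumption for illustration: the `FORMAT_LOG` entries, the `%yyyy/%mm/%dd/%user` token names, the helper names, and the treatment of finishtime as milliseconds since the epoch in UTC. Only the leaf name $username_$jobid_$finishtime and the job_<jobtracker start>_<sequence> shape of the jobid come from the discussion above.

```python
import bisect
from datetime import datetime, timezone

# Hypothetical log of format changes, sorted by jobtracker start identifier
# (fixed-width yyyyMMddHHmm strings, so string order equals time order).
FORMAT_LOG = [
    ("201001010000", "%user"),
    ("201006010000", "%yyyy/%mm/%dd/%user"),
]


def jobtracker_start(jobid):
    """Extract the jobtracker start component from job_<start>_<sequence>."""
    prefix, start, _sequence = jobid.split("_")
    if prefix != "job":
        raise ValueError(f"not a job id: {jobid}")
    return start


def format_for(jobid, log=FORMAT_LOG):
    """Return the folder format that was active when the job's jobtracker started."""
    starts = [start for start, _ in log]
    i = bisect.bisect_right(starts, jobtracker_start(jobid)) - 1
    if i < 0:
        raise LookupError(f"no format logged for {jobid}")
    return log[i][1]


def history_path(user, jobid, finish_ms):
    """Expand the folder format and append the leaf name user_jobid_finishtime."""
    finish = datetime.fromtimestamp(finish_ms / 1000, tz=timezone.utc)
    folder = format_for(jobid)
    for token, value in (
        ("%yyyy", f"{finish.year:04d}"),
        ("%mm", f"{finish.month:02d}"),
        ("%dd", f"{finish.day:02d}"),
        ("%user", user),
    ):
        folder = folder.replace(token, value)
    return f"{folder}/{user}_{jobid}_{finish_ms}"


print(history_path("amar", "job_201006080816_0001", 1275984976000))
```

Because the leaf name carries all three values, an offline tool could recompute the target folder for any file under a new format without consulting the JobTracker. Note that usernames and jobids may themselves contain underscores, so a tool that parses the leaf name back into its parts would need a rule for that.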

> Improve the way job history files are managed
> ---------------------------------------------
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
problems when there is a need to search the history folder (job-recovery etc). It would be
nice if we group all the jobs under a _user_ folder. So all the jobs for user _amar_ will
go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
date, jobname_ etc., but using _username_ will make the search much more efficient and also
will not result in namespace explosion.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
