hadoop-mapreduce-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Thu, 10 Jun 2010 22:53:17 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614

Chris Douglas commented on MAPREDUCE-323:

The scope of this issue has not been well defined. The designs are arguing about the correct
subset of a database to implement for JobHistory, leaving a wide range of known (and as Allen
points out, unknown) use cases ill served. This will not converge quickly.

For purposes of consensus, this issue is a bug; the _existing_ functionality is not handled
efficiently. It should go without saying that the design should not be over-specific to today's
use cases, but the issue's focus should remain on solving the problems cited and servicing
the use cases already in the system. This is a misbehaving component, not a project implementing
a small database in HDFS. Perhaps the title should change to reflect this.

There are 3 operations to support (please amend as necessary):
# Lookup by JobID. This should not be worse than O(log n) (and should be O(1)), as it is
a frequent operation.
# Find a set of jobs run by a particular user
# Find a set of jobs with names matching a regex

(2) and (3) can require a scan, but the cost should be bounded. If there are common operator
activities (like archiving old history, etc) then the layout should support that, but arbitrary
queries are out of scope.
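
As a rough illustration of the three operations and the bounded-scan requirement, here is a
minimal in-memory sketch. It is not code from any patch on this issue: the class
JobHistoryIndex, its Entry record, and the maxScanned parameter are all hypothetical names
chosen for the example, and a real implementation would work against files in the history
directory rather than a list in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical index for illustration only; not the JobTracker's JobHistory code.
class JobHistoryIndex {
    // Minimal stand-in for a history entry (assumed fields).
    static final class Entry {
        final String jobId;
        final String user;
        final String jobName;

        Entry(String jobId, String user, String jobName) {
            this.jobId = jobId;
            this.user = user;
            this.jobName = jobName;
        }
    }

    private final Map<String, Entry> byJobId = new HashMap<>();
    private final List<Entry> inArrivalOrder = new ArrayList<>();

    void add(Entry e) {
        byJobId.put(e.jobId, e);
        inArrivalOrder.add(e);
    }

    // (1) Lookup by JobID: a hash lookup, expected O(1).
    Entry lookup(String jobId) {
        return byJobId.get(jobId);
    }

    // (2) Jobs run by a user: scan newest first, examining at most maxScanned entries.
    List<Entry> findByUser(String user, int maxScanned) {
        return scan(maxScanned, e -> e.user.equals(user));
    }

    // (3) Jobs whose whole name matches the regex, with the same bound on the scan.
    List<Entry> findByName(String regex, int maxScanned) {
        Pattern p = Pattern.compile(regex);
        return scan(maxScanned, e -> p.matcher(e.jobName).matches());
    }

    // Walks backwards from the newest entry and stops after maxScanned entries.
    private List<Entry> scan(int maxScanned, Predicate<Entry> matches) {
        List<Entry> hits = new ArrayList<>();
        int last = inArrivalOrder.size() - 1;
        for (int i = last; i >= 0 && last - i < maxScanned; i--) {
            Entry e = inArrivalOrder.get(i);
            if (matches.test(e)) {
                hits.add(e);
            }
        }
        return hits;
    }
}
```

The point of the sketch is only that (1) never scans, while (2) and (3) pay for at most
maxScanned entries regardless of how large the history grows.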

The problem with the flat hierarchy is, obviously, the cost of listing files in both the
JobTracker and the NameNode. This can be ameliorated, somewhat, by HDFS-1091 and HDFS-985, but
further optimizations/caching are possible if one can assume that recent entries are more

format looks sound to me. Amar identified many complexities in implementing the configurable-schema,
mini-database proposal; in my opinion, while the solutions are feasible, the virtues of
a simpler fix for this issue outweigh the costs of solving those problems.

I particularly like the idea of bounding scans of JobHistory to _n_ entries, unless the user
requests a deeper search. Caching recent entries, metadata about which subdirectories are
sufficient for _n_ entries, etc. are all reasonable optimizations, but adopting the new layout
should be sufficient for this issue. Agreed?
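
To make the "which subdirectories are sufficient for _n_ entries" idea concrete, here is a
small sketch. It assumes, purely for illustration, that history files live in subdirectories
that can be ordered newest first and that a per-subdirectory entry count is available as
cached metadata; the ScanPlanner class and the bucket names are hypothetical and do not
describe the layout proposed on this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
class ScanPlanner {
    // countsNewestFirst maps subdirectory name -> entry count, inserted newest first.
    // Returns the shortest newest-first prefix of subdirectories holding at least n
    // entries, or every subdirectory if fewer than n entries exist in total.
    static List<String> subdirsFor(LinkedHashMap<String, Integer> countsNewestFirst, int n) {
        List<String> chosen = new ArrayList<>();
        int covered = 0;
        for (Map.Entry<String, Integer> e : countsNewestFirst.entrySet()) {
            if (covered >= n) {
                break; // enough entries already covered; older subdirectories are skipped
            }
            chosen.add(e.getKey());
            covered += e.getValue();
        }
        return chosen;
    }
}
```

A default query would list only the subdirectories returned for the default _n_; a user
asking for a deeper search would simply pass a larger _n_.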

> Improve the way job history files are managed
> ---------------------------------------------
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
problems when there is a need to search the history folder (job-recovery etc). It would be
nice if we group all the jobs under a _user_ folder. So all the jobs for user _amar_ will
go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
date, jobname_ etc but using _username_ will make the search much more efficient and also
will not result in a namespace explosion.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
