hadoop-mapreduce-issues mailing list archives

From "Dick King (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed
Date Wed, 02 Jun 2010 19:34:43 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874758#action_12874758

Dick King commented on MAPREDUCE-323:

If the cluster configuration calls for any time stamps in the file names, we have to create them.
We'll do this the first time we make a filename for a given job.

Having done that, we'll have a map mapping job serial numbers to directory segments [which
we will intern; there will be many duplicates].

Having done _that_, we'll keep 250K of these; we'll drop the oldest one when adding
a new one would otherwise exceed that limit.  We'll therefore use a {{TreeMap}}.  I
expect about 20-40 bytes per entry; 16 bytes for each tree node, and 8 or 16 for the key which
would be an {{Integer}}.  Recall that the directory segments are interned and would essentially

This table only exists if there is a time stamp operator in the format string.  
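
A minimal sketch of the bounded table described above, for illustration only. The class and method names are hypothetical and not taken from any patch. It also assumes that "oldest" means the lowest job serial number, which holds only if serial numbers increase over time.

```java
import java.util.TreeMap;

/**
 * Hypothetical sketch: a size-bounded map from job serial number to an
 * interned directory segment. Names here are illustrative assumptions.
 */
final class DirectorySegmentCache {
    private final int maxEntries;
    // TreeMap keeps keys sorted, so the first entry has the lowest serial number.
    private final TreeMap<Integer, String> segments = new TreeMap<>();

    DirectorySegmentCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    /** Stores the segment for a job, then evicts lowest-numbered jobs while over the limit. */
    synchronized void put(int jobSerial, String segment) {
        // intern() lets the many duplicate segment strings share one instance.
        segments.put(jobSerial, segment.intern());
        while (segments.size() > maxEntries) {
            segments.pollFirstEntry();
        }
    }

    /** Returns the segment for a job, or null if it was never stored or has been evicted. */
    synchronized String get(int jobSerial) {
        return segments.get(jobSerial);
    }

    synchronized int size() {
        return segments.size();
    }
}
```

With a limit of 250K entries and the 20-40 bytes per entry estimated above, such a table would occupy roughly 5-10 MB, not counting the shared segment strings.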

> Improve the way job history files are managed
> ---------------------------------------------
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
> Today all the jobhistory files are dumped in one _job-history_ folder. This can cause
> problems when there is a need to search the history folder (job-recovery etc). It would be
> nice if we grouped all the jobs under a _user_ folder. So all the jobs for user _amar_ will
> go in _history-folder/amar/_. Jobs can be categorized using various features like _jobid,
> date, jobname_ etc but using _username_ will make the search much more efficient and also
> will not result in namespace explosion.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
