hadoop-mapreduce-issues mailing list archives

From "Sharad Agarwal (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-864) Enhance JobClient API implementations to look at history files to get information about jobs that are not in memory
Date Fri, 04 Sep 2009 12:44:57 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751445#action_12751445 ]

Sharad Agarwal commented on MAPREDUCE-864:

Had an offline discussion with Devaraj and Hemanth. Apart from the issue mentioned above,
one more issue was identified with this JIRA: consistency. This fallback to HDFS can't
be applied to Job data alone. To be consistent, all job client APIs should return data from HDFS
if it is not found in the JobTracker's memory. Consider an API like getJob(Jobid). To be consistent,
it should also look in HDFS for completed jobs if the data is not in the JobTracker. Looking into the
HDFS completed jobs folder and building up the job structures *efficiently* is a non-trivial
thing to do at this point.

So we agree that a better approach at this point would be:

1) Retain the contract that job clients will *only* see information that is in the JobTracker's
memory. Clients will get the very basic information about completed jobs from the JobTracker's
retired cache (MAPREDUCE-817).
2) Clients that need to drill down into completed jobs' *TASK* level information will need to use
the History parser. The assumption here is that such clients will be very few and will mostly
want to analyze completed jobs. So it is better for them to use the History
parser directly and keep the job client interface clean.
3) The only minor concern here is that many clients may just need to look at the counters, which
are currently not cached in the retired job info. They would have to go through the History parser
path to retrieve them. There should be an easy way to get those. The proposal is to add counters
to the retired job cache. The idea is to cache only the job level information, and not any
task level information, in the retired jobs cache. A quick estimate of the memory consumption: assuming
100 counters per job and 200 bytes per counter, 1000 retired jobs come to 100*200*1000 bytes
= 20 MB, which is quite manageable.
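
To make the proposal concrete, here is a rough Java sketch of a bounded retired-jobs cache that keeps only job level data (including counters), together with the memory estimate above. Everything in it (RetiredJobCacheSketch, RetiredJobInfo, retire, getCounters, estimateBytes) is a hypothetical name chosen for illustration and is not the JobTracker's actual code or API. It also assumes that the oldest retired job is evicted first, which the discussion above does not specify.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: every name here is hypothetical, not Hadoop's API.
class RetiredJobCacheSketch {

    /** Job level summary of a retired job; deliberately holds no task level data. */
    static final class RetiredJobInfo {
        final String jobId;
        final Map<String, Long> counters;

        RetiredJobInfo(String jobId, Map<String, Long> counters) {
            this.jobId = jobId;
            // Defensive copy so cached counters cannot change after retirement.
            this.counters = Collections.unmodifiableMap(new HashMap<>(counters));
        }
    }

    private final Map<String, RetiredJobInfo> retired;

    RetiredJobCacheSketch(final int maxRetiredJobs) {
        // Insertion-ordered map that drops the oldest entry once the cap is exceeded
        // (assumed eviction policy).
        this.retired = new LinkedHashMap<String, RetiredJobInfo>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, RetiredJobInfo> eldest) {
                return size() > maxRetiredJobs;
            }
        };
    }

    /** Records the job level summary when a job leaves the in-memory job table. */
    void retire(RetiredJobInfo info) {
        retired.put(info.jobId, info);
    }

    /**
     * Returns the cached counters, or empty if the job is no longer cached.
     * An empty result means the caller has to use the history parser instead.
     */
    Optional<Map<String, Long>> getCounters(String jobId) {
        RetiredJobInfo info = retired.get(jobId);
        return info == null ? Optional.empty() : Optional.of(info.counters);
    }

    int retiredCount() {
        return retired.size();
    }

    /** Back-of-the-envelope size of the cached counters, in bytes. */
    static long estimateBytes(long countersPerJob, long bytesPerCounter, long retiredJobs) {
        return countersPerJob * bytesPerCounter * retiredJobs;
    }

    public static void main(String[] args) {
        // 100 counters * 200 bytes * 1000 jobs = 20,000,000 bytes, i.e. about 20 MB.
        System.out.println(estimateBytes(100, 200, 1000) + " bytes");
    }
}
```

A client that gets an empty result from getCounters would fall back to the History parser, which matches the contract that the job client only reports what the JobTracker holds in memory.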

> Enhance JobClient API implementations to look at history files to get information about
> jobs that are not in memory
> -------------------------------------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-864
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-864
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: jobtracker
>            Reporter: Devaraj Das
>            Assignee: Sharad Agarwal
>             Fix For: 0.21.0
> MAPREDUCE-817 added an API to get the JobHistory URL from the JobTracker. This is useful
> in two ways:
> 1) Users can use this API to get the URL, copy the history files to their local disk,
> and process them
> 2) APIs like JobSubmissionProtocol.getJobCounters can read a part of the history file
> and then return the information to the caller (if the job is not in JT memory). This
> would mimic most of the CompletedJobsStatusStore functionality.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
