hadoop-mapreduce-issues mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1533) reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects
Date Thu, 01 Apr 2010 07:37:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852271#action_12852271 ]

Hemanth Yamijala commented on MAPREDUCE-1533:

A few comments on the patch:

- Move TaskSchedulingContext.JOB_SCHEDULING_INFO_FORMAT_STRING to JobSchedulingInfoHolder
- QueueSchedulingContext does not seem to be the right place to define JobSchedulingInfoHolder.
The container class manages data related to queues, and not a specific Job. Maybe it should
be a separate class.
- By ensuring JobSchedulingInfoHolder.toString is called outside heartbeats, we are addressing
the core problem in the issue. But a few calls to this method still occur from within
a JobTracker lock, for example from CLI APIs such as getAllJobs, though they occur very infrequently
compared to heartbeats. With that context, should we implement a more optimized version
of toString, maybe using StringBuilder (as was suggested elsewhere)?
- The changes in JobQueue.updateStatsOnRunningJob and TaskDataView.getSlotsOccupied can be
avoided, while still meeting the intent of the change, by changing the algorithm in TaskDataView.getSlotsPerTask.
This idea is based on preliminary patches in MAPREDUCE-1354. There, we optimized
getNumSlotsPer{Map|Reduce} to be unsynchronized, by making the corresponding variables volatile.
Hence, getSlotsPerTask can now be implemented as:
int getSlotsPerTask(JobInProgress job) {
  return job.getNumSlotsPerMap();
}
and likewise for reduces.
This reduces the number of changes in the patch.
- I would suggest a few documentation changes to better document the contract of the scheduling
-- Document that getSchedulingInfo returns a stringified representation of the job scheduling
info set in setJobSchedulingInfo.
-- Document that getJobSchedulingInfo will return the stringified representation of the job
scheduling info on the client, but the actual object on the server. (Note that we are deserializing
the stringified representation of the scheduling info on the client, not the actual object.)
-- Document the intent of setJobSchedulingInfo - i.e. it is for optimization of heartbeats
and allows lazy construction of the stringified representation on a need basis. This is an
important design choice to capture, I think.
- Do we need mapred.JobStatus.{get|set}JobSchedulingInfo? Can we not define them only in
mapreduce.JobStatus?
- I don't think we need the cast in JobStatus.readFields that casts the scheduling info string
to an Object, because a String can be assigned to an Object reference without a cast.
- Given that we are going to store the stringified representation of the scheduling info on the
client, should we keep the name of the JobStatus variable as schedulingInfo?
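
To make the lazy-construction and StringBuilder suggestions above concrete, here is a minimal sketch of what such a holder could look like. The field names, constructor, and output layout are assumptions made for illustration; they are not taken from the patch or from JOB_SCHEDULING_INFO_FORMAT_STRING.

```java
// Hypothetical sketch: a holder that stores raw counters and builds the
// human-readable string only when toString() is called.
// Field names and the output layout are illustrative assumptions.
final class JobSchedulingInfoHolder {
    private final int runningMaps;
    private final int mapSlots;
    private final int runningReduces;
    private final int reduceSlots;

    JobSchedulingInfoHolder(int runningMaps, int mapSlots,
                            int runningReduces, int reduceSlots) {
        this.runningMaps = runningMaps;
        this.mapSlots = mapSlots;
        this.runningReduces = runningReduces;
        this.reduceSlots = reduceSlots;
    }

    // Plain appends avoid the format-string parsing that String.format()
    // performs on every call.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(96);
        sb.append(runningMaps).append(" running map tasks using ")
          .append(mapSlots).append(" map slots, ")
          .append(runningReduces).append(" running reduce tasks using ")
          .append(reduceSlots).append(" reduce slots");
        return sb.toString();
    }
}
```

With this shape, the heartbeat path would only construct the holder from counters it already has, and the string would be built when a client-facing call actually asks for it.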

> reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects
> -----------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-1533
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1533
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Rajesh Balamohan
>            Assignee: Amar Kamat
>         Attachments: mapreduce-1533-v1.4.patch
> When short jobs are executed in hadoop with OutOfBandHeardBeat=true, JT executes heartBeat()
method heavily. This internally makes a call to CapacityTaskScheduler.updateQSIObjects().

> CapacityTaskScheduler.updateQSIObjects() internally calls String.format() for setting
the job scheduling information. Based on the datastructure size of "jobQueuesManager" and
"queueInfoMap", the number of times String.format() gets executed becomes very high. String.format()
internally does pattern matching, which turns out to be very heavy. (This was revealed while
profiling JT. Almost 57% of time was spent in CapacityScheduler.assignTasks(), out of which
String.format() took 46%.)
> Would it be possible to do String.format() only at the time of invoking JobInProgress.getSchedulingInfo?
This might reduce the pressure on JT while processing heartbeats. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
