hadoop-common-dev mailing list archives

From "Matei Zaharia (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4664) Parallelize job initialization
Date Sun, 16 Nov 2008 19:21:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated HADOOP-4664:
----------------------------------

    Attachment: parallel-job-init-v1.patch

Here is a patch for this issue. It adds multiple job init threads to EagerTaskInitializationListener,
which the default scheduler (JobQueueTaskScheduler) and the fair scheduler use to initialize
tasks. The capacity scheduler initializes jobs in its assignTasks method, which runs in an
RPC handler thread, so it can already initialize jobs in parallel (although it may be worth
giving it a separate set of job init threads so that the RPC handlers don't block waiting
for a job to initialize).
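
As a rough illustration of the multi-threaded init approach described above, here is a minimal
sketch. It is not taken from the patch: SketchJob and ParallelJobInitSketch are hypothetical
stand-ins rather than Hadoop classes, and the sketch assumes a plain fixed-size
java.util.concurrent thread pool is an acceptable model of the job init threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a job; this is NOT Hadoop's job class.
class SketchJob {
    final String id;
    volatile boolean initialized = false;

    SketchJob(String id) {
        this.id = id;
    }

    // Placeholder for the slow, partly IO-bound initialization work.
    void initTasks() {
        initialized = true;
    }
}

// Hypothetical listener that hands each submitted job to a fixed-size pool,
// so several jobs can be initialized at the same time.
class ParallelJobInitSketch {
    private final ExecutorService initPool;
    private final AtomicInteger completed = new AtomicInteger();

    ParallelJobInitSketch(int numThreads) {
        this.initPool = Executors.newFixedThreadPool(numThreads);
    }

    // Queues the job for initialization and returns immediately,
    // so the caller never blocks on a slow init.
    void jobAdded(SketchJob job) {
        initPool.execute(() -> {
            try {
                job.initTasks();
            } finally {
                completed.incrementAndGet();
            }
        });
    }

    // Stops accepting new jobs and waits for queued initializations to finish.
    // Returns false on timeout or interruption.
    boolean shutdownAndWait(long timeoutSeconds) {
        initPool.shutdown();
        try {
            return initPool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    int completedCount() {
        return completed.get();
    }
}
```

In this sketch the submitting thread only enqueues work, which is the same property that
would keep RPC handler threads from blocking if the capacity scheduler used its own init pool.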

This patch also makes CachedDNSToSwitchMapping use a ConcurrentHashMap instead of a TreeMap
for its rack-resolution cache, to avoid errors caused by concurrent writes. (With
ConcurrentHashMap, cache-hit reads require no locks.) Apart from concurrent writes to the
resolution cache, I did not see any other potentially conflicting operations in initTasks,
but I'd really welcome a second pair of eyes on it.
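
To make the cache change concrete, here is a minimal sketch of a rack-resolution cache backed
by a ConcurrentHashMap. It is not the patched Hadoop code: RackCacheSketch and its resolver
function are hypothetical, and the sketch assumes it is acceptable for two threads to
occasionally resolve the same host twice, with the first stored result winning.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache; NOT Hadoop's actual rack-resolution class.
class RackCacheSketch {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    // Assumed stand-in for the slow lookup (for example a script or DNS call).
    private final Function<String, String> rawResolver;

    RackCacheSketch(Function<String, String> rawResolver) {
        this.rawResolver = rawResolver;
    }

    String resolve(String host) {
        // Cache hit: a plain get() on ConcurrentHashMap takes no lock.
        String rack = cache.get(host);
        if (rack == null) {
            // Cache miss: resolve outside any lock, then publish atomically.
            rack = rawResolver.apply(host);
            String previous = cache.putIfAbsent(host, rack);
            if (previous != null) {
                rack = previous; // another thread stored a value first
            }
        }
        return rack;
    }

    int size() {
        return cache.size();
    }
}
```

A TreeMap shared between several init threads could be corrupted by simultaneous puts, whereas
putIfAbsent here is safe to call from any number of threads.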

The number of job init threads is configurable through the mapred.jobinit.threads property.
I set the default to 4, but let me know if there are any objections.
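
For illustration only, the lookup of this setting could look like the sketch below. It uses
java.util.Properties as a stand-in for Hadoop's configuration object, and JobInitConfigSketch
is a hypothetical name. Falling back to the default on a missing, non-numeric, or non-positive
value is an assumption of this sketch, not something the patch is known to do.

```java
import java.util.Properties;

// Hypothetical helper; java.util.Properties stands in for Hadoop's configuration.
class JobInitConfigSketch {
    static final String KEY = "mapred.jobinit.threads";
    static final int DEFAULT_THREADS = 4;

    // Returns the configured thread count, or the default of 4 when the
    // property is missing, not a number, or not positive (assumed behavior).
    static int jobInitThreads(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_THREADS;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            return n > 0 ? n : DEFAULT_THREADS;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }
}
```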

> Parallelize job initialization
> ------------------------------
>
>                 Key: HADOOP-4664
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4664
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Matei Zaharia
>         Attachments: parallel-job-init-v1.patch
>
>
> The job init thread currently initializes one job at a time. However, this is a lengthy
> and partly IO-bound process because all of the job's block locations need to be resolved through
> the namenode and a map of them needs to be built. It can take tens of seconds. As a result,
> the cluster sometimes initializes jobs too slowly for full utilization to be achieved, if
> there are many small jobs queued up. It would be better to have a pool of threads that initialize
> multiple jobs in parallel. One thing to be careful of, however, is not causing deadlocks or
> holding locks for too long in these threads.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

