hadoop-common-dev mailing list archives

From Rod Taylor <...@sitesell.com>
Subject Re: Job scheduling (Re: Unable to run more than one job concurrently)
Date Sun, 21 May 2006 18:22:14 GMT

> > (2) Have a per-job total task count limit. Currently, we establish the
> > number of tasks each node runs, and how many map or reduce tasks we have
> > total in a given job. But it would be great if we could set a ceiling on the
> > number of tasks that run concurrently for a given job. This may help with
> > Andrzej's fetcher (since it is bandwidth constrained, maybe fewer concurrent
> > jobs would be fine?).
> I like this idea.  So if the highest-priority job is already running at 
> its task limit, then tasks can be run from the next highest-priority 
> job.  Should there be separate limits for maps and reduces?

Limits for map and reduce are useful for a job class, not so much for a
specific job instance. Data collection may be best achieved with 15
parallel maps pulling data from remote data sources, but the fact that
3 of them come from one job and 12 from another isn't important. What
matters is that those 15 make the best use of the available resources.
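
Roughly what I have in mind, as a purely hypothetical sketch (none of
these classes exist in Hadoop; the names are made up for illustration):
the scheduler tracks running maps per job class, and when the
highest-priority class is already at its ceiling it simply falls
through to the next class.

import java.util.LinkedList;
import java.util.List;

// Hypothetical job class: all jobs of this kind share one ceiling on
// concurrently running maps, regardless of which individual job owns them.
class JobClass {
    final String name;            // e.g. "data-collection"
    final int maxConcurrentMaps;  // ceiling for the whole class
    int runningMaps;              // maps currently running across the class
    final List<Runnable> pendingMaps = new LinkedList<Runnable>();

    JobClass(String name, int maxConcurrentMaps) {
        this.name = name;
        this.maxConcurrentMaps = maxConcurrentMaps;
    }

    boolean atLimit() { return runningMaps >= maxConcurrentMaps; }
}

// Hypothetical scheduler: classes are ordered highest priority first.
class ClassAwareScheduler {
    Runnable nextMapTask(List<JobClass> classesByPriority) {
        for (JobClass jc : classesByPriority) {
            if (!jc.atLimit() && !jc.pendingMaps.isEmpty()) {
                jc.runningMaps++;
                return jc.pendingMaps.remove(0);
            }
            // At its ceiling or idle: fall through to the next class.
        }
        return null; // nothing runnable at the moment
    }
}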

A different priority for map and reduce would also be useful. Often,
completing data collection within a set timeframe is far more important
than reducing the data for storage or post-processing, particularly
when the collection step is pulling from a remote resource.

Data warehousing activities often require that data collection occur
once a night between set hours (very high priority) but processing of
the data collected can occur any time until the end of the quarter.
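
To make that concrete, here is a hypothetical configuration sketch;
these property names do not exist in Hadoop, they only show how
per-phase priorities and a collection window might be expressed:

import org.apache.hadoop.mapred.JobConf;

public class NightlyCollectionJob {
    public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setJobName("nightly-data-collection");

        // Hypothetical properties: maps (remote data collection) run at
        // high priority inside a fixed nightly window...
        conf.set("mapred.map.priority", "VERY_HIGH");
        conf.set("mapred.map.window", "01:00-05:00");

        // ...while reduces (storage / post-processing) run at low
        // priority whenever spare capacity exists, any time before
        // quarter end.
        conf.set("mapred.reduce.priority", "LOW");
        return conf;
    }
}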

For Nutch, with both of the above you should be able to keep N fetch
map tasks running at all times, with everything else treated as
secondary and run within the remaining resources. That would let the
fetcher use 100% of the available remote bandwidth.
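
A rough slot-accounting sketch of that reservation (again hypothetical,
nothing like this exists today): always keep room for N fetch maps and
give whatever is left over to lower-priority work.

// Hypothetical reservation for the Nutch fetcher.
class FetchReservation {
    final int reservedFetchMaps;  // the "N" fetch maps to keep running

    FetchReservation(int n) { this.reservedFetchMaps = n; }

    // Slots available to everything that is not a fetch map.
    int slotsForOtherWork(int totalSlots, int runningFetchMaps) {
        int stillReserved = Math.max(0, reservedFetchMaps - runningFetchMaps);
        return Math.max(0, totalSlots - runningFetchMaps - stillReserved);
    }
}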

Rod Taylor <rbt@sitesell.com>
