hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Job scheduling (Re: Unable to run more than one job concurrently)
Date Mon, 22 May 2006 04:11:23 GMT

You have no guarantee that your time-sensitive data is safe / committed
until after your reduce has completed.  If you care about reliability or
data integrity, simply run a full map-reduce job in your collection
window and store the result in HDFS.

Do the expensive post-processing you have a quarter to complete as
another job.  Being able to preempt a long job with a time-sensitive
short job seems to be your real requirement.

On May 21, 2006, at 11:22 AM, Rod Taylor wrote:

>>> (2) Have a per-job total task count limit. Currently, we establish
>>> the number of tasks each node runs, and how many map or reduce tasks
>>> we have total in a given job. But it would be great if we could set a
>>> ceiling on the number of tasks that run concurrently for a given job.
>>> This may help with Andrzej's fetcher (since it is bandwidth
>>> constrained, maybe fewer concurrent jobs would be fine?).
>> I like this idea.  So if the highest-priority job is already running
>> at its task limit, then tasks can be run from the next highest-priority
>> job.  Should there be separate limits for maps and reduces?
> Limits for map and reduce are useful for a job class, not so much for a
> specific job instance.  Data collection may be best achieved with 15
> parallel maps pulling data from remote data sources, but the fact that
> there are 3 from one job and 12 from another isn't important.  What's
> important is that the 15 make the best use of resources.
> A different priority for map and reduce would also be useful.  Many
> times data collection within a set timeframe is far more important than
> reducing it for storage or post-processing, particularly when the
> collection is retrieving data from a remote resource.
> Data warehousing activities often require that data collection occur
> once a night between set hours (very high priority) but processing of
> the data collected can occur any time until the end of the quarter.
> For Nutch, with both of the above you should be able to keep N Fetch
> Map processes running at all times, with everything else being
> secondary within the remaining resources.  This could make use of 100%
> of available remote bandwidth.
> -- 
> Rod Taylor <rbt@sitesell.com>
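
Since neither a per-job concurrent-task ceiling nor separate map/reduce
priorities existed in the JobTracker at the time, the following is only a toy
illustration of the scheduling rule being proposed in the quoted message.  The
Job class and scheduler here are hypothetical, not Hadoop code:

    // Toy sketch of the proposed rule: each job carries its own ceiling on
    // concurrently running maps/reduces and its own map and reduce priorities.
    import java.util.List;

    class Job {
        final String name;
        final int mapPriority;          // higher = more urgent
        final int reducePriority;       // may differ from mapPriority
        final int maxConcurrentMaps;    // per-job ceiling on running maps
        final int maxConcurrentReduces; // per-job ceiling on running reduces
        int runningMaps, runningReduces;
        int pendingMaps, pendingReduces;

        Job(String name, int mapPriority, int reducePriority,
            int maxConcurrentMaps, int maxConcurrentReduces,
            int pendingMaps, int pendingReduces) {
            this.name = name;
            this.mapPriority = mapPriority;
            this.reducePriority = reducePriority;
            this.maxConcurrentMaps = maxConcurrentMaps;
            this.maxConcurrentReduces = maxConcurrentReduces;
            this.pendingMaps = pendingMaps;
            this.pendingReduces = pendingReduces;
        }
    }

    class ToyScheduler {
        // A free map slot goes to the highest-map-priority job that still has
        // pending maps and is below its concurrent-map ceiling.  If the top
        // job is at its limit, the slot falls through to the next job.
        static Job pickJobForMapSlot(List<Job> jobs) {
            Job best = null;
            for (Job j : jobs) {
                if (j.pendingMaps > 0 && j.runningMaps < j.maxConcurrentMaps
                        && (best == null || j.mapPriority > best.mapPriority)) {
                    best = j;
                }
            }
            return best;  // null means no eligible job
        }

        // Same rule for a free reduce slot, but ordered by reduce priority,
        // so collection maps can outrank post-processing reduces.
        static Job pickJobForReduceSlot(List<Job> jobs) {
            Job best = null;
            for (Job j : jobs) {
                if (j.pendingReduces > 0 && j.runningReduces < j.maxConcurrentReduces
                        && (best == null || j.reducePriority > best.reducePriority)) {
                    best = j;
                }
            }
            return best;
        }
    }

With, say, a fetch job given a map priority above everything else and a
ceiling of 15 concurrent maps, roughly 15 fetch maps stay running whenever
fetch work is pending, and the remaining slots fall through to lower-priority
jobs, which is the behavior described above for Nutch.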
