From: Eric Baldeschwieler
Subject: Re: Job scheduling (Re: Unable to run more than one job concurrently)
Date: Sun, 21 May 2006 21:11:23 -0700
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org

?? You have no guarantee that your time-sensitive data is safe / committed
until after your reduce has completed. If you care about reliability or data
integrity, simply run a full map-reduce job in your collection window and
store the result in HDFS. Do the expensive post-processing, which you have a
quarter to complete, as another job.

Being able to preempt a long job with a time-sensitive short job seems to
really be your requirement.
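In concrete terms, here is a minimal sketch of that two-job split, using the
classic org.apache.hadoop.mapred API (the paths and job names are made up,
the identity mapper/reducer stand in for whatever collection and rollup
classes you actually use, and the exact helper method names have shifted
between Hadoop versions):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class CollectThenRollup {
        public static void main(String[] args) throws IOException {
            // Job 1: time-sensitive collection.  Run it to completion inside
            // the nightly window; once runJob() returns, the reduce output is
            // committed to HDFS and the collected data is safe.
            JobConf collect = new JobConf(CollectThenRollup.class);
            collect.setJobName("nightly-collect");
            collect.setMapperClass(IdentityMapper.class);   // your collection mapper here
            collect.setReducerClass(IdentityReducer.class);
            FileInputFormat.setInputPaths(collect, new Path("/feeds/incoming"));
            FileOutputFormat.setOutputPath(collect, new Path("/warehouse/raw/2006-05-21"));
            JobClient.runJob(collect);        // blocks until the job completes

            // Job 2: expensive post-processing over the committed output.  It
            // has until the end of the quarter to finish, so it runs as a
            // separate job outside the collection window.
            JobConf rollup = new JobConf(CollectThenRollup.class);
            rollup.setJobName("quarterly-rollup");
            rollup.setMapperClass(IdentityMapper.class);    // your rollup mapper here
            rollup.setReducerClass(IdentityReducer.class);  // your rollup reducer here
            FileInputFormat.setInputPaths(rollup, new Path("/warehouse/raw/2006-05-21"));
            FileOutputFormat.setOutputPath(rollup, new Path("/warehouse/rollup/2006-Q2"));
            JobClient.runJob(rollup);
        }
    }

Because JobClient.runJob() blocks until the submitted job completes, the
collected data is committed before the window closes, and the rollup job can
be run, killed, or re-run at leisure afterwards.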
On May 21, 2006, at 11:22 AM, Rod Taylor wrote:

>>> (2) Have a per-job total task count limit. Currently, we establish
>>> the number of tasks each node runs, and how many map or reduce
>>> tasks we have in total for a given job. But it would be great if we
>>> could set a ceiling on the number of tasks that run concurrently
>>> for a given job. This may help with Andrzej's fetcher (since it is
>>> bandwidth constrained, maybe fewer concurrent tasks would be fine?).
>>
>> I like this idea. So if the highest-priority job is already running
>> at its task limit, then tasks can be run from the next
>> highest-priority job. Should there be separate limits for maps and
>> reduces?
>
> Limits for maps and reduces are useful for a job class, not so much
> for a specific job instance. Data collection may be best achieved
> with 15 parallel maps pulling data from remote data sources, but the
> fact that 3 of them belong to one job and 12 to another isn't
> important. What is important is that the 15 make the best use of the
> available resources.
>
> A different priority for map and reduce would also be useful. Many
> times data collection within a set timeframe is far more important
> than reducing the data for storage or post-processing, particularly
> when collection means retrieving it from a remote resource.
>
> Data warehousing activities often require that data collection occur
> once a night between set hours (very high priority), but processing
> of the collected data can happen any time before the end of the
> quarter.
>
> For Nutch, with both of the above you should be able to keep N Fetch
> map tasks running at all times, with everything else being secondary
> within the remaining resources. This could make use of 100% of the
> available remote bandwidth.
>
> --
> Rod Taylor
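For reference, the per-job ceiling and per-class priority behaviour being
asked for above could be sketched as the following slot-assignment rule.
This is purely illustrative -- it is not existing JobTracker code, and the
JobInfo fields are hypothetical bookkeeping, not a real Hadoop API:

    import java.util.List;

    class JobInfo {
        int priority;            // higher value = more urgent (e.g. the fetch job)
        int runningMaps;         // map tasks currently executing for this job
        int maxConcurrentMaps;   // proposed per-job ceiling, e.g. 15
        int pendingMaps;         // map tasks still waiting to be scheduled
    }

    class CeilingScheduler {
        /** Pick the job allowed to use a newly freed map slot, or null to leave it idle. */
        static JobInfo pickJobForMapSlot(List<JobInfo> jobs) {
            JobInfo best = null;
            for (JobInfo j : jobs) {
                if (j.pendingMaps == 0) continue;                   // nothing left to run
                if (j.runningMaps >= j.maxConcurrentMaps) continue; // already at its ceiling
                if (best == null || j.priority > best.priority) {
                    best = j;
                }
            }
            return best;   // highest-priority job below its ceiling with work pending
        }
    }

With the fetch job given top priority and a ceiling of, say, 15 maps, it
would keep 15 fetch maps running whenever it has work, while the remaining
slots fall through to lower-priority jobs; separate ceilings and priorities
for reduces would follow the same pattern.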