hadoop-common-user mailing list archives

From Amr Awadallah <...@cloudera.com>
Subject Re: Job Startup Time
Date Tue, 14 Jul 2009 00:13:41 GMT
Mu,

  Though it is not a very good excuse, Hadoop wasn't originally designed for 
interactive latency; it focused instead on large-scale throughput. That 
said, the Hadoop developer community is working on improving the startup 
time for map-reduce jobs. Owen/Arun made a number of custom changes for 
the terabyte sort benchmark which reduced the startup time to a couple 
of seconds, see:

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

   They promised that changes from that exercise will slowly make their 
way into Hadoop trunk.

Examples of other changes focused on improving job execution latency:

https://issues.apache.org/jira/browse/MAPREDUCE-463
https://issues.apache.org/jira/browse/HADOOP-6148
https://issues.apache.org/jira/browse/MAPREDUCE-318

  In the interim, if latency is very important for you, then you can 
take a look at HBase, which does support lookups at very high speeds 
(though map-reduce style jobs will still have a long startup time on top 
of it).

-- amr

Todd Lipcon wrote:
> Hi Mu,
>
> Small job overhead is something that has been worked on a bit in recent
> versions, but here's the gist of it (as best as I know, though I don't work
> much in this area of the code):
>
> - The JobTracker doesn't assign tasks forcefully to TaskTrackers. Instead,
> the TaskTrackers send heartbeats at a certain interval
> (MRConstants.HEARTBEAT_INTERVAL_MIN). The minimum interval is once every 3
> seconds. For every 100 nodes above 300, that interval increases by one
> second (MRConstants.CLUSTER_INCREMENT).
>
> - Because of this, each task from the JobTracker can take up to 3 seconds to
> get assigned to a TaskTracker.
>
> - I believe that the TaskTrackers also do not report Task Completion Events
> except as part of a Heartbeat. This means that after each task finishes,
> there can be another 3 second delay before the JobTracker finds out about
> it.
>
> - Though these things seem inefficient, the reasoning is that, in a large
> cluster of say 1000 nodes, the TTs could potentially overwhelm the
> JobTracker if the heartbeats were more frequent. With more nodes, the amount
> of time between a task being pending and a TT reporting a heartbeat is also
> likely to be small. Additionally, MapReduce is designed in general for large
> jobs where the amount of time spent in processing a task significantly
> eclipses the scheduling time.
>
> Given all of these delays, plus various amounts of time taken in copying
> your job JAR to and from HDFS, even an "empty" job can take many seconds.
> Around 20 sounds about right from my experience.
>
> Hope that helps
> -Todd
>
>
> On Sun, Jul 12, 2009 at 9:52 PM, Mu Qiao <qiaomuf@gmail.com> wrote:
>
>   
>> Hi, everyone
>>
>> I've tested the hadoop environment I've set up. I noticed that it takes 24s
>> to run a 2 mapper, 1 reducer job with empty input.
>> Is that a reasonable time for a do-nothing job? Why does it take so much time?
>>
>> Thanks
>>
>> --
>> Best wishes,
>> Qiao Mu
>> MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University
>> Department of Computer Science and Technology, Xi’an Jiaotong University
>> TEL: 15991676983
>> E-mail: qiaomuf@gmail.com
>>
>>     
>
>   
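
The heartbeat arithmetic Todd describes above can be sketched roughly as follows. This is only an illustration of the rule as stated in his message (3 second minimum, one extra second per 100 nodes above 300), not the actual JobTracker code. The Python names are made up for the example, and rounding up to the next whole second is an assumption.

```python
# Values taken from the description in this thread; the names are illustrative.
HEARTBEAT_INTERVAL_MIN_MS = 3 * 1000  # minimum heartbeat interval: 3 seconds
CLUSTER_INCREMENT = 100               # nodes per additional second of interval


def heartbeat_interval_ms(cluster_size):
    """Heartbeat interval in milliseconds for a cluster of the given size."""
    # Integer ceiling division; rounding up is an assumption of this sketch.
    seconds = -(-cluster_size // CLUSTER_INCREMENT)
    return max(1000 * seconds, HEARTBEAT_INTERVAL_MIN_MS)


def worst_case_heartbeat_wait_ms(cluster_size):
    """Upper bound on heartbeat-related waiting for a single task.

    One interval before the task is assigned, plus one more before the
    JobTracker hears that it completed.
    """
    return 2 * heartbeat_interval_ms(cluster_size)


if __name__ == "__main__":
    for nodes in (10, 300, 400, 1000):
        print(nodes, heartbeat_interval_ms(nodes), worst_case_heartbeat_wait_ms(nodes))
```

On a small cluster this puts up to 6 seconds of pure heartbeat waiting on each task, so a job whose map and reduce phases run back to back can spend a fair share of its 20-odd seconds just waiting on heartbeats, before counting the time to copy the job JAR.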
