From: Sriram Krishnan <sdsc.edu>
To: core-user@hadoop.apache.org
Subject: Strange Reduce Behavior
Date: Wed, 1 Apr 2009 22:31:35 -0700
Hi all,

I am new to this list, and relatively new to Hadoop itself, so if this question has been answered before, please point me to the right thread.

We are investigating the use of Hadoop for processing geospatial data. In its most basic form, our data is laid out in files, where every row has the format {index, x, y, z, ...}. I am writing some basic Hadoop programs for selecting data based on x and y values, and everything appears to work correctly. I have Hadoop 0.19.1 running in pseudo-distributed mode on a Linux box.

However, as an academic exercise, I began writing some code that simply reads every single line of my input file and does nothing else; I hoped to get a sense of how long it takes Hadoop/HDFS to read the entire data set. My map and reduce functions are as follows:

    public void map(LongWritable key, Text value,
                    OutputCollector output, Reporter reporter)
            throws IOException {
        // do nothing
        return;
    }

    public void reduce(Text key, Iterator values,
                       OutputCollector output, Reporter reporter)
            throws IOException {
        // do nothing
        return;
    }

My understanding is that the above map function will produce no intermediate key/value pairs, and hence the reduce function should take no time at all. However, when I run this code, Hadoop seems to spend an inordinate amount of time in the reduce phase. Here is the Hadoop output:

    09/04/01 20:11:12 INFO mapred.JobClient: Running job: job_200904011958_0005
    09/04/01 20:11:13 INFO mapred.JobClient:  map 0% reduce 0%
    09/04/01 20:11:21 INFO mapred.JobClient:  map 3% reduce 0%
    09/04/01 20:11:25 INFO mapred.JobClient:  map 7% reduce 0%
    ....
    09/04/01 20:13:17 INFO mapred.JobClient:  map 96% reduce 0%
    09/04/01 20:13:20 INFO mapred.JobClient:  map 100% reduce 0%
    09/04/01 20:13:30 INFO mapred.JobClient:  map 100% reduce 4%
    09/04/01 20:13:35 INFO mapred.JobClient:  map 100% reduce 7%
    ...
    09/04/01 20:14:05 INFO mapred.JobClient:  map 100% reduce 25%
    09/04/01 20:14:10 INFO mapred.JobClient:  map 100% reduce 29%
    09/04/01 20:14:15 INFO mapred.JobClient: Job complete: job_200904011958_0005
    09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
    09/04/01 20:14:15 INFO mapred.JobClient:   File Systems
    09/04/01 20:14:15 INFO mapred.JobClient:     HDFS bytes read=1787707732
    09/04/01 20:14:15 INFO mapred.JobClient:     Local bytes read=10
    09/04/01 20:14:15 INFO mapred.JobClient:     Local bytes written=932
    09/04/01 20:14:15 INFO mapred.JobClient:   Job Counters
    09/04/01 20:14:15 INFO mapred.JobClient:     Launched reduce tasks=1
    09/04/01 20:14:15 INFO mapred.JobClient:     Launched map tasks=27
    09/04/01 20:14:15 INFO mapred.JobClient:     Data-local map tasks=27
    09/04/01 20:14:15 INFO mapred.JobClient:   Map-Reduce Framework
    09/04/01 20:14:15 INFO mapred.JobClient:     Reduce input groups=1
    09/04/01 20:14:15 INFO mapred.JobClient:     Combine output records=0
    09/04/01 20:14:15 INFO mapred.JobClient:     Map input records=44967808
    09/04/01 20:14:15 INFO mapred.JobClient:     Reduce output records=0
    09/04/01 20:14:15 INFO mapred.JobClient:     Map output bytes=2
    09/04/01 20:14:15 INFO mapred.JobClient:     Map input bytes=1787601210
    09/04/01 20:14:15 INFO mapred.JobClient:     Combine input records=0
    09/04/01 20:14:15 INFO mapred.JobClient:     Map output records=1
    09/04/01 20:14:15 INFO mapred.JobClient:     Reduce input records=0

As you can see, the reduce phase takes a little more than a minute, which is about a third of the total execution time. However, the number of launched reduce tasks is 1, and the reduce input record count is 0. Why does it spend so long in the reduce phase if there are 0 input records to be read? Furthermore, if there is only 1 reduce task, how is Hadoop able to report incremental percentage completion for the reduce phase? Updating the number of reduce tasks with JobConf.setNumReduceTasks() has no effect on the parallelism of the map and reduce tasks.
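For what it's worth, the dataflow above can be modeled outside Hadoop. The sketch below is plain Java with no Hadoop dependency; the class and method names (NoOpMapReduce, run) are made up for illustration and are not Hadoop API. It runs a do-nothing map over some input lines, groups the intermediate pairs by key, and counts the records that would be handed to reduce, showing that a map which never calls the collector yields zero reduce input records, so the user reduce() is never invoked at all:

```java
import java.util.*;
import java.util.function.BiConsumer;

// Minimal in-memory model of the MapReduce dataflow (hypothetical names,
// no Hadoop dependency): a map() that collects nothing produces zero
// reduce input records.
public class NoOpMapReduce {

    // Mirror of the do-nothing map in the mail: never calls the collector.
    static void map(long key, String value, BiConsumer<String, String> collector) {
        // do nothing
    }

    // Runs map over every input line, groups intermediate pairs by key,
    // and returns the number of reduce input records.
    static int run(List<String> inputLines) {
        Map<String, List<String>> intermediate = new TreeMap<>();
        long offset = 0;
        for (String line : inputLines) {
            map(offset, line, (k, v) ->
                intermediate.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
            offset += line.length() + 1;
        }
        int reduceInputRecords = 0;
        for (Map.Entry<String, List<String>> group : intermediate.entrySet()) {
            // reduce(group.getKey(), group.getValue(), ...) would run here
            reduceInputRecords += group.getValue().size();
        }
        return reduceInputRecords;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1 10 20 30", "2 11 21 31", "3 12 22 32");
        System.out.println("reduce input records = " + run(lines));
        // prints: reduce input records = 0
    }
}
```

Whatever time Hadoop reports against "reduce" in such a job is therefore presumably framework work (fetching and merging the empty map outputs from each of the 27 map tasks) rather than user reduce() calls. If the goal is only to time the raw scan, setting JobConf.setNumReduceTasks(0) should make the job map-only and skip the reduce phase entirely.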
Another interesting aspect is that my Hadoop code that does a select on the input files based on x and y values runs faster than the code above: the select code has a map function that emits the selected rows as intermediate key/value pairs, while the reduce is pretty much an identity function. In fact, in this case I see parallel execution of the map and reduce tasks. I had expected the select code to be slower, because not only does it read every single line of input (just like the experiment above), it also does some writes based on the selection criteria.

Thanks in advance for any pointers!

Sriram

--
Sriram Krishnan, Ph.D.
San Diego Supercomputer Center
http://www.sdsc.edu/~sriram