Return-Path: Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: (qmail 24549 invoked from network); 26 Jun 2010 18:26:33 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 26 Jun 2010 18:26:33 -0000 Received: (qmail 78106 invoked by uid 500); 26 Jun 2010 18:26:30 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 78070 invoked by uid 500); 26 Jun 2010 18:26:29 -0000 Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-user@hadoop.apache.org Delivered-To: mailing list common-user@hadoop.apache.org Received: (qmail 78062 invoked by uid 99); 26 Jun 2010 18:26:29 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 26 Jun 2010 18:26:29 +0000 X-ASF-Spam-Status: No, hits=1.5 required=10.0 tests=RCVD_IN_DNSWL_NONE,RCVD_NUMERIC_HELO,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [216.145.54.173] (HELO mrout3.yahoo.com) (216.145.54.173) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 26 Jun 2010 18:26:20 +0000 Received: from SNV-EXBH01.ds.corp.yahoo.com (snv-exbh01.ds.corp.yahoo.com [207.126.227.249]) by mrout3.yahoo.com (8.13.8/8.13.8/y.out) with ESMTP id o5QIMtQS006759 for ; Sat, 26 Jun 2010 11:22:55 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; s=serpent; d=yahoo-inc.com; c=nofws; q=dns; h=received:references:message-id:from:to:in-reply-to: content-type:thread-topic:thread-index: content-transfer-encoding:mime-version:subject:date:cc:return-path:x-originalarrivaltime; b=1CAc+25EUprmwlhAxhlELqgbOrSKJU+YwMgt43AZsILge49+YX0qXxmBG45O22Es Received: from SNV-EXVS04.ds.corp.yahoo.com ([207.126.227.240]) by SNV-EXBH01.ds.corp.yahoo.com with Microsoft SMTPSVC(6.0.3790.4675); Sat, 26 Jun 2010 11:22:55 -0700 Received: from 207.126.227.185 ([207.126.227.185]) by SNV-EXVS04.ds.corp.yahoo.com ([207.126.227.185]) with Microsoft Exchange Server HTTP-DAV ; Sat, 26 Jun 2010 18:22:54 +0000 References: Message-ID: <33F49762-319A-4EC1-B7C2-1C99A617B806@yahoo-inc.com> From: "Vitthal Gogate" To: In-Reply-To: Content-Type: text/plain; format=flowed; delsp=yes; charset="us-ascii" Thread-Topic: Reduce in hadoop streaming takes longer Thread-Index: AcsVXJgPLl5dCBLOSoKA+cDCHgR4Xg== Content-Transfer-Encoding: 7bit MIME-Version: 1.0 (iPhone Mail 7D11) Subject: Re: Reduce in hadoop streaming takes longer Date: Sat, 26 Jun 2010 11:15:03 -0700 Cc: X-OriginalArrivalTime: 26 Jun 2010 18:22:55.0633 (UTC) FILETIME=[98E24010:01CB155C] X-Virus-Checked: Checked by ClamAV on apache.org Are u using combiner in in streaming? It should help reduce shuffle time considerably Sent from my iPhone On Jun 26, 2010, at 1:16 AM, "Tanveer Faruquie" wrote: > Hi, > > I am trying to use Hadoop streaming to evaluate a set of data points > for a given rule. In the end I have to find on how many data points > the rule was true. Each data point is on a single line and the mapper > outputs either 1 or 0 if the rule for that data point is true or false > respectively. The reducer sums the output of mapper to give the final > count of the rule on all the examples. > > While running this using hadoop streaming I notice that the reduce > phase takes considerably more time than the mapper. Moreover the > reduce times increases with data size. Since my job is equivalent to > counting the number of words in a file I timed a streaming version of > word count and the regular java version where each word is replaced by > the same word ('foo'). The mapper and reducers are written in python > for streaming. I ran this on 1.5 GB of data. > > The streaming version took 29mins while java version took around > 2min44sec on the same hadoop cluster (consisting of 3 machines). The > map phase in both the cases took approximately the same amount of time > whereas the reduce phase took much longer for streaming. > > Does anyone know what might be the problem and if possible how to > reduce the time taken by reduce phase for streaming? > > Regards, > Tanveer > > PS: These are the output for the test run. > > using java for word count > > $ time $HADOOP_HOME/bin/hadoop jar ./wordcount.jar WordCount text > text-out > 10/06/25 21:58:15 WARN mapred.JobClient: Use GenericOptionsParser for > parsing the arguments. Applications should implement Tool for the > same. > 10/06/25 21:58:15 INFO mapred.FileInputFormat: Total input paths to > process : 86 > 10/06/25 21:58:15 INFO mapred.FileInputFormat: Total input paths to > process : 86 > 10/06/25 21:58:16 INFO mapred.JobClient: Running job: > job_201006252042_0005 > 10/06/25 21:58:17 INFO mapred.JobClient: map 0% reduce 0% > 10/06/25 21:58:30 INFO mapred.JobClient: map 2% reduce 0% > 10/06/25 21:58:35 INFO mapred.JobClient: map 4% reduce 0% > 10/06/25 21:58:38 INFO mapred.JobClient: map 6% reduce 0% > 10/06/25 21:58:40 INFO mapred.JobClient: map 7% reduce 0% > 10/06/25 21:58:42 INFO mapred.JobClient: map 9% reduce 0% > 10/06/25 21:58:44 INFO mapred.JobClient: map 10% reduce 0% > 10/06/25 21:58:45 INFO mapred.JobClient: map 13% reduce 0% > 10/06/25 21:58:46 INFO mapred.JobClient: map 15% reduce 0% > 10/06/25 21:58:48 INFO mapred.JobClient: map 18% reduce 0% > 10/06/25 21:58:49 INFO mapred.JobClient: map 20% reduce 0% > 10/06/25 21:58:50 INFO mapred.JobClient: map 22% reduce 0% > 10/06/25 21:58:51 INFO mapred.JobClient: map 24% reduce 0% > 10/06/25 21:58:52 INFO mapred.JobClient: map 28% reduce 0% > 10/06/25 21:58:54 INFO mapred.JobClient: map 31% reduce 0% > 10/06/25 21:58:55 INFO mapred.JobClient: map 35% reduce 0% > 10/06/25 21:58:56 INFO mapred.JobClient: map 36% reduce 0% > 10/06/25 21:58:57 INFO mapred.JobClient: map 38% reduce 0% > 10/06/25 21:58:58 INFO mapred.JobClient: map 44% reduce 0% > 10/06/25 21:58:59 INFO mapred.JobClient: map 46% reduce 0% > 10/06/25 21:59:00 INFO mapred.JobClient: map 51% reduce 0% > 10/06/25 21:59:01 INFO mapred.JobClient: map 55% reduce 0% > 10/06/25 21:59:02 INFO mapred.JobClient: map 58% reduce 0% > 10/06/25 21:59:03 INFO mapred.JobClient: map 63% reduce 0% > 10/06/25 21:59:04 INFO mapred.JobClient: map 68% reduce 0% > 10/06/25 21:59:05 INFO mapred.JobClient: map 74% reduce 0% > 10/06/25 21:59:06 INFO mapred.JobClient: map 76% reduce 0% > 10/06/25 21:59:07 INFO mapred.JobClient: map 82% reduce 0% > 10/06/25 21:59:08 INFO mapred.JobClient: map 90% reduce 0% > 10/06/25 21:59:14 INFO mapred.JobClient: map 91% reduce 0% > 10/06/25 21:59:19 INFO mapred.JobClient: map 92% reduce 0% > 10/06/25 21:59:21 INFO mapred.JobClient: map 93% reduce 0% > 10/06/25 21:59:23 INFO mapred.JobClient: map 95% reduce 0% > 10/06/25 21:59:28 INFO mapred.JobClient: map 95% reduce 4% > 10/06/25 21:59:33 INFO mapred.JobClient: map 95% reduce 6% > 10/06/25 21:59:34 INFO mapred.JobClient: map 96% reduce 6% > 10/06/25 21:59:38 INFO mapred.JobClient: map 96% reduce 8% > 10/06/25 21:59:39 INFO mapred.JobClient: map 97% reduce 8% > 10/06/25 21:59:42 INFO mapred.JobClient: map 98% reduce 8% > 10/06/25 21:59:43 INFO mapred.JobClient: map 98% reduce 9% > 10/06/25 21:59:46 INFO mapred.JobClient: map 99% reduce 9% > 10/06/25 21:59:48 INFO mapred.JobClient: map 99% reduce 11% > 10/06/25 21:59:50 INFO mapred.JobClient: map 99% reduce 12% > 10/06/25 21:59:55 INFO mapred.JobClient: map 99% reduce 13% > 10/06/25 21:59:58 INFO mapred.JobClient: map 99% reduce 15% > 10/06/25 22:00:03 INFO mapred.JobClient: map 99% reduce 16% > 10/06/25 22:00:04 INFO mapred.JobClient: map 100% reduce 16% > 10/06/25 22:00:05 INFO mapred.JobClient: map 100% reduce 17% > 10/06/25 22:00:10 INFO mapred.JobClient: map 100% reduce 18% > 10/06/25 22:00:13 INFO mapred.JobClient: map 100% reduce 19% > 10/06/25 22:00:15 INFO mapred.JobClient: map 100% reduce 20% > 10/06/25 22:00:18 INFO mapred.JobClient: map 100% reduce 21% > 10/06/25 22:00:20 INFO mapred.JobClient: map 100% reduce 22% > 10/06/25 22:00:25 INFO mapred.JobClient: map 100% reduce 23% > 10/06/25 22:00:28 INFO mapred.JobClient: map 100% reduce 24% > 10/06/25 22:00:30 INFO mapred.JobClient: map 100% reduce 25% > 10/06/25 22:00:33 INFO mapred.JobClient: map 100% reduce 26% > 10/06/25 22:00:40 INFO mapred.JobClient: map 100% reduce 28% > 10/06/25 22:00:45 INFO mapred.JobClient: map 100% reduce 29% > 10/06/25 22:00:48 INFO mapred.JobClient: map 100% reduce 30% > 10/06/25 22:00:50 INFO mapred.JobClient: map 100% reduce 31% > 10/06/25 22:00:55 INFO mapred.JobClient: map 100% reduce 32% > 10/06/25 22:00:57 INFO mapred.JobClient: Job complete: > job_201006252042_0005 > 10/06/25 22:00:57 INFO mapred.JobClient: Counters: 17 > 10/06/25 22:00:57 INFO mapred.JobClient: File Systems > 10/06/25 22:00:57 INFO mapred.JobClient: HDFS bytes > read=1563645386 > 10/06/25 22:00:58 INFO mapred.JobClient: HDFS bytes written=14 > 10/06/25 22:00:58 INFO mapred.JobClient: Local bytes read=62374 > 10/06/25 22:00:58 INFO mapred.JobClient: Local bytes written=68894 > 10/06/25 22:00:58 INFO mapred.JobClient: Job Counters > 10/06/25 22:00:58 INFO mapred.JobClient: Launched reduce tasks=1 > 10/06/25 22:00:58 INFO mapred.JobClient: Rack-local map tasks=62 > 10/06/25 22:00:58 INFO mapred.JobClient: Launched map tasks=116 > 10/06/25 22:00:58 INFO mapred.JobClient: Data-local map tasks=46 > 10/06/25 22:00:58 INFO mapred.JobClient: Map-Reduce Framework > 10/06/25 22:00:58 INFO mapred.JobClient: Reduce input groups=1 > 10/06/25 22:00:58 INFO mapred.JobClient: Combine output > records=1001 > 10/06/25 22:00:58 INFO mapred.JobClient: Map input > records=33325068 > 10/06/25 22:00:58 INFO mapred.JobClient: Reduce output records=1 > 10/06/25 22:00:58 INFO mapred.JobClient: Map output bytes=1892793728 > 10/06/25 22:00:58 INFO mapred.JobClient: Map input > bytes=1563555252 > 10/06/25 22:00:58 INFO mapred.JobClient: Combine input records=236600216 > 10/06/25 22:00:58 INFO mapred.JobClient: Map output records=236599216 > 10/06/25 22:00:58 INFO mapred.JobClient: Reduce input records=1 > > real 2m44.426s > user 0m0.784s > sys 0m0.228s > > > ----------------- > > using streaming for word count > > > $ time /hadoop-0.18.3/bin/hadoop jar > /hadoop-0.18.3/contrib/streaming/hadoop-0.18.3-streaming.jar -file > ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer > ./reducer.py -input text -output text-out > additionalConfSpec_:null > null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming > packageJobJar: [./mapper.py, ./reducer.py, > /home/user/hroot/tmp/hadoop-unjar539717104050104853/] [] > /tmp/streamjob5333570157243247862.jar tmpDir=null > 10/06/25 22:02:22 WARN mapred.JobClient: Use GenericOptionsParser for > parsing the arguments. Applications should implement Tool for the > same. > 10/06/25 22:02:22 INFO mapred.FileInputFormat: Total input paths to > process : 86 > 10/06/25 22:02:22 INFO mapred.FileInputFormat: Total input paths to > process : 86 > 10/06/25 22:02:22 INFO streaming.StreamJob: getLocalDirs(): > [/home/user/hroot/tmp/mapred/local] > 10/06/25 22:02:22 INFO streaming.StreamJob: Running job: > job_201006252042_0006 > 10/06/25 22:02:22 INFO streaming.StreamJob: To kill this job, run: > 10/06/25 22:02:22 INFO streaming.StreamJob: > /home/user/hadoop-0.18.3/bin/../bin/hadoop job > -Dmapred.job.tracker=hadoop1:19001 -kill job_201006252042_0006 > 10/06/25 22:02:22 INFO streaming.StreamJob: Tracking URL: > http://hadoop1:50030/jobdetails.jsp?jobid=job_201006252042_0006 > 10/06/25 22:02:24 INFO streaming.StreamJob: map 0% reduce 0% > 10/06/25 22:02:34 INFO streaming.StreamJob: map 1% reduce 0% > 10/06/25 22:02:39 INFO streaming.StreamJob: map 2% reduce 0% > 10/06/25 22:02:44 INFO streaming.StreamJob: map 3% reduce 0% > 10/06/25 22:02:49 INFO streaming.StreamJob: map 5% reduce 0% > 10/06/25 22:02:50 INFO streaming.StreamJob: map 6% reduce 0% > 10/06/25 22:02:55 INFO streaming.StreamJob: map 7% reduce 0% > 10/06/25 22:02:56 INFO streaming.StreamJob: map 10% reduce 0% > 10/06/25 22:02:57 INFO streaming.StreamJob: map 12% reduce 0% > 10/06/25 22:02:59 INFO streaming.StreamJob: map 13% reduce 0% > 10/06/25 22:03:01 INFO streaming.StreamJob: map 14% reduce 0% > 10/06/25 22:03:02 INFO streaming.StreamJob: map 15% reduce 0% > 10/06/25 22:03:03 INFO streaming.StreamJob: map 21% reduce 0% > 10/06/25 22:03:04 INFO streaming.StreamJob: map 23% reduce 0% > 10/06/25 22:03:07 INFO streaming.StreamJob: map 26% reduce 0% > 10/06/25 22:03:09 INFO streaming.StreamJob: map 27% reduce 0% > 10/06/25 22:03:10 INFO streaming.StreamJob: map 28% reduce 0% > 10/06/25 22:03:11 INFO streaming.StreamJob: map 30% reduce 0% > 10/06/25 22:03:12 INFO streaming.StreamJob: map 33% reduce 0% > 10/06/25 22:03:15 INFO streaming.StreamJob: map 36% reduce 0% > 10/06/25 22:03:16 INFO streaming.StreamJob: map 40% reduce 0% > 10/06/25 22:03:17 INFO streaming.StreamJob: map 43% reduce 0% > 10/06/25 22:03:20 INFO streaming.StreamJob: map 48% reduce 0% > 10/06/25 22:03:21 INFO streaming.StreamJob: map 50% reduce 0% > 10/06/25 22:03:22 INFO streaming.StreamJob: map 52% reduce 0% > 10/06/25 22:03:23 INFO streaming.StreamJob: map 54% reduce 0% > 10/06/25 22:03:26 INFO streaming.StreamJob: map 56% reduce 0% > 10/06/25 22:03:27 INFO streaming.StreamJob: map 58% reduce 0% > 10/06/25 22:03:28 INFO streaming.StreamJob: map 62% reduce 0% > 10/06/25 22:03:30 INFO streaming.StreamJob: map 64% reduce 0% > 10/06/25 22:03:31 INFO streaming.StreamJob: map 67% reduce 0% > 10/06/25 22:03:32 INFO streaming.StreamJob: map 68% reduce 0% > 10/06/25 22:03:35 INFO streaming.StreamJob: map 70% reduce 0% > 10/06/25 22:03:36 INFO streaming.StreamJob: map 76% reduce 0% > 10/06/25 22:03:37 INFO streaming.StreamJob: map 79% reduce 3% > 10/06/25 22:03:40 INFO streaming.StreamJob: map 82% reduce 3% > 10/06/25 22:03:41 INFO streaming.StreamJob: map 84% reduce 3% > 10/06/25 22:03:42 INFO streaming.StreamJob: map 89% reduce 5% > 10/06/25 22:03:49 INFO streaming.StreamJob: map 92% reduce 8% > 10/06/25 22:03:53 INFO streaming.StreamJob: map 93% reduce 9% > 10/06/25 22:03:56 INFO streaming.StreamJob: map 94% reduce 9% > 10/06/25 22:03:59 INFO streaming.StreamJob: map 95% reduce 10% > 10/06/25 22:04:01 INFO streaming.StreamJob: map 96% reduce 10% > 10/06/25 22:04:05 INFO streaming.StreamJob: map 97% reduce 13% > 10/06/25 22:04:10 INFO streaming.StreamJob: map 98% reduce 15% > 10/06/25 22:04:15 INFO streaming.StreamJob: map 99% reduce 16% > 10/06/25 22:04:20 INFO streaming.StreamJob: map 99% reduce 18% > 10/06/25 22:04:21 INFO streaming.StreamJob: map 100% reduce 18% > 10/06/25 22:04:25 INFO streaming.StreamJob: map 100% reduce 19% > 10/06/25 22:04:30 INFO streaming.StreamJob: map 100% reduce 20% > 10/06/25 22:04:35 INFO streaming.StreamJob: map 100% reduce 22% > 10/06/25 22:04:45 INFO streaming.StreamJob: map 100% reduce 23% > 10/06/25 22:04:50 INFO streaming.StreamJob: map 100% reduce 25% > 10/06/25 22:04:55 INFO streaming.StreamJob: map 100% reduce 26% > 10/06/25 22:05:00 INFO streaming.StreamJob: map 100% reduce 28% > 10/06/25 22:05:05 INFO streaming.StreamJob: map 100% reduce 29% > 10/06/25 22:05:10 INFO streaming.StreamJob: map 100% reduce 30% > 10/06/25 22:05:15 INFO streaming.StreamJob: map 100% reduce 31% > 10/06/25 22:05:20 INFO streaming.StreamJob: map 100% reduce 32% > 10/06/25 22:05:30 INFO streaming.StreamJob: map 100% reduce 33% > 10/06/25 22:07:41 INFO streaming.StreamJob: map 100% reduce 67% > 10/06/25 22:31:52 INFO streaming.StreamJob: map 100% reduce 100% > 10/06/25 22:31:52 INFO streaming.StreamJob: Job complete: > job_201006252042_0006 > 10/06/25 22:31:52 INFO streaming.StreamJob: Output: text-out > > real 29m31.166s > user 0m1.060s > sys 0m0.216s