Subject: Re: Number of Maps running more than expected
From: Anil Gupta
Date: Thu, 16 Aug 2012 07:27:13 -0700
To: user@hadoop.apache.org
Hi Gaurav,

Did you turn off speculative execution?
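If not, on CDH3 (MR1) it is controlled by the following mapred-site.xml properties — a minimal sketch, assuming the stock MR1 property names:

```xml
<!-- mapred-site.xml: disable speculative re-execution of slow tasks,
     so duplicate map attempts don't inflate the map count -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```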

Best Regards,
Anil

On Aug 16, 2012, at 7:13 AM, Gaurav Dasgupta <gdsayshi@gmail.com> wrote:

Hi users,
 
I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all 12 nodes, and 1 node running the Job Tracker).
In order to perform a WordCount benchmark test, I did the following:
  • Executed "RandomTextWriter" first to create 100 GB of data (note that I changed only the "test.randomtextwrite.total_bytes" parameter; all others are left at their defaults).
  • Next, executed the "WordCount" program for that 100 GB dataset.
The "Block Size" in "hdfs-site.xml" is set as 128 MB. Now, according to my calculation, total number of Maps to be executed by the wordcount job should be 100 GB / 128 MB or 102400 MB / 128 MB = 800.
But when I execute the job, it runs a total of 900 Maps, i.e., 100 extra. So why the extra Maps? The job itself completes successfully without any error.
 
Again, if I don't run the "RandomTextWriter" job to create the data, but instead put my own 100 GB text file in HDFS and run "WordCount", then the number of Maps matches my calculation, i.e., 800.
 
Can anyone tell me why Hadoop behaves this way for WordCount only when the dataset is generated by RandomTextWriter? And what is the purpose of these extra Maps?
 
Regards,
Gaurav Dasgupta
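One arithmetic to check here (a sketch, assuming plain per-file FileInputFormat splitting and ignoring the small split-slop tolerance; the file counts below are hypothetical): splits are computed per input file, so every file whose length is not an exact multiple of the block size contributes one extra, short map — the same total bytes spread over many files can yield more maps than one big file.

```python
import math

def map_task_count(file_sizes_mb, block_size_mb=128):
    """Each input file is split independently; a file's final
    partial block still gets its own map task."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# One 100 GB file: 102400 / 128 = exactly 800 maps.
print(map_task_count([102400]))     # -> 800

# The same 100 GB as 128 files of 800 MB each:
# 800 / 128 = 6.25 blocks -> 7 splits per file -> 896 maps.
print(map_task_count([800] * 128))  # -> 896
```

Since RandomTextWriter writes one output file per map task, its output is many medium-sized files rather than one large one, which is consistent with seeing extra maps only for the generated dataset.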