From: Adam Shook <ashook@clearedgeit.com>
To: mapreduce-user@hadoop.apache.org
Date: Mon, 1 Aug 2011 17:19:07 -0400
Subject: Unusually large number of map tasks for a SequenceFile

Hi All,

I am writing a sequence file to HDFS from an application as a pre-process to a MapReduce job. (It isn't being written from an MR job; just open, write, close.)

The file is around 32 MB in size, yet when the MapReduce job starts up, it starts with 256 map tasks. That first job writes SequenceFiles, and I fire up a second job with its output. The second job has around 32 KB of input spread across 128 part files, and it starts 138 map tasks; since there are 128 part files, I would expect at most 128 map tasks for this second job. Both seem like unusually large numbers of map tasks, given that the cluster is configured with the default block size of 64 MB. I am using Hadoop v0.20.1.

Is there something special about how the SequenceFiles are being written? Below is a code sample showing how I write the first file.
Thanks,
Adam

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

// Open the file on HDFS, append every (s1, s2) pair as Text key/value, close.
FileSystem fs = FileSystem.get(new Configuration());

// <path_to_file> is the output Path (placeholder kept from the original post)
Writer wrtr = SequenceFile.createWriter(fs, fs.getConf(), <path_to_file>, Text.class, Text.class);

for (String s1 : strings1) {
    for (String s2 : strings2) {
        wrtr.append(new Text(s1), new Text(s2));
    }
}

wrtr.close();
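For reference, here is my reading of how the split count gets computed. This is only a sketch of the 0.20 old-API FileInputFormat logic as I understand it, and the mapred.map.tasks hint of 256 is my assumption, not something I have confirmed in our config:

// Sketch of FileInputFormat's split sizing (my reading of the 0.20 sources):
//   splitSize = max(minSize, min(goalSize, blockSize)),
// where goalSize = totalSize / numSplits and numSplits is the mapred.map.tasks hint.
long totalSize = 32L * 1024 * 1024;       // the 32 MB input file
long blockSize = 64L * 1024 * 1024;       // cluster default block size
long minSize   = 2000L;                   // SequenceFileInputFormat's sync-interval minimum
int  numSplits = 256;                     // hypothetical mapred.map.tasks hint

long goalSize  = totalSize / numSplits;                            // 128 KB
long splitSize = Math.max(minSize, Math.min(goalSize, blockSize)); // 128 KB
long numMaps   = totalSize / splitSize;                            // 256 map tasks

If something in the cluster or job config sets mapred.map.tasks that high, that alone would account for 256 maps on a 32 MB file. For the second job, each of the 128 part files gets at least one split regardless of its size, and any file larger than the ~2000-byte minimum could be cut into more than one split, which might explain the extra 10 tasks.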
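In case it helps, this is what I plan to experiment with to bring the counts down; a minimal sketch assuming the old JobConf API, where MyJob is a stand-in for my actual driver class:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

JobConf conf = new JobConf(MyJob.class);             // MyJob is a placeholder driver class
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setNumMapTasks(1);                              // shrink the per-job split-count hint
conf.setLong("mapred.min.split.size", 64L * 1024 * 1024); // don't split below one block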