Subject: .gz input files having less output than uncompressed version
Date: Thu, 7 May 2009 12:05:20 -0700
Message-ID: <4732284D6B19F34D8939C04EE064947504EF7623@EXVBE012-11.exch012.intermedia.net>
Reply-To: core-user@hadoop.apache.org
Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
From: "Malcolm Matalka"

Problem:

I am comparing two jobs. They both have the same input content; however, in one job the input file has been gzipped, and in the other it has not. I get far fewer output rows from the gzipped input than from the uncompressed version:

Lines in output:
Gzipped: 86851
Uncompressed: 6569303

The gzipped input file is 875MB in size, and the entire job runs in about 30 seconds. The job on the uncompressed file takes around 5 minutes to run.

Hadoop version: 0.18.1, r694836

Here is the output of the map task for the compressed input:

2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.

Am I doing something wrong? Is there anything else I can do to debug this? Is it a known bug?

Let me know if you need anything else, thanks.
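
On the question of what else could be done to debug this: one thing that may be worth ruling out (an assumption, nothing in the log above confirms it) is that the .gz file is made of several gzip members concatenated together and the reader stops after the first one. A minimal Python sketch for checking this locally follows. The function name count_lines_per_member and the chunk size are hypothetical, chosen only for illustration; it uses the standard zlib module and is independent of Hadoop.

```python
import zlib


def count_lines_per_member(stream, chunk_size=1 << 20):
    """Count newline characters in each gzip member of a binary stream.

    Returns one count per member, in file order. A file produced by a
    single gzip run has one member; concatenated .gz files have several.
    """
    counts = []
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    current = 0        # newlines seen in the member being decoded
    in_member = False  # True once bytes were fed to the current member
    pending = b""      # compressed bytes not yet consumed

    while True:
        if not pending:
            pending = stream.read(chunk_size)
            if not pending:
                break
        data = decomp.decompress(pending)
        in_member = True
        current += data.count(b"\n")
        if decomp.eof:
            # End of this member: record it and restart on leftover bytes.
            counts.append(current)
            current = 0
            pending = decomp.unused_data
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            in_member = False
        else:
            pending = b""

    if in_member:
        # Input ended in the middle of a member (truncated file).
        counts.append(current)
    return counts
```

Opening the input file in binary mode and passing it to count_lines_per_member would give one line count per member. If the first entry is close to the 86851 rows seen in the gzipped run and the entries together are close to the uncompressed count, that would point at the multi-member explanation; if the list has a single entry, this hypothesis can be dropped. Trailing non-gzip bytes after the last member would raise zlib.error in this sketch.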