hadoop-common-user mailing list archives

From "Malcolm Matalka" <mmata...@millennialmedia.com>
Subject .gz input files having less output than uncompressed version
Date Thu, 07 May 2009 19:05:20 GMT
Problem:

I am comparing two jobs.  They both have the same input content; however,
in one job the input file has been gzipped, and in the other it has not.
I get far fewer output rows in the gzipped result than I do in the
uncompressed version:

 

Lines in output:

Gzipped: 86851

Uncompressed: 6569303

 

The gzipped input file is 875MB in size, and the entire job runs in
about 30 seconds.  The uncompressed file takes around 5 minutes to run.

 

Hadoop version:

0.18.1, r694836

 

Here is the output of the map task of the compressed input:

2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.

 

 

Am I doing something wrong?  Is there anything else I can do to debug
this?  Is it a known bug?
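
One check that might narrow this down is to count lines in the .gz file outside of Hadoop. The sketch below assumes, without any confirmation from the logs above, that the file could consist of several gzip streams concatenated together; a reader that stops after the first stream would then see only part of the data. The function name count_lines_per_member is hypothetical, and the script uses only Python's standard zlib module.

```python
import zlib

def count_lines_per_member(path, chunk_size=1 << 20):
    """Return a list holding the newline count of each gzip member in the file."""
    counts = []
    with open(path, "rb") as f:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        lines = 0
        buf = f.read(chunk_size)
        while buf:
            lines += decomp.decompress(buf).count(b"\n")
            if decomp.eof:
                # End of one gzip member: record its count and start a new
                # decompressor on whatever bytes follow it.
                counts.append(lines)
                lines = 0
                buf = decomp.unused_data
                decomp = zlib.decompressobj(31)
                if not buf:
                    buf = f.read(chunk_size)
            else:
                buf = f.read(chunk_size)
    return counts

# Example use (the file name is a placeholder):
#   counts = count_lines_per_member("input.gz")
#   print(len(counts), counts[0] if counts else 0, sum(counts))
```

If the list has more than one entry and the first entry is far smaller than the sum, the multi-stream assumption would be worth pursuing; if there is a single entry that matches the uncompressed line count, the cause is probably elsewhere. A truncated final stream is silently left out of the list, and trailing non-gzip bytes raise zlib.error.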

 

Let me know if you need anything else, thanks.

