hadoop-common-user mailing list archives

From Leo Alekseyev <dnqu...@gmail.com>
Subject Is it possible to use NullWritable in combiner? + general question about combining output from many small maps
Date Wed, 21 Jul 2010 08:59:27 GMT
Hi all,
I have a job where all the processing is done by the mappers, but each
mapper produces a small file, and I want to combine these into 3-4 large
ones.  In addition, I only care about the values, not the keys, so a
NullWritable key seems appropriate.  I tried using the default reducer
(which according to the docs is the identity) by setting
job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class) and
using a NullWritable key on the mapper output.  However, this seems to
concentrate the work on a single reducer.  I then tried emitting
LongWritable as the mapper key and writing a combiner that outputs
NullWritable (i.e. class GenerateLogLineProtoCombiner extends
Reducer<LongWritable, ProtobufLineMsgWritable, NullWritable,
ProtobufLineMsgWritable>), still using the default reducer.  This gave
me the following error, thrown by the combiner:

10/07/21 01:21:38 INFO mapred.JobClient: Task Id : attempt_201007122205_1058_m_000104_2, Status : FAILED
java.io.IOException: wrong key class: class org.apache.hadoop.io.NullWritable is not class org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:164)
          .........
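
One likely reading of the trace above: the combiner's output goes back into the same intermediate (map-side) files as the mapper's output, and the writer for those files is set up with the declared map output key class, so any other key type is rejected. The sketch below imitates that check in plain Java. LongKey, NullKey and IntermediateWriter are hypothetical stand-ins invented for illustration, not the real Hadoop classes, and the check is a simplification of what IFile$Writer.append appears to do.

```java
import java.io.IOException;

public class KeyClassCheckSketch {

    // Hypothetical stand-in for a LongWritable-style key.
    public static final class LongKey {
        public final long value;
        public LongKey(long value) { this.value = value; }
    }

    // Hypothetical stand-in for a NullWritable-style singleton key.
    public static final class NullKey {
        public static final NullKey INSTANCE = new NullKey();
        private NullKey() { }
    }

    // Simplified stand-in for the intermediate-output writer: it is told the
    // map output key class once and refuses keys of any other class.
    public static final class IntermediateWriter {
        private final Class<?> keyClass;
        private int records = 0;

        public IntermediateWriter(Class<?> keyClass) { this.keyClass = keyClass; }

        public void append(Object key, Object value) throws IOException {
            if (key.getClass() != keyClass) {
                throw new IOException("wrong key class: " + key.getClass()
                        + " is not " + keyClass);
            }
            records++;
        }

        public int recordCount() { return records; }
    }

    public static void main(String[] args) {
        // The job declares LongKey as the map output key class.
        IntermediateWriter writer = new IntermediateWriter(LongKey.class);
        try {
            writer.append(new LongKey(1L), "mapper record");    // accepted
            writer.append(NullKey.INSTANCE, "combiner record"); // rejected
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, a combiner has to emit the same key and value types it receives; changing the key type is only possible in the reducer, whose output is written through the job's output format instead.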

I was able to get things working by explicitly putting in an identity
reducer that takes (LongWritable key, value) and outputs
(NullWritable, value).  However, now most of my processing is in the
reduce phase, which seems like a waste -- it's copying and sorting
data, but all I really need is to "glue" together the small map
outputs.

Thus, my questions are as follows.  First, I don't really understand why
the combiner is throwing an error here.  Does it simply not allow
NullWritable on the output?
The second question: is there a standard strategy for quickly
combining many small map outputs?  Would it be worth looking
into adjusting the min split size for the mappers?  (Can this value
be adjusted dynamically based on the input file size?)
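
To make the min split size idea concrete, here is a minimal sketch of the arithmetic, assuming the split size is chosen as max(minSize, min(maxSize, blockSize)), which is how FileInputFormat in the new API computes it. The helper names minSplitSizeFor and approxSplitCount are hypothetical, and the split count is only approximate because Hadoop lets the last split run slightly over.

```java
public class SplitSizeSketch {

    // Assumed formula for the size of one input split:
    // max(minSize, min(maxSize, blockSize)).
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: the min split size that would make a file of
    // fileLength bytes produce about targetSplits splits (ceiling division).
    public static long minSplitSizeFor(long fileLength, int targetSplits) {
        if (targetSplits <= 0) {
            throw new IllegalArgumentException("targetSplits must be positive");
        }
        return (fileLength + targetSplits - 1) / targetSplits;
    }

    // Hypothetical helper: rough number of splits for a file, ignoring the
    // small slack Hadoop allows on the final split.
    public static long approxSplitCount(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;    // assumed 64 MB HDFS block
        long fileLength = 1024L * 1024 * 1024; // assumed 1 GB input file

        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long minSize = minSplitSizeFor(fileLength, 4);
        long tunedSplit = computeSplitSize(blockSize, minSize, Long.MAX_VALUE);

        System.out.println("default splits: " + approxSplitCount(fileLength, defaultSplit));
        System.out.println("tuned splits:   " + approxSplitCount(fileLength, tunedSplit));
    }
}
```

In a driver, a value computed this way from the input file length could be passed to FileInputFormat.setMinInputSplitSize(job, minSize) before submitting the job. Note that a split never spans more than one file with the plain FileInputFormat, so this only reduces the number of mappers when the individual input files are larger than one block.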

Thanks to anyone who can give me some pointers :)
--Leo
