hadoop-common-user mailing list archives

From "Feng Jiang" <feng.a.ji...@gmail.com>
Subject Re: Some new requests about mapreduce
Date Tue, 07 Nov 2006 02:57:21 GMT
Thanks for your answers.

On 11/7/06, Doug Cutting <cutting@apache.org> wrote:
> Feng Jiang wrote:
> > I think some features are very useful for us:
> >
> > 1. Multiple key types supported in input. For example: SEQ file A is
> > <Ka, Va> pairs, and SEQ file B is <Kb, Vb> pairs. I could simply add
> > both of these files as input, and the map function could be
> > map(Object, Object). That way, I wouldn't have to wrap Ka and Kb into
> > ObjectWritable, and the program would be more readable.
> This is addressed by http://issues.apache.org/jira/browse/HADOOP-372.
> > 2. Value comparator support. There is a key comparator in the current
> > hadoop, and with it I can specify the order of the keys in the reduce
> > phase. But sometimes I also need to specify the order of the value
> > sequence in the reduce phase. For example, the values in the reduce
> > phase consist of Shop and Goods, and I want the Shop object to always
> > be the first object in the values because the output needs the shop
> > info. Currently I have to store the Goods info in a buffer until the
> > Shop object has been found.
> This is addressed by http://issues.apache.org/jira/browse/HADOOP-485.

I think what I'm concerned about is different from HADOOP-485. I mean, if
the input of the reduce phase is:

K2, V3
K2, V2
K1, V5
K1, V3
K1, V4

in the current Hadoop, the grouped reduce input could be:
K1, (V5, V3, V4)
K2, (V3, V2)

But I hope Hadoop will support job.setOutputValueComparatorClass(theClass),
so that I can make sure the values are in order, and the grouped input
could be:
K1, (V3, V4, V5)
K2, (V2, V3)

This feature is very important, I think. Without it, we have to do the
sorting ourselves, and have to worry about the possibility that the values
are too large to fit into memory. Then the code becomes too hard to read.
That is the reason why I think this feature is so important, and should be
done in the Hadoop framework.
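The buffering workaround described above can be sketched in plain Java
(with Integer standing in for the actual Writable value type; this is an
illustration, not Hadoop code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/**
 * Sketch of the manual workaround: without a value comparator, the
 * reducer must buffer all values for a key in memory and sort them
 * itself before it can process them in order.
 */
public class ReduceSideSort {

    /** Collects the values for one key and returns them sorted. */
    static List<Integer> sortedValues(Iterator<Integer> values) {
        List<Integer> buffer = new ArrayList<>();
        while (values.hasNext()) {
            buffer.add(values.next()); // may exhaust memory for a large key
        }
        Collections.sort(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        // Values for key K1 arrive in arbitrary order, as in the example.
        Iterator<Integer> k1 = List.of(5, 3, 4).iterator();
        System.out.println(sortedValues(k1)); // prints [3, 4, 5]
    }
}
```

If the framework sorted values before calling reduce, both the buffer and
the memory risk would disappear from user code.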

> > 3. More efficient "ObjectWritable". Looking at ObjectWritable's
> > implementation, the class type information is always written into the
> > sequence file. But in many cases both the key and the value are pretty
> > small, and the class type information is even much larger than the key
> > and value themselves.
> ObjectWritable is not used so much for bulk data, but rather for small
> items, like RPC parameters, so the size overhead is usually not an
> issue.  Where are you finding this overhead onerous?

I specified two sequence files as input for MapReduce. Because Hadoop
currently doesn't support multiple key types in the map phase, I had to
wrap them in a general type, so I chose ObjectWritable. But the data I
actually want to write to the file is just an integer or an integer pair,
which is pretty small; the class declaration, such as
com.some.some1.SomeClass, becomes the biggest part of the output file.

So I have written GenericWritable, an abstract class that helps the user
wrap different Writable instances at a cost of only one byte. The
GenericObject class is a demo showing how to use GenericWritable. Both are
attached to this email.
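The one-byte-tag idea behind the attached class can be sketched as follows.
The class and interface names here are illustrative, not Hadoop's actual
API: instead of writing the full class name per record (as ObjectWritable
does), the possible payload classes are registered once and only a one-byte
index is written.

```java
import java.io.*;

/** Illustrative sketch of a GenericWritable-style one-byte type tag. */
public class OneByteWrapper {

    /** Stand-in for Hadoop's Writable interface. */
    interface Payload {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static class IntPayload implements Payload {
        int value;
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    static class IntPairPayload implements Payload {
        int first, second;
        public void write(DataOutput out) throws IOException {
            out.writeInt(first); out.writeInt(second);
        }
        public void readFields(DataInput in) throws IOException {
            first = in.readInt(); second = in.readInt();
        }
    }

    // The registry: a concrete subclass would supply this array once.
    static final Class<?>[] TYPES = { IntPayload.class, IntPairPayload.class };

    static void writeTagged(Payload p, DataOutput out) throws IOException {
        for (byte i = 0; i < TYPES.length; i++) {
            if (TYPES[i] == p.getClass()) {
                out.writeByte(i);   // one byte instead of a class name
                p.write(out);
                return;
            }
        }
        throw new IOException("unregistered type: " + p.getClass());
    }

    static Payload readTagged(DataInput in) throws IOException {
        byte tag = in.readByte();
        try {
            Payload p = (Payload) TYPES[tag].getDeclaredConstructor().newInstance();
            p.readFields(in);
            return p;
        } catch (ReflectiveOperationException e) {
            throw new IOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        IntPayload p = new IntPayload();
        p.value = 7;
        writeTagged(p, new DataOutputStream(bos));
        // 1 tag byte + 4 int bytes, versus dozens of bytes for a class name
        System.out.println("serialized size: " + bos.size() + " bytes");
    }
}
```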

> > 4. Compression support. A sequence file contains a lot of similar
> > data; if it could be compressed before it is actually written to
> > disk, a lot of time would be saved. For example, if the value type is
> > ObjectWritable, there must be a lot of class declaration information
> > that could be compressed. In my experience, 20% of bandwidth and disk
> > space would be saved.
> SequenceFile already supports compression, with extensible codecs:
> http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html
> So far, this uses Java's built-in codecs.  In the next release we also
> hope to include native support for zlib and lzo, greatly improving
> compression performance.
> http://issues.apache.org/jira/browse/HADOOP-538
> Lzo doesn't compress quite as well as zlib, but it's much faster.  In
> particular, zlib is generally slower than disk & net, while lzo is
> faster.  So zlib tends to save space but not time, while lzo should save
> both.
> Doug
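The kind of savings described above is easy to demonstrate with Java's
built-in zlib codec (java.util.zip.Deflater, the same family of codec
SequenceFile's built-in compression uses); a record stream full of
repeated class names compresses dramatically. The record format below is
made up for illustration:

```java
import java.util.zip.Deflater;

/** Shows how well repeated class-name declarations compress under zlib. */
public class ClassNameCompression {

    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        // Output buffer is large enough to hold the result in one call.
        byte[] buf = new byte[data.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // Simulate 1000 records, each repeating the full class name.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("com.some.some1.SomeClass").append('\t').append(i).append('\n');
        }
        byte[] raw = sb.toString().getBytes();
        System.out.println(raw.length + " bytes -> " + deflatedSize(raw) + " bytes");
    }
}
```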

  • Unnamed multipart/mixed (inline, None, 0 bytes)