hadoop-common-user mailing list archives

From "Feng Jiang" <feng.a.ji...@gmail.com>
Subject Re: Some new requests about mapreduce
Date Tue, 07 Nov 2006 03:35:40 GMT
On 11/7/06, Feng Jiang <feng.a.jiang@gmail.com> wrote:
>
> Thanks for your answers.
>
> On 11/7/06, Doug Cutting <cutting@apache.org> wrote:
> >
> > Feng Jiang wrote:
> > > I think some features are very useful for us:
> > >
> > > 1. Multi-key types supported in input. For example: SEQ file A contains
> > > <Ka, Va> pairs, and SEQ file B contains <Kb, Vb> pairs. I could simply
> > > add both files as input, and the map function could be map(Object,
> > > Object). That way, I wouldn't have to wrap Ka and Kb into
> > > ObjectWritable, and the program would be more readable.
> >
> > This is addressed by http://issues.apache.org/jira/browse/HADOOP-372.
> >
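To make the workaround concrete: below is a plain-Java sketch (not Hadoop API; the ShopKey/GoodsKey types and the map signature are illustrative stand-ins) of what a single map(Object, Object) function has to do today when records from two differently-typed inputs arrive together: dispatch on the runtime class of the key. Per-path input formats, as in HADOOP-372, would remove the need for this.

```java
// Sketch of the single-mapper workaround for two key types.
// ShopKey and GoodsKey are hypothetical stand-ins for Ka and Kb.
public class MixedKeyMap {
    static class ShopKey {
        final String shopId;
        ShopKey(String shopId) { this.shopId = shopId; }
    }
    static class GoodsKey {
        final long goodsId;
        GoodsKey(long goodsId) { this.goodsId = goodsId; }
    }

    // One map function receives keys of either type and must
    // dispatch on the runtime class.
    static String map(Object key, Object value) {
        if (key instanceof ShopKey) {
            return "shop:" + ((ShopKey) key).shopId + "=" + value;
        } else if (key instanceof GoodsKey) {
            return "goods:" + ((GoodsKey) key).goodsId + "=" + value;
        }
        throw new IllegalArgumentException("unexpected key type: " + key.getClass());
    }

    public static void main(String[] args) {
        System.out.println(map(new ShopKey("s1"), "Seattle")); // shop:s1=Seattle
        System.out.println(map(new GoodsKey(42), "book"));     // goods:42=book
    }
}
```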
> > > 2. Value comparator supported. There is a key comparator in the current
> > > hadoop, which lets me specify the order of the keys in the reduce
> > > phase. But sometimes I also need to specify the order of the value
> > > sequence in the reduce phase. For example, the values in the reduce
> > > phase consist of Shop and Goods objects, and I want the Shop object to
> > > always be the first object in the values, because the output needs the
> > > shop info. Currently I have to buffer the Goods info until the Shop
> > > object has been found.
> >
> > This is addressed by http://issues.apache.org/jira/browse/HADOOP-485 .
>
>
> I think what I am concerned about is different from HADOOP-485. I mean, if
> the input of the reduce phase is:
>
> K2, V3
> K2, V2
> K1, V5
> K1, V3
> K1, V4
>
> in the current hadoop, the reduce output could be:
> K1, (V5, V3, V4)
> K2, (V3, V2)
>
> But I hope hadoop will support job.setOutputValueComparatorClass(theClass),
> so that I can make the values ordered, and the output could be:
> K1, (V3, V4, V5)
> K2, (V2, V3)
>
> This feature is very important, I think. Without it, we have to do the
> sorting ourselves, and have to worry about the possibility that the values
> are too large to fit into memory; the code then becomes too hard to read.
> That is why I think this feature is so important, and should be done in
> the hadoop framework.
>

I have created a new request in Jira.
https://issues.apache.org/jira/browse/HADOOP-686
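The requested behaviour can be illustrated in plain Java (this is not a Hadoop API; the proposed setOutputValueComparatorClass does not exist yet, so the sort below is done in user code, which is exactly the burden being described): group values by key, then order each group with a value comparator, turning K1, (V5, V3, V4) into K1, (V3, V4, V5).

```java
// Plain-Java illustration of the requested value ordering.
// The framework would do the per-group sort if value comparators
// were supported; here it happens in user code, in memory.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ValueOrdering {
    public static SortedMap<String, List<String>> group(
            List<Map.Entry<String, String>> input,
            Comparator<String> valueComparator) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : input) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // This per-key sort is what the proposed comparator would provide;
        // done in user code it may not fit in memory for large value lists.
        for (List<String> values : grouped.values()) {
            values.sort(valueComparator);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
            Map.entry("K2", "V3"), Map.entry("K2", "V2"),
            Map.entry("K1", "V5"), Map.entry("K1", "V3"), Map.entry("K1", "V4"));
        System.out.println(group(input, Comparator.naturalOrder()));
        // {K1=[V3, V4, V5], K2=[V2, V3]}
    }
}
```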


> > > 3. More effective "ObjectWritable". Looking at ObjectWritable's
> > > implementation, the class type information is always written into the
> > > sequence file. But in many cases, both the key and the value are pretty
> > > small, and the class type information is even much larger than the key
> > > and value themselves.
> >
> > ObjectWritable is not used so much for bulk data, but rather for small
> > items, like RPC parameters, so the size overhead is usually not an
> > issue.  Where are you finding this overhead onerous?
>
>
> I specified two sequence files as input files for MapReduce. Because
> hadoop currently doesn't support multiple key types in the map phase, I
> had to wrap them into a general type, so I chose ObjectWritable. But the
> data that I actually want to write into the file is just an integer or an
> integer pair, which is pretty small; the class declaration, such as
> com.some.some1.SomeClass, becomes the biggest part of the output file.
>
> I have written GenericWritable, an abstract class that helps users wrap
> different Writable instances at a cost of only one byte. GenericObject is
> a demo showing how to use GenericWritable. Both are attached to this
> email.
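The one-byte idea can be sketched without Hadoop (this is not the attached GenericWritable code; class and method names below are illustrative): instead of prefixing every record with the full class name, as ObjectWritable does, write a single byte that indexes into a fixed table of registered types.

```java
// Sketch of the one-byte type tag vs. per-record class names.
// TYPES, writeWithClassName, and writeWithTag are hypothetical names.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TaggedUnionDemo {
    static final String[] TYPES = {"com.example.ShopInfo", "com.example.GoodsInfo"};

    // ObjectWritable-style: the full class name precedes every record.
    static byte[] writeWithClassName(int typeIndex, int payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(TYPES[typeIndex]); // class name written per record
        out.writeInt(payload);
        return buf.toByteArray();
    }

    // GenericWritable-style: one byte selects one of the registered types.
    static byte[] writeWithTag(int typeIndex, int payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(typeIndex); // one byte instead of the class name
        out.writeInt(payload);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("class-name record: " + writeWithClassName(0, 7).length + " bytes");
        System.out.println("tagged record:     " + writeWithTag(0, 7).length + " bytes");
    }
}
```

For a small payload like a single int, the class-name record is several times larger than the tagged one, which matches the complaint that the type information dominates the file.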
>

> > > 4. Compression supported. A sequence file contains a lot of similar
> > > data; if it could be compressed before it is actually written to disk,
> > > a lot of time would be saved. For example, if the value type is
> > > ObjectWritable, there must be a lot of class declaration information
> > > that could be compressed. In my experience, 20% of the bandwidth and
> > > disk space would be saved.
> >
> > SequenceFile already supports compression, with extensible codecs:
> >
> >
> > http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html
> >
> > So far, this uses Java's built-in codecs.  In the next release we also
> > hope to include native support for zlib and lzo, greatly improving
> > compression performance.
> >
> > http://issues.apache.org/jira/browse/HADOOP-538
> >
> > Lzo doesn't compress quite as well as zlib, but it's much faster.  In
> > particular, zlib is generally slower than disk & net, while lzo is
> > faster.  So zlib tends to save space but not time, while lzo should save
> > both.
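The space/time trade-off Doug describes can be demonstrated with Java's built-in zlib codec (java.util.zip.Deflater, which is what the current SequenceFile compression builds on; the record data below is made up for illustration): highly repetitive content, such as a class name repeated per record, compresses dramatically.

```java
// Compressing repetitive record data with Java's built-in zlib codec.
import java.util.zip.Deflater;

public class CompressDemo {
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(data);
        deflater.finish();
        // Output buffer sized for the worst case at these input sizes.
        byte[] out = new byte[data.length + 64];
        int n = 0;
        while (!deflater.finished()) {
            n += deflater.deflate(out, n, out.length - n);
        }
        deflater.end();
        return n;
    }

    public static void main(String[] args) {
        // Simulate 1000 records, each prefixed with the same class name.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("com.some.some1.SomeClass").append(i % 10);
        }
        byte[] raw = sb.toString().getBytes();
        System.out.println("raw=" + raw.length + " bytes, compressed=" + compressedSize(raw) + " bytes");
    }
}
```

Even at the fastest compression level the repeated class names shrink to a small fraction of the raw size, which is why compressing before the disk write can save time as well as space when the codec is faster than the disk.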
> >
> > Doug
> >
>
>
>
