hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eric Sammer <e...@lifeless.net>
Subject Re: avro in mapreduce
Date Fri, 29 Jan 2010 18:20:37 GMT
On 1/26/10 1:15 PM, Doug Cutting wrote:
> I would like to call folks attention to MAPREDUCE-1126.
> https://issues.apache.org/jira/browse/MAPREDUCE-1126
> This is a key link in a series of issues involved in integrating Avro in
> Mapreduce.  Aaron proposed a design in early December, building on the
> design Tom developed last summer and committed in September in
> HADOOP-6165.  Aaron's design was approved, and, after several rounds of
> reviews, I committed Aaron's patch on 11 January.
> On 15 January Owen reverted this commit without warning.  It seems that
> Owen objects to the path initiated last July in HADOOP-6165.
> Aaron has also contributed MAPREDUCE-815, which permits one to use Avro
> for all phases of Mapreduce.  When that issue is committed, the primary
> chain of Avro integration into Mapreduce will be complete.
> Can others please take the time to read this issue and express their
> opinions?

I'm coming at this from a place of far less information and commitment
to the APIs but as it stands, the existing relationship between formats
and types they support strikes me as odd.

Currently mappers and reducers are written to a contract of a particular
K/V type pair. Input and Output formats are currently aware of those
types directly and are not sanity checked until runtime given the way
format classes are specified to the job configuration. This means that a
given Mapper / Reducer implementation can be used with any format of a
particular "family" (i.e. Writable vs. Avro vs. etc.) although this is
not strongly enforced. It seems to me that if we already have this
dependency relationship that M / R implementations would, instead, be
parameterized by a serialization input / output pair rather than four
individual types and i/o formats would / could be parameterized with
serialization types.


public class MyMapper extends
  Mapper<AvroReader<String, AvroMyType>,
    WritableWriter<LongWritable, Text>> {

  public void map(AvroReader<String, AvroMyType> reader,
    Context<WritableWriter<LongWritable, Text>> context) {

    String key = reader.getKey();

      new LongWritable(Long.valueOf(key)),
      new Text(reader.getValue().toString())

(I actually don't really like the way I'm using Context here, but you
get the idea. It would be nice if the Reader / Writer thing was
represented symmetrically.)

This means we have a tangible single entity that encapsulates the
relationship between a type and how it is serialized and allows for
nicer parameterization of shared types between serialization formats.

Maybe Avro{Reader,Writer} and Writable{Reader,Writer} extend or
implement a common Serialization{Reader,Writer} that formats can
reference and to which they can defer the process of actual
serialization. This decouples formats from specific serialization
schemes and makes map / reduce implementation far more explicit (as well
as deals with Doug's case of a type that implements multiple
serialization schemes as well as the case where a serializable type need
not implement any known interface such as standard Java types).

Of course, it's a huge API change, comparatively, so I don't think this
does much for Owen's concerns. How best this is conveyed during
Configuration is unclear to me. This is why I'm raising this on list
rather than in the JIRA issue. I don't know that this solves the core
concerns and could be tangential and intractable given the internals /
effect on user facing APIs.

Just my $0.02USD (subject to market fluctuation). Flame away.

Eric Sammer

View raw message