incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Supporting legacy Mapper and Reducer classes in Crunch
Date Mon, 24 Sep 2012 20:34:02 GMT
One of the ideas that Gabriel mentioned on our last epic architecture
thread has stuck with me: adding support for using pre-existing Mapper
and Reducer classes with the Crunch APIs, so that you could do
something like:

pipeline.read(From.tableSource(...))
  .parallelDo(new SomeDoFn(), ...)
  .parallelDo(mapperFn(Mapper.class), ...)
  .groupByKey()
  .parallelDo(reducerFn(Reducer.class), ...)
  .parallelDo(new OtherDoFn(), ...)
  .write(To.tableTarget(...));

This turns out to be kind of tricky to do no matter how we approach
the problem, because for this to work, we'll need to (at a minimum)
subclass the Mapper.Context and Reducer.Context classes that are
passed to the Mapper and Reducer instances, and they have different
implementations (most importantly for our purposes, different
constructors) under Hadoop 1 and 2.
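
To make the shape of the problem concrete, here is a minimal sketch of
the adapter idea. It is not Crunch or Hadoop code: Context, LegacyMapper,
Emitter, and DoFn below are hypothetical stand-in interfaces invented for
illustration, and mapperFn here takes an instance rather than a class.
The real Mapper.Context is a class rather than an interface, so the
adapter would have to subclass it and call a version-specific
constructor, which is the part this sketch sidesteps.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LegacyMapperSketch {

    // Hypothetical stand-in for the write side of Mapper.Context.
    interface Context<K, V> {
        void write(K key, V value);
    }

    // Hypothetical stand-in for a legacy Mapper; it only talks to its Context.
    interface LegacyMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, Context<KO, VO> context);
    }

    // Hypothetical stand-ins for Crunch's Emitter and DoFn.
    interface Emitter<T> {
        void emit(T value);
    }

    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    // Wraps a legacy mapper so that every context.write(k, v) call is
    // forwarded to the emitter as a key/value pair.
    static <KI, VI, KO, VO> DoFn<Map.Entry<KI, VI>, Map.Entry<KO, VO>> mapperFn(
            LegacyMapper<KI, VI, KO, VO> mapper) {
        return (input, emitter) -> mapper.map(
                input.getKey(),
                input.getValue(),
                (k, v) -> emitter.emit(new SimpleImmutableEntry<>(k, v)));
    }

    // Runs a toy tokenizing mapper through the adapter and collects its output.
    static List<String> demo() {
        LegacyMapper<Long, String, String, Integer> tokenizer = (offset, line, ctx) -> {
            for (String word : line.split(" ")) {
                ctx.write(word, 1);
            }
        };
        DoFn<Map.Entry<Long, String>, Map.Entry<String, Integer>> fn = mapperFn(tokenizer);
        List<String> out = new ArrayList<>();
        fn.process(new SimpleImmutableEntry<>(0L, "a b"), e -> out.add(e.toString()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With an interface, the forwarding context is a one-line lambda. The
difficulty described above comes from replacing that lambda with a
subclass whose constructor differs between Hadoop versions.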

It feels to me that what I need to do is create a separate subproject
that has to do some crazy stuff (e.g., use different source
directories depending on the value of the crunch.platform variable) in
order to be able to create the appropriate kind of subclass of
Mapper.Context or Reducer.Context. But this sort of thing seems like
such a bad idea that there must be some sort of less-bad option
available to me, and I wanted to solicit input before I start tilting
at this particular windmill.

Thanks!
Josh

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
