crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Thoughts on supporting HBase 0.96
Date Thu, 17 Oct 2013 14:13:07 GMT
On Thu, Oct 17, 2013 at 7:06 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> On Thu, Oct 17, 2013 at 3:38 PM, Josh Wills <jwills@cloudera.com> wrote:
>
> > My feeling is that the consensus here is that adding a new PTypeFamily is a
> > bad idea. :)
> >
> > The other idea I had would be to add a way for the Source and Target to
> > indicate that they were reading input data directly from the Hadoop
> > serialization framework, and thus did not need the input/output PTypes to
> > perform any additional transforms via getInputMapFn/getOutputMapFn. We
> > would still need different PTypes for working with the HBase objects (along
> > the lines that Gabriel mentioned earlier in the thread), but this approach
> > would solve the core issue w/o requiring a new PTypeFamily.
> >
>
> Is there any chance that this is just as simple as (re)implementing the
> getConverter method in the HBase-related Source and Target impls?
>

I hope it's only marginally more complicated than that-- the calls to
PType.getInputMapFn and PType.getOutputMapFn happen inside of the planner,
so we'll need an option that controls whether or not they are applied for a
given Source/Target.
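
To make the idea concrete, here is a minimal, self-contained sketch of such an
option. It deliberately does not use the Crunch API: Row, SketchPType,
SketchSource, the readsNativeObjects flag, and planRead are all hypothetical
names invented for illustration, and the real planner change would likely look
different. The sketch only shows the decision itself: when a source declares
that it already yields the final objects, the planner-side step skips the
PType's input map function.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class DirectReadSketch {

    /** Hypothetical stand-in for an HBase object such as Put or Result. */
    static final class Row {
        final String key;
        Row(String key) { this.key = key; }
    }

    /** Hypothetical stand-in for a PType; only the input map function is modeled. */
    static final class SketchPType {
        int conversions = 0; // counts how often the map function actually ran

        // Converts the raw byte form back into a Row, as a derived PType would.
        final Function<Object, Row> inputMapFn = raw -> {
            conversions++;
            return new Row(new String((byte[]) raw, StandardCharsets.UTF_8));
        };
    }

    /** Hypothetical stand-in for a Source carrying the proposed opt-out flag. */
    static final class SketchSource {
        final SketchPType ptype;
        final boolean readsNativeObjects; // assumed flag name, not a Crunch API

        SketchSource(SketchPType ptype, boolean readsNativeObjects) {
            this.ptype = ptype;
            this.readsNativeObjects = readsNativeObjects;
        }
    }

    /** Mimics the planner's choice for one record coming out of a source. */
    static Row planRead(SketchSource source, Object record) {
        if (source.readsNativeObjects) {
            return (Row) record; // already the final object: skip the map function
        }
        return source.ptype.inputMapFn.apply(record);
    }

    public static void main(String[] args) {
        SketchPType ptype = new SketchPType();
        Row viaBytes = planRead(new SketchSource(ptype, false),
                "r1".getBytes(StandardCharsets.UTF_8));
        Row direct = planRead(new SketchSource(ptype, true), new Row("r2"));
        // Only the byte-backed read should have triggered a conversion.
        System.out.println(viaBytes.key + " " + direct.key
                + " conversions=" + ptype.conversions);
    }
}
```

The output side would presumably mirror this, with the same kind of flag on the
Target deciding whether the output map function runs before the record reaches
the OutputFormat.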

> > On Wed, Oct 16, 2013 at 7:17 PM, Micah Whitacre <mkwhit@gmail.com> wrote:
> >
> > > If we created a new PTypeFamily we'd need to build in support to the Avros
> > > (and possibly Writables) class to support wrapping the HBaseTypeFamily
> > > types.
> > >
> > >
> > > On Wed, Oct 16, 2013 at 10:20 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> > >
> > > > Ok, makes sense. And yeah, going from a Put to bytes and then back to a
> > > > Put in order to write to HBase doesn't sound too awesome.
> > > >
> > > >
> > > > On Wed, Oct 16, 2013 at 5:10 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > > >
> > > > > On Wed, Oct 16, 2013 at 8:02 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> > > > >
> > > > > > On Wed, Oct 16, 2013 at 4:34 PM, Josh Wills <jwills@cloudera.com> wrote:
> > > > > >
> > > > > > > On Wed, Oct 16, 2013 at 12:15 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> > > > > > >
> > > > > > > > Wouldn't a derived PType (like in o.a.c.types.PTypes) be a better fit
> > > > > > > > here?
> > > > > > > >
> > > > > > >
> > > > > > > That was my initial attempt, and in an ideal world, my preferred
> > > > > > > solution-- but I haven't figured out how to make it work. The question
> > > > > > > here is: what do I derive a KeyValue object to? What I really want, for
> > > > > > > purposes of reading it/writing it to one of our HBase IO formats, is to
> > > > > > > map it to itself, and not some subclass of Writable. Another option
> > > > > > > might be an extension of WritableType to handle these special case
> > > > > > > formats-- I'll take a crack at getting that to work.
> > > > > > >
> > > > > >
> > > > > > I'm sure I'm just missing something obvious, but I don't totally get it.
> > > > > > What I had in my head is that KeyValue, Put, Delete, Result, etc could
> > > > > > all be derived to byte arrays, with the KeyValueSerialization,
> > > > > > MutationSerialization, and ResultSerialization classes being used in
> > > > > > the MapFns within the derived PType to go between the type and its byte
> > > > > > representation, i.e.
> > > > > >
> > > > > >    public static PType<KeyValue> keyValue(PTypeFamily ptf) {
> > > > > >       return ptf.derived(
> > > > > >          KeyValue.class,
> > > > > >          BYTES_TO_KEYVALUE_VIA_KVSERIALIZATION,
> > > > > >          KEYVALUE_TO_BYTES_VIA_KVSERIALIZATION,
> > > > > >          ptf.bytes());
> > > > > >    }
> > > > > >
> > > > > > I'm guessing this is the same thing you're talking about, which I assume
> > > > > > means that I'm missing something simple as to why that wouldn't just
> > > > > > work, but I'm not sure what it is that I'm missing.
> > > > > >
> > > > > >
> > > > > The rub is the Input and Output formats, which don't expect bytes--
> > > > > they expect either subclasses of the Mutation class (Put or Delete),
> > > > > or KeyValue (for HFile) or Result (for HTable) inputs. So we would need
> > > > > to change the input and output formats so that they would take in bytes
> > > > > as arguments and then convert them back to the objects that the HBase
> > > > > APIs expect, so something like:
> > > > >
> > > > > getOutputMapFn() -> OutputFormat
> > > > > Put -> bytes() -> Put
> > > > >
> > > > > That isn't the end of the world, it's just a little odd. We'd need to do
> > > > > something similar on the Input format side as well, so like:
> > > > >
> > > > > InputFormat -> getInputMapFn()
> > > > > Result -> bytes() -> Result
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > A whole new PTypeFamily sounds like a lot of work (unless maybe if it
> > > > > > > > was a subclass of one of the existing ones), and I think there's still
> > > > > > > > a fair bit of code that assumes that Avro & Writable are the only two
> > > > > > > > possible PTypeFamily implementations.
> > > > > > > >
> > > > > > >
> > > > > > > For any kind of intermediate processing, that is still true. The
> > > > > > > HBaseTypeFamily would only ever really appear at the input or output
> > > > > > > for a job.
> > > > > > >
> > > > > > >
> > > > > > True, although of course it would be nice if we wouldn't have that
> > > > > > limitation.
> > > > > >
> > > > > > - Gabriel
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
