crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Thoughts on supporting HBase 0.96
Date Wed, 16 Oct 2013 15:20:55 GMT
Ok, makes sense. And yeah, going from a Put to bytes and then back to a
Put in order to write to HBase doesn't sound too awesome.


On Wed, Oct 16, 2013 at 5:10 PM, Josh Wills <josh.wills@gmail.com> wrote:

> On Wed, Oct 16, 2013 at 8:02 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
> > On Wed, Oct 16, 2013 at 4:34 PM, Josh Wills <jwills@cloudera.com> wrote:
> >
> > > On Wed, Oct 16, 2013 at 12:15 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
> > >
> > > > Wouldn't a derived PType (like in o.a.c.types.PTypes) be a better fit here?
> > > >
> > >
> > > That was my initial attempt, and in an ideal world, my preferred solution--
> > > but I haven't figured out how to make it work. The question here is: what
> > > do I derive a KeyValue object to? What I really want, for purposes of
> > > reading it/writing it to one of our HBase IO formats, is to map it to
> > > itself, and not some subclass of Writable. Another option might be an
> > > extension of WritableType to handle these special case formats-- I'll take
> > > a crack at getting that to work.
> > >
> >
> > I'm sure I'm just missing something obvious, but I don't totally get it.
> > What I had in my head is that KeyValue, Put, Delete, Result, etc could all
> > be derived to byte arrays, with the KeyValueSerialization,
> > MutationSerialization, and ResultSerialization classes being used in the
> > MapFns within the derived PType to go between the type and its byte
> > representation, i.e.
> >
> >    public static PType<KeyValue> keyValue(PTypeFamily ptf) {
> >       return ptf.derived(
> >          KeyValue.class,
> >          BYTES_TO_KEYVALUE_VIA_KVSERIALIZATION,
> >          KEYVALUE_TO_BYTES_VIA_KVSERIALIZATION,
> >          ptf.bytes());
> >    }
> >
> > I'm guessing this is the same thing you're talking about, which I assume
> > means that I'm missing something simple as to why that wouldn't just work,
> > but I'm not sure what it is that I'm missing.
> >
> >
> The rub is the Input and Output formats, which don't expect bytes-- they
> expect either subclasses of the Mutation interface (Put or Delete), or
> KeyValue (for HFile) or Result (for HTable) inputs. So we would need to
> change the input and output formats so that they would take in bytes as
> arguments and then convert them back to the objects that the HBase APIs
> expect, so something like:
>
> getOutputMapFn() -> OutputFormat
> Put -> bytes() -> Put
>
> That isn't the end of the world, it's just a little odd. We'd need to do
> something similar on the Input format side as well, so like:
>
> InputFormat -> getInputMapFn()
> Result -> bytes() -> Result
>
>
>
> >
>
> > >
> > >
> > > > A whole new PTypeFamily sounds like a lot of work (unless maybe if it
> > > > was a subclass of one of the existing ones), and I think there's still
> > > > a fair bit of code that assumes that Avro & Writable are the only two
> > > > possible PTypeFamily implementations.
> > > >
> > >
> > > For any kind of intermediate processing, that is still true. The
> > > HBaseTypeFamily would only ever really appear at the input or output for
> > > a job.
> > >
> > >
> > True, although of course it would be nice if we wouldn't have that
> > limitation.
> >
> > - Gabriel
> >
>
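
The "Put -> bytes() -> Put" round trip described in the thread can be
sketched in plain Java. This is only an illustration under stated
assumptions, not Crunch or HBase code: FakePut, DerivedBytesType, and
RoundTripSketch are hypothetical stand-ins for Put, a derived PType over
bytes, and an HBase output format, and the length-prefixed encoding is
invented for the example rather than taken from MutationSerialization.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical stand-in for an HBase Put: a row key plus a single value.
final class FakePut {
    final byte[] row;
    final byte[] value;

    FakePut(byte[] row, byte[] value) {
        this.row = row;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FakePut)) {
            return false;
        }
        FakePut other = (FakePut) o;
        return Arrays.equals(row, other.row) && Arrays.equals(value, other.value);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(row) + Arrays.hashCode(value);
    }
}

// Hypothetical stand-in for a derived type: two functions between T and bytes.
final class DerivedBytesType<T> {
    private final Function<byte[], T> inputFn;
    private final Function<T, byte[]> outputFn;

    DerivedBytesType(Function<byte[], T> inputFn, Function<T, byte[]> outputFn) {
        this.inputFn = inputFn;
        this.outputFn = outputFn;
    }

    T fromBytes(byte[] bytes) {
        return inputFn.apply(bytes);
    }

    byte[] toBytes(T value) {
        return outputFn.apply(value);
    }
}

final class RoundTripSketch {
    // Assumed encoding for this sketch: [rowLen][row][valueLen][value].
    static byte[] encode(FakePut put) {
        ByteBuffer buf = ByteBuffer.allocate(8 + put.row.length + put.value.length);
        buf.putInt(put.row.length).put(put.row);
        buf.putInt(put.value.length).put(put.value);
        return buf.array();
    }

    static FakePut decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] row = new byte[buf.getInt()];
        buf.get(row);
        byte[] value = new byte[buf.getInt()];
        buf.get(value);
        return new FakePut(row, value);
    }

    static DerivedBytesType<FakePut> fakePutType() {
        return new DerivedBytesType<>(RoundTripSketch::decode, RoundTripSketch::encode);
    }

    // Stand-in for an output format that accepts the object itself, not bytes.
    static String write(FakePut put) {
        return "wrote row " + new String(put.row, StandardCharsets.UTF_8);
    }
}
```

With these stand-ins, the output side would call
fakePutType().toBytes(put) inside the pipeline and then
RoundTripSketch.write(fakePutType().fromBytes(bytes)) at the format
boundary, which is the extra decode step the thread calls "a little odd".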
