hbase-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date Mon, 19 May 2014 15:06:41 GMT
On Mon, May 19, 2014 at 6:31 AM, Andrew Purtell <andrew.purtell@gmail.com> wrote:

> So if I can summarize this thread so far, we are going to try and hammer
> out a types encoding spec agreeable to HBase, Phoenix, and Kite alike? As
> opposed to selecting a particular implementation today as both spec and
> reference implementation. Is that correct?
>

That is the goal. We chatted and posted notes from the discussion last
week, and I believe we only have a few items left to iron out (how to encode
and handle "complex primitives" like dates and decimals).

>
> If so, that sounds like a promising direction. The HBase types library has
> the flexibility, if I understand Nick correctly, to accommodate whatever is
> agreed upon, and we could then provide a reference implementation as a
> service for HBase users (or anyone). There would be no strings attached;
> multiple implementations of the spec would interoperate by definition.
>
>
I'll be working on a prototype in the next few weeks that integrates Phoenix
with a slice of the newly proposed encodings and tries to use the DataType
API.


>
> > On May 19, 2014, at 3:20 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >
> > On Thu, May 15, 2014 at 9:32 AM, James Taylor <jtaylor@salesforce.com>
> > wrote:
> >
> >> @Nick - I like the abstraction of the DataType, but that doesn't solve the
> >> problem for non-Java usage.
> >
> >
> > That's true. It's very much a Java construct. Likewise, Struct only codes
> > for semantics; there's no encoding defined there. For correct
> > multi-language support, we'll need to define these semantics the same way
> > we do the encoding details so that implementations can reproduce them
> > faithfully.
> >
> >> I'm also a bit worried that it might become a bottleneck for implementors
> >> of the serialization spec, as there are many different platform-specific
> >> operations that will likely be done on the row key. We can try to get
> >> everything necessary into the DataType interface, but I suspect that
> >> implementors will need to go under the covers at times (rather than waiting
> >> for another release of the module that defines the DataType interface).
> >
> > Time will tell. DataType is just an interface, after all. If there are
> > things it's missing (as there surely are, for Phoenix...), it'll need to be
> > extended locally until these features can be pushed down into HBase. HBase
> > release managers have been faithful to the monthly release train, so I
> > think in practice dependent projects won't have to wait long. I'm content
> > to take this on a case-by-case basis and watch for a trend. Do you have an
> > alternative idea?
> >
> >> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >>
> >>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <rblue@cloudera.com> wrote:
> >>>
> >>>
> >>>> I think there's a little confusion in what we are trying to accomplish.
> >>>> What I want to do is to write a minimal specification for how to store a
> >>>> set of types. I'm not trying to leave much flexibility; what I want is
> >>>> clarity and simplicity.
> >>>
> >>> This is admirable and was my initial goal as well. The trouble is, you
> >>> cannot please everyone, current users and new. So, we decided it was better
> >>> to provide a pluggable framework for extension + some basic implementations
> >>> than to implement a closed system.
> >>>
> >>>> This is similar to OrderedBytes work, but a subset of it. A good example is
> >>>> that while it's possible to use different encodings (avro, protobuf,
> >>>> thrift, ...) it isn't practical for an application to support all of those
> >>>> encodings. So for interoperability between Kite, Phoenix, and others, I
> >>>> want a set of requirements that is as small as possible.
> >>>
> >>> Minimal is good. The surface area of o.a.h.h.types is as large as it is
> >>> because there was always "just one more" type to support or encoding to
> >>> provide.
> >>>
> >>>> To make the requirements small, I used off-the-shelf protobuf [1] plus a
> >>>> small set of memcmp encodings: ints, floats, and binary. That way, we don't
> >>>> have to talk about how to make a memcmp Date in bytes, for example. A Date
> >>>> is an int, which we know how to encode, and we can agree separately on how
> >>>> a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
> >>>> to binary, where the encoding handles sorting and nulls, but not charsets.
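
To illustrate the memcmp idea in the paragraph above, here is a minimal sketch of an order-preserving encoding for a signed 32-bit int. It is not the OrderedBytes format or the spec under discussion; the class name `MemcmpInt` is hypothetical, and the fixed-width, sign-bit-flipped big-endian layout is just one common way to make byte-wise comparison agree with numeric order. Representing a date as days since the unix epoch is likewise only an assumed convention.

```java
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of HBase.
final class MemcmpInt {
    // Encode as 4 big-endian bytes with the sign bit flipped, so that
    // unsigned byte-wise comparison matches signed numeric order.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Reverse of encode: reassemble the int and flip the sign bit back.
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xff) << 24)
                    | ((bytes[1] & 0xff) << 16)
                    | ((bytes[2] & 0xff) << 8)
                    | (bytes[3] & 0xff);
        return flipped ^ 0x80000000;
    }

    public static void main(String[] args) {
        // Assumed convention: a date is the number of days since 1970-01-01.
        int before = (int) LocalDate.of(1969, 12, 31).toEpochDay();
        int after = (int) LocalDate.of(2014, 5, 19).toEpochDay();
        // The encoded bytes sort the same way the dates do.
        System.out.println(
            Arrays.compareUnsigned(encode(before), encode(after)) < 0); // prints true
    }
}
```

With an encoding like this, the byte layout only has to be agreed on once for ints; what the int means (epoch days, Julian day number, and so on) can be settled separately.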
> >>>
> >>> I think you should focus on the primitives you want to support. The
> >>> compound type stuff (ie, "rowkey encodings") is a can of worms because you
> >>> need to support existing users, new users, novice users, and advanced
> >>> users. Hence the interop between the DataType interface and the Struct
> >>> classes. These work together to support all of these use-cases with the
> >>> same basic code. For example, the protobuf encoding of position|wire-type +
> >>> encoded value is easily implemented using Struct.
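
The idea of building a compound rowkey out of component encodings can be sketched as below. This does not use HBase's Struct or DataType classes; `CompoundKeySketch` and its methods are hypothetical names, and the two-field key (a string followed by an int) is an assumed example. The sketch assumes strings never contain a NUL character, since 0x00 is used as the terminator.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not the HBase Struct implementation.
final class CompoundKeySketch {
    // Fixed-width int: big-endian with the sign bit flipped.
    static byte[] encodeInt(int v) {
        int f = v ^ 0x80000000;
        return new byte[] {(byte) (f >>> 24), (byte) (f >>> 16), (byte) (f >>> 8), (byte) f};
    }

    // UTF-8 bytes followed by a 0x00 terminator, so a shorter string
    // sorts before any longer string that it prefixes.
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1); // padded byte is 0x00
    }

    // Concatenate the component encodings in field order.
    static byte[] rowKey(String tenant, int eventId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] first = encodeString(tenant);
        byte[] second = encodeInt(eventId);
        out.write(first, 0, first.length);
        out.write(second, 0, second.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Keys compare by tenant first, then by eventId.
        byte[] a = rowKey("a", 100);
        byte[] b = rowKey("ab", -1);
        System.out.println(Arrays.compareUnsigned(a, b) < 0); // prints true
    }
}
```

Because each component encoding preserves order on its own and marks its own end (fixed width or terminator), plain concatenation keeps the compound key sortable field by field.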
> >>>
> >>> I firmly believe that we cannot dictate rowkey composition. Applications,
> >>> however, are free to implement their own. By using the common DataType
> >>> interface, they can all interoperate.
> >>>
> >>>> This is the largest reason why I didn't include OrderedBytes directly in
> >>>> the spec. For example, OB includes a varint that I don't think is needed. I
> >>>> don't object to its inclusion in OB, but I think it isn't a necessary
> >>>> requirement for implementing this spec.
> >>>
> >>> Again, the surface area is as it is because of community consensus during
> >>> the first phase of implementation. That consensus disagrees with you.
> >>>
> >>>> I think there are 3 things to clear up:
> >>>> 1. What types from OB are not included, and why?
> >>>> 2. Why not use OB-style structs?
> >>>> 3. Why choose protobuf for complex records?
> >>>>
> >>>> Does that sound like a reasonable direction to head with this discussion?
> >>>
> >>> Yes, sounds great!
> >>>
> >>>> As far as the DataType API, I think that works great with what I'm trying
> >>>> to do. We'd build a DataType implementation for the encoding, and the API
> >>>> will let applications handle the underlying encoding. Other encoding
> >>>> strategies can be swapped in as well, if we want to address shortcomings in
> >>>> this one, or have another for a different use case.
> >>>
> >>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
> >>> the target audience of the DataType API.
> >>>
> >>> Thank you for picking back up this baton. It's sat for too long.
> >>>
> >>> -n
> >>>
> >>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
> >>>>
> >>>>> Breaking off hackathon thread.
> >>>>>
> >>>>> The conversation around HBASE-8089 concluded with two points:
> >>>>>  - HBase should provide support for order-preserving encodings while
> >>>>> not dropping support for the existing encoding formats.
> >>>>>  - HBase is not in the business of schema management; that is a
> >>>>> responsibility left to application developers.
> >>>>>
> >>>>> To handle the first point, OrderedBytes is provided. To support the
> >>>>> second, the DataType API is introduced. By introducing this layer
> >>>>> above specific encoding formats, it gives us a hook for plugging in
> >>>>> different implementations and for helper utilities to ship with HBase,
> >>>>> such as HBASE-10091.
> >>>>>
> >>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
> >>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
> >>>>> composed of other types and have different requirements based on where
> >>>>> in the schema they're used. Again, by falling back on the DataType API,
> >>>>> we give application developers an "out" for doing what makes the most
> >>>>> sense for them.
> >>>>>
> >>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
> >>>>> sitting between data encoding and schema expression. It gives the
> >>>>> application implementer, the person managing the schema, enough
> >>>>> flexibility to express the key encoding in terms of the component types.
> >>>>> These components are not limited to the simple primitives already
> >>>>> defined, but can be any DataType implementation. Order preservation is
> >>>>> likely important here.
> >>>>>
> >>>>> For arrays/lists, there's no implementation yet, but you can see how it
> >>>>> might be done if you have a look at Struct. Order preservation may or
> >>>>> may not be important for arrays/lists.
> >>>>>
> >>>>> The situation for maps/dicts is similar to arrays/lists. The one
> >>>>> complication is the case where you want to map to a column family. How
> >>>>> can these APIs support this thing?
> >>>>>
> >>>>> Pojos are a little more complicated. Probably Struct is sufficient for
> >>>>> basic cases, but it doesn't support nice features like versioning --
> >>>>> these are sacrificed in favor of order preservation. Luckily, there are
> >>>>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
> >>>>> Thrift, &c. There's no need to reinvent the wheel here. Application
> >>>>> developers can implement the DataType API backed by their management
> >>>>> tool of choice. I created HBASE-11161 and will post a patch shortly.
> >>>>>
> >>>>> Specific comments about the Hackathon notes inline.
> >>>>>
> >>>>> Thanks,
> >>>>> Nick
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Cloudera, Inc.
> >>
>



-- 
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh
