hbase-dev mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date Tue, 20 May 2014 10:40:52 GMT
That's correct Andy. We're locking down the "default" primitive type
implementations going forward, while maintaining a flexible API such that
we can support existing users who want to migrate to the applicable new
features without rewriting existing data. Obviously some of those features
will depend on the new encoding semantics, but I think we can offer a net
improvement even for existing applications.


On Mon, May 19, 2014 at 6:31 AM, Andrew Purtell <andrew.purtell@gmail.com> wrote:

> So if I can summarize this thread so far, we are going to try and hammer
> out a types encoding spec agreeable to HBase, Phoenix, and Kite alike, as
> opposed to selecting a particular implementation today as both spec and
> reference implementation. Is that correct?
>
> If so, that sounds like a promising direction. The HBase types library has
> the flexibility, if I understand Nick correctly, to accommodate whatever is
> agreed upon and we could then provide a reference implementation as a
> service for HBase users (or anyone) but there would be no strings attached,
> multiple implementations of the spec would interoperate by definition.
>
>
> > On May 19, 2014, at 3:20 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >
> > On Thu, May 15, 2014 at 9:32 AM, James Taylor <jtaylor@salesforce.com> wrote:
> >
> >> @Nick - I like the abstraction of the DataType, but that doesn't solve the
> >> problem for non-Java usage.
> >
> >
> > That's true. It's very much a Java construct. Likewise, Struct only codes
> > for semantics; there's no encoding defined there. For correct
> > multi-language support, we'll need to define these semantics the same way
> > we do the encoding details so that implementations can reproduce them
> > faithfully.
> >
> >> I'm also a bit worried that it might become a bottleneck for implementors
> >> of the serialization spec as there are many different platform specific
> >> operations that will likely be done on the row key. We can try to get
> >> everything necessary in the DataType interface, but I suspect that
> >> implementors will need to go under-the-covers at times (rather than waiting
> >> for another release of the module that defines the DataType interface) -
> >> might become a bottleneck.
> >
> > Time will tell. DataType is just an interface, after all. If there are
> > things it's missing (as there surely are, for Phoenix...), it'll need to be
> > extended locally until these features can be pushed down into HBase. HBase
> > release managers have been faithful to the monthly release train, so I
> > think in practice dependent projects won't have to wait long. I'm content
> > to take this on a case-by-case basis and watch for a trend. Do you have an
> > alternative idea?
> >
> >> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
> >>
> >>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <rblue@cloudera.com> wrote:
> >>>
> >>>
> >>>> I think there's a little confusion in what we are trying to accomplish.
> >>>> What I want to do is to write a minimal specification for how to store a
> >>>> set of types. I'm not trying to leave much flexibility, what I want is
> >>>> clarity and simplicity.
> >>>
> >>> This is admirable and was my initial goal as well. The trouble is, you
> >>> cannot please everyone, current users and new. So, we decided it was better
> >>> to provide a pluggable framework for extension + some basic implementations
> >>> than to implement a closed system.
> >>>
> >>>> This is similar to OrderedBytes work, but a subset of it. A good example is
> >>>> that while it's possible to use different encodings (avro, protobuf,
> >>>> thrift, ...) it isn't practical for an application to support all of those
> >>>> encodings. So for interoperability between Kite, Phoenix, and others, I
> >>>> want a set of requirements that is as small as possible.
> >>>
> >>> Minimal is good. The surface area of o.a.h.h.types is as large as it is
> >>> because there was always "just one more" type to support or encoding to
> >>> provide.
> >>>
> >>>> To make the requirements small, I used off-the-shelf protobuf [1] plus a
> >>>> small set of memcmp encodings: ints, floats, and binary. That way, we don't
> >>>> have to talk about how to make a memcmp Date in bytes, for example. A Date
> >>>> is an int, which we know how to encode, and we can agree separately on how
> >>>> a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
> >>>> to binary, where the encoding handles sorting and nulls, but not charsets.
> >>>
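[Editor's sketch] The memcmp idea for ints discussed above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the encoding from Ryan's spec and not the OrderedBytes wire format: it assumes a fixed-width 32-bit int written big-endian with the sign bit flipped, and the class name MemcmpInt32 and its methods are hypothetical. Under this approach a Date needs no byte format of its own; whatever int mapping is agreed on (Julian day or unix epoch) would simply be passed to encode.

```java
// Hypothetical helper, not part of HBase: a sketch of a memcmp-ordered int32.
final class MemcmpInt32 {

    // Encode as 4 big-endian bytes with the sign bit flipped, so that an
    // unsigned byte-by-byte comparison agrees with signed integer order.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Reverse of encode: reassemble the big-endian int and restore the sign bit.
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xff) << 24)
                    | ((bytes[1] & 0xff) << 16)
                    | ((bytes[2] & 0xff) << 8)
                    | (bytes[3] & 0xff);
        return flipped ^ 0x80000000;
    }

    // memcmp-style comparison: bytes are compared as unsigned values and a
    // shorter array sorts first when it is a prefix of the longer one.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] ascending = {Integer.MIN_VALUE, -1, 0, 1, 16_000, Integer.MAX_VALUE};
        for (int i = 1; i < ascending.length; i++) {
            byte[] previous = encode(ascending[i - 1]);
            byte[] current = encode(ascending[i]);
            if (compareBytes(previous, current) >= 0) {
                throw new AssertionError("byte order does not match int order");
            }
            if (decode(current) != ascending[i]) {
                throw new AssertionError("round trip failed");
            }
        }
        System.out.println("encoded ints sort in numeric order");
    }
}
```

Without the sign-bit flip, a plain two's-complement big-endian int would sort negative values after positive ones under byte comparison, which is the problem an order-preserving encoding has to solve.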
> >>> I think you should focus on the primitives you want to support. The
> >>> compound type stuff (i.e., "rowkey encodings") is a can of worms because you
> >>> need to support existing users, new users, novice users, and advanced
> >>> users. Hence the interop between the DataType interface and the Struct
> >>> classes. These work together to support all of these use-cases with the
> >>> same basic code. For example, the protobuf encoding of position|wire-type +
> >>> encoded value is easily implemented using Struct.
> >>>
> >>> I firmly believe that we cannot dictate rowkey composition. Applications,
> >>> however, are free to implement their own. By using the common DataType
> >>> interface, they can all interoperate.
> >>>
> >>>> This is the largest reason why I didn't include OrderedBytes directly in
> >>>> the spec. For example, OB includes a varint that I don't think is needed. I
> >>>> don't object to its inclusion in OB, but I think it isn't a necessary
> >>>> requirement for implementing this spec.
> >>>
> >>> Again, the surface area is as it is because of community consensus during
> >>> the first phase of implementation. That consensus disagrees with you.
> >>>
> >>>> I think there are 3 things to clear up:
> >>>> 1. What types from OB are not included, and why?
> >>>> 2. Why not use OB-style structs?
> >>>> 3. Why choose protobuf for complex records?
> >>>>
> >>>> Does that sound like a reasonable direction to head with this discussion?
> >>>
> >>> Yes, sounds great!
> >>>
> >>>> As far as the DataType API, I think that works great with what I'm trying
> >>>> to do. We'd build a DataType implementation for the encoding and the API
> >>>> will let applications handle the underlying encoding. And other encoding
> >>>> strategies can be swapped in as well, if we want to address shortcomings in
> >>>> this one, or have another for a different use case.
> >>>
> >>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
> >>> the target audience of the DataType API.
> >>>
> >>> Thank you for picking back up this baton. It's sat for too long.
> >>>
> >>> -n
> >>>
> >>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
> >>>>
> >>>>> Breaking off hackathon thread.
> >>>>>
> >>>>> The conversation around HBASE-8089 concluded with two points:
> >>>>>  - HBase should provide support for order-preserving encodings while
> >>>>> not dropping support for the existing encoding formats.
> >>>>>  - HBase is not in the business of schema management; that is a
> >>>>> responsibility left to application developers.
> >>>>>
> >>>>> To handle the first point, OrderedBytes is provided. To support the
> >>>>> second, the DataType API is introduced. By introducing this layer
> >>>>> above specific encoding formats, it gives us a hook for plugging in
> >>>>> different implementations and for helper utilities to ship with HBase,
> >>>>> such as HBASE-10091.
> >>>>>
> >>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
> >>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
> >>>>> composed of other types and have different requirements based on where
> >>>>> in the schema they're used. Again, by falling back on the DataType API,
> >>>>> we give application developers an "out" for doing what makes the most
> >>>>> sense for them.
> >>>>>
> >>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
> >>>>> sitting between data encoding and schema expression. It gives the
> >>>>> application implementer, the person managing the schema, enough
> >>>>> flexibility to express the key encoding in terms of the component types.
> >>>>> These components are not limited to the simple primitives already
> >>>>> defined, but any DataType implementation. Order preservation is likely
> >>>>> important here.
> >>>>>
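[Editor's sketch] The idea of expressing a compound rowkey in terms of component types can be illustrated roughly as below. This is a simplified sketch, not the HBase Struct or DataType API: FieldType, KeyStruct, INT32, and TEXT are hypothetical names, and the byte layout (sign-flipped big-endian ints, UTF-8 text with a 0x00 terminator) is an assumption chosen only so that the concatenated key sorts field by field.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: FieldType and KeyStruct are illustrative names, not the
// HBase DataType and Struct classes, and this byte layout is an assumption.
final class CompoundKeySketch {

    /** Appends an order-preserving encoding of one key component. */
    interface FieldType<T> {
        void encode(ByteArrayOutputStream dst, T value);
    }

    /** Signed int as 4 big-endian bytes with the sign bit flipped. */
    static final FieldType<Integer> INT32 = (dst, value) -> {
        int flipped = value ^ 0x80000000;
        dst.write(flipped >>> 24);
        dst.write(flipped >>> 16);
        dst.write(flipped >>> 8);
        dst.write(flipped);
    };

    /** UTF-8 text followed by a 0x00 terminator; NUL characters are rejected. */
    static final FieldType<String> TEXT = (dst, value) -> {
        if (value.indexOf('\0') >= 0) {
            throw new IllegalArgumentException("NUL is reserved as the terminator");
        }
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        dst.write(utf8, 0, utf8.length);
        dst.write(0);
    };

    /** A rowkey layout: an ordered list of component types. */
    static final class KeyStruct {
        private final List<FieldType<?>> fields;

        KeyStruct(FieldType<?>... fields) {
            this.fields = Arrays.asList(fields);
        }

        // Concatenates the encodings of each component in declaration order.
        @SuppressWarnings("unchecked")
        byte[] encode(Object... values) {
            if (values.length != fields.size()) {
                throw new IllegalArgumentException("expected " + fields.size() + " values");
            }
            ByteArrayOutputStream dst = new ByteArrayOutputStream();
            for (int i = 0; i < values.length; i++) {
                ((FieldType<Object>) fields.get(i)).encode(dst, values[i]);
            }
            return dst.toByteArray();
        }
    }

    // memcmp-style comparison of two keys, treating bytes as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        KeyStruct key = new KeyStruct(TEXT, INT32);
        byte[] first = key.encode("us", -3);
        byte[] second = key.encode("us", 2);
        byte[] third = key.encode("usa", -1);
        if (compareBytes(first, second) >= 0 || compareBytes(second, third) >= 0) {
            throw new AssertionError("keys are not ordered field by field");
        }
        System.out.println("compound keys sort by text, then by int");
    }
}
```

Because the layout is just a list of component codecs, an application could swap in a different codec for any position without changing how the key is assembled, which is the kind of flexibility the paragraph above attributes to Struct.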
> >>>>> For arrays/lists, there's no implementation yet, but you can see how it
> >>>>> might be done if you have a look at Struct. Order preservation may or
> >>>>> may not be important for arrays/lists.
> >>>>>
> >>>>> The situation for maps/dicts is similar to arrays/lists. The one
> >>>>> complication is the case where you want to map to a column family. How
> >>>>> can these APIs support this thing?
> >>>>>
> >>>>> Pojos are a little more complicated. Probably Struct is sufficient for
> >>>>> basic cases, but it doesn't support nice features like versioning --
> >>>>> these are sacrificed in favor of order preservation. Luckily, there's
> >>>>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
> >>>>> Thrift, &c. There's no need to reinvent the wheel here. Application
> >>>>> developers can implement the DataType API backed by their management
> >>>>> tool of choice. I created HBASE-11161 and will post a patch shortly.
> >>>>>
> >>>>> Specific comments about the Hackathon notes inline.
> >>>>>
> >>>>> Thanks,
> >>>>> Nick
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Cloudera, Inc.
> >>
>
