hbase-dev mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date Thu, 15 May 2014 00:17:34 GMT
On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <rblue@cloudera.com> wrote:


> I think there's a little confusion in what we are trying to accomplish.
> What I want to do is to write a minimal specification for how to store a
> set of types. I'm not trying to leave much flexibility, what I want is
> clarity and simplicity.
>

This is admirable and was my initial goal as well. The trouble is, you
cannot please everyone, current users and new. So, we decided it was better
to provide a pluggable framework for extension + some basic implementations
than to implement a closed system.

> This is similar to OrderedBytes work, but a subset of it. A good example is
> that while it's possible to use different encodings (avro, protobuf,
> thrift, ...) it isn't practical for an application to support all of those
> encodings. So for interoperability between Kite, Phoenix, and others, I
> want a set of requirements that is as small as possible.
>

Minimal is good. The surface area of o.a.h.h.types is as large as it is
because there was always "just one more" type to support or encoding to
provide.

> To make the requirements small, I used off-the-shelf protobuf [1] plus a
> small set of memcmp encodings: ints, floats, and binary. That way, we don't
> have to talk about how to make a memcmp Date in bytes, for example. A Date
> is an int, which we know how to encode, and we can agree separately on how
> a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
> to binary, where the encoding handles sorting and nulls, but not charsets.
>
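
To make the "small set of memcmp encodings" idea concrete, here is a minimal sketch of one common way to get byte-wise-sortable ints and floats. This is a generic technique, not the wire format of OrderedBytes or of the proposed spec; the class and method names (`MemcmpSketch`, `encodeInt`, `encodeFloat`, `memcmp`) are made up for illustration.

```java
final class MemcmpSketch {
    // Write a 32-bit value most-significant byte first.
    static byte[] bigEndian(int u) {
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Signed int: flip the sign bit so negatives sort before positives
    // under unsigned byte-wise comparison.
    static byte[] encodeInt(int v) {
        return bigEndian(v ^ 0x80000000);
    }

    static int decodeInt(byte[] b) {
        int u = ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
              | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
        return u ^ 0x80000000;
    }

    // IEEE-754 float: flip only the sign bit for non-negative values,
    // flip every bit for negative values.
    static byte[] encodeFloat(float f) {
        int bits = Float.floatToIntBits(f);
        bits ^= (bits >> 31) | 0x80000000;
        return bigEndian(bits);
    }

    // memcmp-style comparison: bytes are compared as unsigned values.
    static int memcmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

Under this scheme a Date needs no encoding of its own: once the representation is agreed on (for example, days since the unix epoch), it is passed through `encodeInt` like any other int.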

I think you should focus on the primitives you want to support. The
compound type stuff (i.e., "rowkey encodings") is a can of worms because you
need to support existing users, new users, novice users, and advanced
users. Hence the interop between the DataType interface and the Struct
classes. These work together to support all of these use-cases with the
same basic code. For example, the protobuf encoding of position|wire-type +
encoded value is easily implemented using Struct.
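
For readers unfamiliar with the protobuf layout mentioned here, the following is a rough, self-contained sketch of it: each field is written as a varint key, `(field_number << 3) | wire_type`, followed by the encoded value. It does not use Struct or any HBase class; `ProtoKeySketch` and its methods are hypothetical names used only to show the byte layout.

```java
import java.io.ByteArrayOutputStream;

final class ProtoKeySketch {
    static final int WIRETYPE_VARINT = 0;

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // The field key packs the position (field number) and the wire type.
    static void writeKey(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    // Key followed by a varint-encoded value.
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeKey(out, fieldNumber, WIRETYPE_VARINT);
        writeVarint(out, value);
        return out.toByteArray();
    }
}
```

For field number 1 holding the value 150, `encodeVarintField(1, 150)` yields the three bytes `08 96 01`.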

I firmly believe that we cannot dictate rowkey composition. Applications,
however, are free to implement their own. By using the common DataType
interface, they can all interoperate.
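
A rough sketch of what that interoperation looks like, under heavy simplifying assumptions: `FieldCodec` below is a hypothetical stand-in that is far smaller than the real DataType interface, and it uses `java.nio.ByteBuffer` rather than HBase's byte range classes. The point is only that a compound key is itself a codec built from component codecs, so an application can pick its own composition.

```java
import java.nio.ByteBuffer;

// Hypothetical, stripped-down codec contract (not the HBase DataType API).
interface FieldCodec<T> {
    int encodedLength(T value);
    void encode(ByteBuffer dst, T value);
    T decode(ByteBuffer src);
}

// Fixed-width, order-preserving int: sign bit flipped, big-endian.
final class OrderedIntCodec implements FieldCodec<Integer> {
    public int encodedLength(Integer v) { return 4; }
    public void encode(ByteBuffer dst, Integer v) { dst.putInt(v ^ 0x80000000); }
    public Integer decode(ByteBuffer src) { return src.getInt() ^ 0x80000000; }
}

// A compound key is the concatenation of its fields, in declaration order.
final class CompoundKeyCodec implements FieldCodec<Object[]> {
    private final FieldCodec<Object>[] fields;

    @SuppressWarnings("unchecked")
    CompoundKeyCodec(FieldCodec<?>... fields) {
        this.fields = (FieldCodec<Object>[]) fields;
    }

    public int encodedLength(Object[] values) {
        int n = 0;
        for (int i = 0; i < fields.length; i++) n += fields[i].encodedLength(values[i]);
        return n;
    }

    public void encode(ByteBuffer dst, Object[] values) {
        for (int i = 0; i < fields.length; i++) fields[i].encode(dst, values[i]);
    }

    public Object[] decode(ByteBuffer src) {
        Object[] values = new Object[fields.length];
        for (int i = 0; i < fields.length; i++) values[i] = fields[i].decode(src);
        return values;
    }

    // Convenience helper: encode a value into a right-sized array.
    static <T> byte[] toBytes(FieldCodec<T> codec, T value) {
        ByteBuffer buf = ByteBuffer.allocate(codec.encodedLength(value));
        codec.encode(buf, value);
        return buf.array();
    }

    // Unsigned lexicographic comparison, the order a rowkey scan would see.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

With two `OrderedIntCodec` fields, the keys `{1, -7}`, `{1, 3}`, and `{2, -100}` sort in that order as raw bytes. Variable-length fields would additionally need a terminator or length scheme, which this sketch leaves out.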

> This is the largest reason why I didn't include OrderedBytes directly in
> the spec. For example, OB includes a varint that I don't think is needed. I
> don't object to its inclusion in OB, but I think it isn't a necessary
> requirement for implementing this spec.
>

Again, the surface area is as it is because of community consensus during
the first phase of implementation. That consensus disagrees with you.

> I think there are 3 things to clear up:
> 1. What types from OB are not included, and why?
> 2. Why not use OB-style structs?
> 3. Why choose protobuf for complex records?
>
> Does that sound like a reasonable direction to head with this discussion?
>

Yes, sounds great!

> As far as the DataType API, I think that works great with what I'm trying
> to do. We'd build a DataType implementation for the encoding and the API
> will let applications handle the underlying encoding. And other encoding
> strategies can be swapped in as well, if we want to address shortcomings in
> this one, or have another for a different use case.
>

I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
the target audience of the DataType API.

Thank you for picking back up this baton. It's sat for too long.

-n

> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
>
>> Breaking off hackathon thread.
>>
>> The conversation around HBASE-8089 concluded with two points:
>>   - HBase should provide support for order-preserving encodings while
>> not dropping support for the existing encoding formats.
>>   - HBase is not in the business of schema management; that is a
>> responsibility left to application developers.
>>
>> To handle the first point, OrderedBytes is provided. To support the
>> second, the DataType API is introduced. By introducing this layer
>> above specific encoding formats, it gives us a hook for plugging in
>> different implementations and for helper utilities to ship with HBase,
>> such as HBASE-10091.
>>
>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
>> special case of pojo), maps/dicts, and lists/arrays. These types are
>> composed of other types and have different requirements based on where
>> in the schema they're used. Again, by falling back on the DataType API,
>> we give application developers an "out" for doing what makes the most
>> sense for them.
>>
>> For compound rowkeys, the Struct class is designed to fill in this gap,
>> sitting between data encoding and schema expression. It gives the
>> application implementer, the person managing the schema, enough
>> flexibility to express the key encoding in terms of the component types.
>> These components are not limited to the simple primitives already
>> defined, but any DataType implementation. Order preservation is likely
>> important here.
>>
>> For arrays/lists, there's no implementation yet, but you can see how it
>> might be done if you have a look at Struct. Order preservation may or
>> may not be important for arrays/lists.
>>
>> The situation for maps/dicts is similar to arrays/lists. The one
>> complication is the case where you want to map to a column family. How
>> can these APIs support this thing?
>>
>> Pojos are a little more complicated. Probably Struct is sufficient for
>> basic cases, but it doesn't support nice features like versioning --
>> these are sacrificed in favor of order preservation. Luckily, there are
>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
>> Thrift, &c. There's no need to reinvent the wheel here. Application
>> developers can implement the DataType API backed by their management
>> tool of choice. I created HBASE-11161 and will post a patch shortly.
>>
>> Specific comments about the Hackathon notes inline.
>>
>> Thanks,
>> Nick
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
