hbase-dev mailing list archives

From Ryan Blue <rb...@cloudera.com>
Subject Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date Tue, 13 May 2014 22:35:55 GMT
Hi Nick,

Thanks for taking the time to look at this closely; it's great to see
this discussion happening in depth.

I think there's a little confusion in what we are trying to accomplish. 
What I want to do is to write a minimal specification for how to store a 
set of types. I'm not trying to leave much flexibility; what I want is
clarity and simplicity.

This is similar to OrderedBytes work, but a subset of it. A good example 
is that while it's possible to use different encodings (avro, protobuf, 
thrift, ...) it isn't practical for an application to support all of 
those encodings. So for interoperability between Kite, Phoenix, and 
others, I want a set of requirements that is as small as possible.

To make the requirements small, I used off-the-shelf protobuf [1] plus a 
small set of memcmp encodings: ints, floats, and binary. That way, we 
don't have to talk about how to make a memcmp Date in bytes, for 
example. A Date is an int, which we know how to encode, and we can agree 
separately on how a Date is represented (e.g., Julian vs unix epoch).
[2] The same applies to binary, where the encoding handles sorting and 
nulls, but not charsets.
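
To make the "a Date is an int" point concrete, here is a minimal sketch of one common way to get a memcmp-sortable int: write it big-endian with the sign bit flipped. This is an illustration only, not the byte layout of the proposed spec or of OrderedBytes. The class and method names (MemcmpIntSketch, encodeInt, encodeDate) are hypothetical, and days since the unix epoch is just an assumed Date representation.

```java
import java.time.LocalDate;

// Hypothetical sketch: a fixed-width, memcmp-sortable int encoding, with a
// Date layered on top of it. Not the encoding from the spec or OrderedBytes.
final class MemcmpIntSketch {

    // Big-endian bytes with the sign bit flipped, so that negative values
    // sort before positive ones under unsigned byte comparison.
    static byte[] encodeInt(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    static int decodeInt(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                    | ((bytes[1] & 0xFF) << 16)
                    | ((bytes[2] & 0xFF) << 8)
                    | (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Unsigned lexicographic comparison, the way memcmp orders byte strings.
    static int memcmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    // Assumed representation: a Date is the int count of days since the
    // unix epoch. The byte encoding itself knows nothing about dates.
    static byte[] encodeDate(LocalDate date) {
        return encodeInt(Math.toIntExact(date.toEpochDay()));
    }

    static LocalDate decodeDate(byte[] bytes) {
        return LocalDate.ofEpochDay(decodeInt(bytes));
    }

    public static void main(String[] args) {
        byte[] before = encodeDate(LocalDate.of(1969, 12, 31));
        byte[] after = encodeDate(LocalDate.of(2014, 5, 13));
        System.out.println("byte order matches date order: " + (memcmp(before, after) < 0));
    }
}
```

The only part that would need separate agreement is the Date-to-int mapping in encodeDate; the sorting behavior comes entirely from the int encoding.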

This is the largest reason why I didn't include OrderedBytes directly in 
the spec. For example, OB includes a varint that I don't think is 
needed. I don't object to its inclusion in OB, but I think it isn't a 
necessary requirement for implementing this spec.

I think there are 3 things to clear up:
1. What types from OB are not included, and why?
2. Why not use OB-style structs?
3. Why choose protobuf for complex records?

Does that sound like a reasonable direction to head with this discussion?

As far as the DataType API, I think that works great with what I'm 
trying to do. We'd build a DataType implementation for the encoding and 
the API will let applications handle the underlying encoding. And other
encoding strategies can be swapped in as well, if we want to address 
shortcomings in this one, or have another for a different use case.
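
A rough sketch of what "swapping in" an encoding could look like. The SimpleType interface below is a hypothetical, much-reduced stand-in written for this example; it is not HBase's actual DataType interface, and SortableInt and EpochDayDate are likewise made-up names. The Date type only depends on "some int encoding", so a different int strategy can be plugged in without touching it.

```java
import java.nio.ByteBuffer;
import java.time.LocalDate;

// Hypothetical, simplified stand-in for a DataType-style abstraction.
// This is NOT the real HBase DataType interface.
interface SimpleType<T> {
    boolean isOrderPreserving();
    byte[] encode(T value);
    T decode(byte[] bytes);
}

// One possible int strategy: big-endian with the sign bit flipped, which
// sorts correctly under unsigned byte comparison.
final class SortableInt implements SimpleType<Integer> {
    public boolean isOrderPreserving() {
        return true;
    }

    public byte[] encode(Integer value) {
        // ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(4).putInt(value ^ 0x80000000).array();
    }

    public Integer decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ 0x80000000;
    }
}

// A Date layered on whatever int encoding it is given. Days since the unix
// epoch is an assumed representation, to be agreed on separately.
final class EpochDayDate implements SimpleType<LocalDate> {
    private final SimpleType<Integer> ints;

    EpochDayDate(SimpleType<Integer> ints) {
        this.ints = ints;
    }

    public boolean isOrderPreserving() {
        // Ordering is inherited from the underlying int encoding.
        return ints.isOrderPreserving();
    }

    public byte[] encode(LocalDate date) {
        return ints.encode(Math.toIntExact(date.toEpochDay()));
    }

    public LocalDate decode(byte[] bytes) {
        return LocalDate.ofEpochDay(ints.decode(bytes));
    }
}
```

An application would construct new EpochDayDate(new SortableInt()) and work with LocalDate values only; replacing SortableInt with another SimpleType<Integer> changes the bytes on disk but not the application code.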


[1]: I think there's some confusion around the protobuf part. I'm saying
we should use standard protobuf so we can reuse existing libraries.
[2]: We also know that a Date can be incremented, for example, because 
an int can be.

On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
> Breaking off hackathon thread.
> The conversation around HBASE-8089 concluded with two points:
>   - HBase should provide support for order-preserving encodings while
> not dropping support for the existing encoding formats.
>   - HBase is not in the business of schema management; that is a
> responsibility left to application developers.
> To handle the first point, OrderedBytes is provided. For supporting
> the second, the DataType API is introduced. By introducing this layer
> above specific encoding formats, it gives us a hook for plugging in
> different implementations and for helper utilities to ship with HBase,
> such as HBASE-10091.
> Things get fuzzy around complex data types: pojos, compound rowkeys (a
> special case of pojo), maps/dicts, and lists/arrays. These types are
> composed of other types and have different requirements based on where
> in the schema they're used. Again, by falling back on the DataType API,
> we give application developers an "out" for doing what makes the most
> sense for them.
> For compound rowkeys, the Struct class is designed to fill in this gap,
> sitting between data encoding and schema expression. It gives the
> application implementer, the person managing the schema, enough
> flexibility to express the key encoding in terms of the component types.
> These components are not limited to the simple primitives already
> defined, but any DataType implementation. Order preservation is likely
> important here.
> For arrays/lists, there's no implementation yet, but you can see how it
> might be done if you have a look at struct. Order preservation may or
> may not be important for arrays/lists.
> The situation for maps/dicts is similar to arrays/lists. The one
> complication is the case where you want to map to a column family. How
> can these APIs support this thing?
> Pojos are a little more complicated. Probably Struct is sufficient for
> basic cases, but it doesn't support nice features like versioning --
> these are sacrificed in favor of order preservation. Luckily, there are
> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
> Thrift, &c. There's no need to reinvent the wheel here. Application
> developers can implement the DataType API backed by their management
> tool of choice. I created HBASE-11161 and will post a patch shortly.
> Specific comments about the Hackathon notes inline.
> Thanks,
> Nick

Ryan Blue
Software Engineer
Cloudera, Inc.
